4.4. Computational Issues
In the foregoing discussion, we have concentrated primarily
on conceptual issues without paying much attention to a
number of computational issues that must be dealt with
in actually computing CCFs ^{94, 92, 87}.
- Edge effects. Only at zero lag do all the points
of the time series enter into the calculation of the
correlation coefficient; at any other lag, points at
the ends of the time series drop out of the calculation
since there are no points in the other series to
which they can be matched (see
Fig. 24).
This means that at larger and larger lags, fewer and
fewer points contribute to the calculation of r.
This has two significant consequences:
- Normalization. Correct normalization of the
CCF requires use of the correct value of the mean and
standard deviation for each series, as can be seen from
the definition of the correlation coefficient r
(Eq. (31)). Since AGN time series are limited
in duration, as points near the edges drop out of the
calculation, the mean and standard deviation of the
series change; in statistical language, this means
that the series are "non-stationary." Thus, the mean
and standard deviation of the series need to be
recalculated for every lag, using only the data points
that actually contribute to the calculation. In the
case where interpolated values from one series are
used, the mean and standard deviation should be those
of the interpolated points, not the original points.
- Significance. Once the peak value of the
CCF r_{peak} has been found, we want to know
whether or not the correlation found is
"statistically significant", i.e., is it likely
to be real or spurious? The statistical significance
of the correlation depends on the number of points
that contribute to the calculation of r_{peak},
not on the total number of points in the series.
Suppose, for example, that we have two time series
consisting of N = 50 points, but that the maximum
value of the CCF (say, r_{peak} = 0.45)
occurs at such a large lag that only N = 30 of the points are
actually contributing at this lag. For a linear correlation
coefficient of r = 0.45 and N = 30, the correlation is
significant at about the 98% confidence level. However,
if we erroneously use N = 50, we would conclude that the level of
significance is about 99.9%, clearly a major difference.
If we fail to account for the correct number of
points contributing to the calculation of r_{peak}
and simply use the number of points in the series, we will
overestimate the significance of the correlations we detect.
- Interpolation scheme. There are a couple of
issues that arise in this regard:
- Which series? In the examples shown in
Fig. 24, we have interpolated in the
emission-line light curve. Is there any particular reason
to choose one series over the other when doing the interpolation?
In general there
does not seem to be, unless, for example, the emission-line
light curve is markedly smoother than the continuum light
curve (on account of the time-smearing effect) or one light
curve is much better sampled than the other. Usually,
one computes the CCF twice, interpolating
once in each series, and then averages the results ^{28}.
- Interpolating function. Here we have considered
only point-to-point linear interpolation, which results in
first-derivative discontinuities in the light curves and
the CCFs. Is there any advantage to using higher-order
interpolation functions in the time series? Generally, no:
higher-order functions do not improve the CCFs in any sense,
and can be grossly misleading, because higher-order fits based
on only a few data points become hard to control.
- Resolution of the CCF. Can the accuracy of
a cross-correlation result be better than the typical
sampling interval? Yes, it can, as long as the functions
involved are reasonably smooth. The analogy that is usually
drawn is that one can measure image centroids to far higher
accuracy than the image size, which is true because
both stellar images and instrumental point-spread functions
are generally smooth and symmetric. Statistical tests as
described below suggest that uncertainties of about
half the sampling interval are routinely obtained.
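
The edge-effect and interpolation points above can be made concrete with a short sketch. The Python below is a minimal illustration under stated assumptions, not the procedure of any particular published code. All function names (interp_linear, ccf_one_way, iccf, confidence_fisher) are hypothetical, and positive lag is taken to mean that the emission line lags the continuum. At each lag it drops points with no partner in the other series, recomputes the mean and standard deviation from the surviving pairs only, and averages the two one-way interpolated CCFs. For significance it swaps in the Fisher z approximation, chosen only because it needs nothing beyond the standard library; an exact test on r would use the Student t distribution.

```python
import math
from bisect import bisect_right
from statistics import fmean, pstdev


def interp_linear(t, times, values):
    """Point-to-point linear interpolation at time t (times sorted ascending)."""
    i = min(max(bisect_right(times, t) - 1, 0), len(times) - 2)
    frac = (t - times[i]) / (times[i + 1] - times[i])
    return values[i] + frac * (values[i + 1] - values[i])


def ccf_one_way(t_fixed, y_fixed, t_other, y_other, shift):
    """Correlation coefficient at one lag, interpolating only in the other series.

    Each fixed point at time t is paired with the other series at t + shift.
    Points whose partner would fall outside the observed span are dropped,
    and the mean and standard deviation are recomputed from the surviving
    pairs only. Returns (r, number of contributing pairs).
    """
    pairs = [
        (y, interp_linear(t + shift, t_other, y_other))
        for t, y in zip(t_fixed, y_fixed)
        if t_other[0] <= t + shift <= t_other[-1]
    ]
    n = len(pairs)
    if n < 3:
        return float("nan"), n
    a, b = zip(*pairs)
    mean_a, mean_b = fmean(a), fmean(b)
    sd_a, sd_b = pstdev(a), pstdev(b)
    if sd_a == 0.0 or sd_b == 0.0:
        return float("nan"), n
    cov = fmean((x - mean_a) * (y - mean_b) for x, y in pairs)
    return cov / (sd_a * sd_b), n


def iccf(t_cont, f_cont, t_line, f_line, lag):
    """Average of the two one-way CCFs at a single lag.

    Assumed convention: positive lag means the line lags the continuum.
    Returns (r, smaller of the two contributing-pair counts).
    """
    # Interpolate in the line light curve: pair C(t) with L(t + lag).
    r1, n1 = ccf_one_way(t_cont, f_cont, t_line, f_line, +lag)
    # Interpolate in the continuum light curve: pair L(t) with C(t - lag).
    r2, n2 = ccf_one_way(t_line, f_line, t_cont, f_cont, -lag)
    return 0.5 * (r1 + r2), min(n1, n2)


def confidence_fisher(r, n):
    """Approximate two-sided confidence that the true correlation is nonzero.

    Fisher z approximation: atanh(r) * sqrt(n - 3) is treated as a standard
    normal deviate. n must be the number of points actually contributing.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erf(abs(z) / math.sqrt(2.0))


# Worked example from the text: r_peak = 0.45, with 30 contributing points
# versus (erroneously) all 50 points in the series.
conf_30 = confidence_fisher(0.45, 30)
conf_50 = confidence_fisher(0.45, 50)
```

Evaluating iccf over a grid of trial lags and taking the maximum gives r_{peak} together with the number of points that actually contributed at that lag; that count, not the length of the series, is what should be passed to confidence_fisher. Under this approximation, conf_30 and conf_50 should come out close to the confidence levels quoted above for N = 30 and N = 50.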