4.4. Computational Issues

The foregoing discussion has concentrated on conceptual issues and has paid little attention to several computational issues that must be dealt with when actually computing CCFs [94, 92, 87].

1. Edge effects. Only at zero lag do all the points of the time series enter into the calculation of the correlation coefficient; at any other lag, points at the ends of the time series drop out of the calculation since there are no points in the other series to which they can be matched (see Fig. 24). This means that at larger and larger lags, fewer and fewer points contribute to the calculation of r. This has two significant consequences:

1. Normalization. Correct normalization of the CCF requires the correct mean and standard deviation for each series, as can be seen from the definition of the correlation coefficient r (Eq. (31)). Since AGN time series are limited in duration, the mean and standard deviation of each series change as points near the edges drop out of the calculation; in statistical language, the series are "non-stationary." Thus, the mean and standard deviation of each series need to be recalculated at every lag, using only the data points that actually contribute to the calculation. Where interpolated values from one series are used, the mean and standard deviation should be those of the interpolated points, not the original points.

2. Significance. Once the peak value of the CCF, rpeak, has been found, we want to know whether the correlation is "statistically significant", i.e., whether it is likely to be real or spurious. The statistical significance of the correlation depends on the number of points that contribute to the calculation of rpeak, not on the total number of points in the series. Suppose, for example, that we have two time series of N = 50 points each, but that the maximum value of the CCF (say, rpeak = 0.45) occurs at such a large lag that only N = 30 of the points actually contribute at this lag. For a linear correlation coefficient of r = 0.45 and N = 30, the correlation is significant at about the 98% confidence level. If we erroneously use N = 50, however, we would conclude that the level of significance is about 99.9%, clearly a major difference. Using the total number of points in the series rather than the number actually contributing to rpeak therefore overestimates the significance of the correlations we detect.
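As a minimal sketch of the two edge-effect consequences above, the Python below computes r at a single lag from only the pairs that overlap, recomputing the mean and standard deviation from those pairs, and then converts r and the number of contributing points into a two-sided confidence level. The function names, the sign convention (a positive lag pairs C(t) with the interpolated L(t + lag)), and the use of Student's t distribution with n - 2 degrees of freedom are assumptions made for illustration; the text does not say which significance test was used.

```python
import math


def interp_linear(t, ts, ys):
    """Linearly interpolate the series (ts, ys) at time t; None outside its span."""
    if t < ts[0] or t > ts[-1]:
        return None
    for i in range(len(ts) - 1):
        if t <= ts[i + 1]:
            w = (t - ts[i]) / (ts[i + 1] - ts[i])
            return ys[i] + w * (ys[i + 1] - ys[i])
    return ys[-1]  # only reached for a single-point series


def ccf_at_lag(tc, cont, tl, line, lag):
    """Return (r, n): correlation at one lag and the number of contributing pairs.

    Assumed convention: C(t) is paired with L(t + lag), interpolating in the line
    series. Means and standard deviations use the n contributing pairs only.
    """
    pairs = []
    for t, c_val in zip(tc, cont):
        l_val = interp_linear(t + lag, tl, line)
        if l_val is not None:  # points with no counterpart drop out
            pairs.append((c_val, l_val))
    n = len(pairs)
    if n < 3:
        return float("nan"), n
    mean_c = sum(a for a, _ in pairs) / n
    mean_l = sum(b for _, b in pairs) / n
    var_c = sum((a - mean_c) ** 2 for a, _ in pairs) / n
    var_l = sum((b - mean_l) ** 2 for _, b in pairs) / n
    if var_c == 0 or var_l == 0:
        return float("nan"), n
    cov = sum((a - mean_c) * (b - mean_l) for a, b in pairs) / n
    return cov / math.sqrt(var_c * var_l), n


def t_pdf(x, dof):
    """Probability density of Student's t distribution with dof degrees of freedom."""
    log_norm = (math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
                - 0.5 * math.log(dof * math.pi))
    return math.exp(log_norm - (dof + 1) / 2 * math.log1p(x * x / dof))


def confidence(r, n, steps=2000):
    """Two-sided confidence that r differs from zero for n points (illustrative helper)."""
    if abs(r) >= 1:
        return 1.0
    dof = n - 2
    t = abs(r) * math.sqrt(dof / (1 - r * r))
    h = t / steps  # Simpson's rule on [0, t]; steps must be even
    total = t_pdf(0.0, dof) + t_pdf(t, dof)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, dof)
    return 2 * total * h / 3  # P(|T| < t)


# The example from the text: r_peak = 0.45 judged with 30 versus 50 points.
print(confidence(0.45, 30))  # n = points actually contributing at the peak lag
print(confidence(0.45, 50))  # n = full series length, which overstates significance
```

In practice the n returned by `ccf_at_lag` at the peak lag is the value to pass to `confidence`, rather than the length of either series.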

2. Interpolation scheme. There are a couple of issues that arise in this regard:

1. Which series? In the examples shown in Fig. 24, we have interpolated in the emission-line light curve. Is there any particular reason to choose one series over the other when doing the interpolation? In general there does not seem to be, unless, for example, the emission-line light curve is markedly smoother than the continuum light curve (on account of the time-smearing effect) or one light curve is much better sampled than the other. Usually, one computes the CCF twice, interpolating once in each series, and then averages the results [28].

2. Interpolating function. Here we have considered only point-to-point linear interpolation, which results in first-derivative discontinuities in the light curves and the CCFs. Is there any advantage to using higher-order interpolation functions in the time series? Generally, no: higher-order functions do not improve the CCFs in any sense and can be grossly misleading, because higher-order fits based on only a few data points become hard to control.
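To illustrate why higher-order interpolation can mislead, the sketch below compares point-to-point linear interpolation with a single polynomial passed through every point (Lagrange form) for a hypothetical light curve: 11 daily samples of a narrow flare. The flare shape, the sampling, and the function names are invented for this example and are not taken from the data discussed here.

```python
def lagrange(t, ts, ys):
    """Evaluate the unique polynomial through all (ts, ys) at time t."""
    total = 0.0
    for i, (ti, yi) in enumerate(zip(ts, ys)):
        term = yi
        for j, tj in enumerate(ts):
            if j != i:
                term *= (t - tj) / (ti - tj)
        total += term
    return total


def linear(t, ts, ys):
    """Point-to-point linear interpolation at time t (t must lie within ts)."""
    for i in range(len(ts) - 1):
        if ts[i] <= t <= ts[i + 1]:
            w = (t - ts[i]) / (ts[i + 1] - ts[i])
            return ys[i] + w * (ys[i + 1] - ys[i])
    raise ValueError("t outside the sampled span")


# Hypothetical light curve: a narrow flare centred on day 5, sampled once per day.
ts = [float(k) for k in range(11)]
ys = [1.0 / (1.0 + (t - 5.0) ** 2) for t in ts]

grid = [k / 20 for k in range(201)]  # evaluate every 0.05 day
poly = [lagrange(t, ts, ys) for t in grid]
lin = [linear(t, ts, ys) for t in grid]

print("data range:      ", min(ys), max(ys))
print("linear range:    ", min(lin), max(lin))
print("polynomial range:", min(poly), max(poly))
```

The linear interpolant can never leave the range of the measured fluxes, whereas the degree-10 polynomial oscillates well outside it near the ends of the series, so any CCF built on it would be correlating features that are not in the data.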

3. Resolution of the CCF. Can the accuracy of a cross-correlation result be better than the typical sampling interval? Yes, it can, as long as the functions involved are reasonably smooth. The analogy that is usually drawn is that one can measure image centroids to far higher accuracy than the image size, which is true because both stellar images and instrumental point-spread functions are generally smooth and symmetric. Statistical tests as described below suggest that uncertainties of about half the sampling interval are routinely obtained.
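The two-way averaging described under "Which series?" and the sub-sampling resolution discussed in the last item can be illustrated together. The sketch below is a noise-free toy case, not a reproduction of any published analysis: two synthetic light curves are sampled at unit intervals with a true lag of 4.3 (deliberately not a multiple of the sampling interval), the CCF is computed once interpolating in each series and averaged, and the peak is located on a lag grid of step 0.1. The signal shape, the lag grid, and all function names are assumptions made for this example.

```python
import math
from bisect import bisect_right


def interp_linear(t, ts, ys):
    """Linear interpolation at t; None if t is outside the sampled span."""
    if t < ts[0] or t > ts[-1]:
        return None
    i = min(bisect_right(ts, t), len(ts) - 1)  # right-hand neighbour of t
    w = (t - ts[i - 1]) / (ts[i] - ts[i - 1])
    return ys[i - 1] + w * (ys[i] - ys[i - 1])


def pearson(a, b):
    """Correlation coefficient of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return sab / math.sqrt(saa * sbb)


def one_way(t_fix, y_fix, t_int, y_int, shift):
    """r between y_fix(t) and y_int interpolated at t + shift, overlapping pairs only."""
    kept, interped = [], []
    for t, y in zip(t_fix, y_fix):
        v = interp_linear(t + shift, t_int, y_int)
        if v is not None:
            kept.append(y)
            interped.append(v)
    return pearson(kept, interped)


def ccf_two_way(tc, cont, tl, line, lag):
    """Average of the CCF values obtained by interpolating in each series in turn."""
    r_line = one_way(tc, cont, tl, line, lag)    # C(t) vs interpolated L(t + lag)
    r_cont = one_way(tl, line, tc, cont, -lag)   # L(t) vs interpolated C(t - lag)
    return 0.5 * (r_line + r_cont)


def signal(t):
    """Smooth synthetic variation (hypothetical, chosen only for illustration)."""
    return math.sin(2 * math.pi * t / 25) + 0.5 * math.sin(2 * math.pi * t / 11 + 1.0)


TRUE_LAG = 4.3
times = [float(k) for k in range(60)]           # sampling interval of 1
cont = [signal(t) for t in times]
line = [signal(t - TRUE_LAG) for t in times]    # line responds TRUE_LAG later

lags = [0.1 * k for k in range(101)]            # trial lags from 0 to 10
best = max(lags, key=lambda lag: ccf_two_way(times, cont, times, line, lag))
print(f"peak lag = {best:.1f}, true lag = {TRUE_LAG}, sampling interval = 1")
```

Because both light curves here are smooth and noise-free, this shows only that the sampling interval is not a hard limit on lag accuracy; realistic uncertainties have to come from statistical tests on noisy, irregularly sampled data.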