
1.1. Survey Pipeline Software

The actual gathering and processing of the raw survey data encompasses many steps, which can often be performed using a software pipeline. The first step involves hardware-specific data acquisition software, used to operate the telescopes and the instruments themselves. In principle this is not very different from the general astronomical software used for such purposes, except that sky surveying modes tend to require a larger data throughput and a very stable, reliable operation over long stretches of time than is the case for most astronomical observing. In most cases, additional flux calibration data are taken, possibly with separate instruments or at different times. Because a survey often takes many years to complete, a great deal of care must be exercised in monitoring its overall performance in order to ensure a uniform data quality.

The next step is the removal of instrumental effects, e.g., flat-fielding of CCD images or subtraction of dark current for infrared detectors. Other than the sheer size of the data, this process is very similar to the usual astronomical data reduction techniques.

At this point one performs some kind of automated source detection on the individual survey images, be they CCD frames, drift scans, photographic plate scans, their subsections, or whatever other ``raw'' image format is being generated by the survey. This process requires a good understanding of the noise properties, which determine the choice of a detection significance threshold: one wants to go as deep as possible, but not count the noise peaks. In other words, one wants to maximize the completeness (the fraction of real sources detected) while minimizing the contamination (the fraction of noise peaks mistaken for real sources). Typically one aims for a completeness of at least 90% and a contamination of less than 10% in the first pass; the source catalogs are then purified further in subsequent processing steps.
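
As a simple illustration of these two quantities, the sketch below estimates completeness and contamination by cross-matching a detection list against a deeper ``truth'' catalog (e.g., from simulations or deeper reference imaging). The function name, the match radius, and the catalog format are illustrative assumptions, not a prescription from the text.

```python
import numpy as np
from scipy.spatial import cKDTree

def completeness_contamination(true_xy, det_xy, match_radius=2.0):
    """Estimate completeness and contamination of a detection list by
    positional cross-matching against a deeper "truth" catalog.
    Positions are (N, 2) arrays in pixels; match_radius is the maximum
    separation for a detection to count as real."""
    det_tree = cKDTree(det_xy)
    true_tree = cKDTree(true_xy)
    # Completeness: fraction of true sources with a detection nearby.
    d_true, _ = det_tree.query(true_xy, k=1)
    completeness = np.mean(d_true <= match_radius)
    # Contamination: fraction of detections with no true counterpart nearby.
    d_det, _ = true_tree.query(det_xy, k=1)
    contamination = np.mean(d_det > match_radius)
    return completeness, contamination
```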

Most source detection algorithms require a certain minimum number of adjacent or connected pixels above some signal-to-noise threshold for detection. The optimal choice of these thresholds depends on the power spectrum of the noise. In many cases, the detection process involves some type of smoothing or optimal filtering, e.g., with a Gaussian whose width approximates that of an unresolved point source. Unfortunately, this also builds in a preferred scale for source detection, usually optimized for the unresolved sources (e.g., stars) or the barely resolved ones (e.g., faint galaxies), which are the majority. This is a practical solution, but it carries obvious selection biases: the detection of a source depends not only on its flux, but also on its shape and contrast, so there is almost always a limiting surface brightness (averaged over some specific angular scale) in addition to the limiting flux. The possibility that large populations of low surface brightness galaxies are being missed has been debated extensively in the literature. The truth is that a surface brightness bias is always present at some level, whether or not it is actually important. Novel approaches to source detection (or, more accurately, structure detection) involve so-called multi-scale techniques.
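
A minimal sketch of this kind of detection, assuming roughly white background noise and a PSF-sized Gaussian filter, is given below; the parameter names and default values (detection threshold, minimum pixel count) are placeholders rather than recommended settings.

```python
import numpy as np
from scipy import ndimage

def detect_sources(image, sky_sigma, fwhm_pix=3.0, nsigma=1.5, min_pix=5):
    """Smooth with a Gaussian approximating the PSF, threshold the result
    at nsigma times the (smoothed) sky noise, and keep only groups of at
    least min_pix connected pixels."""
    sigma = fwhm_pix / 2.355                    # Gaussian sigma from the FWHM
    smoothed = ndimage.gaussian_filter(image, sigma)
    # For white noise, smoothing with a unit-sum 2-D Gaussian reduces the
    # per-pixel noise by roughly a factor of 2*sqrt(pi)*sigma.
    noise = sky_sigma / (2.0 * np.sqrt(np.pi) * sigma)
    mask = smoothed > nsigma * noise
    labels, nlab = ndimage.label(mask)          # 4-connected pixel groups
    sizes = ndimage.sum(mask, labels, index=np.arange(1, nlab + 1))
    keep = np.where(sizes >= min_pix)[0] + 1    # label IDs passing the size cut
    return labels, keep
```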

Once individual sources are detected, a number of structural parameters are measured for them, including fluxes in a range of apertures, various diameters, radial moments of the light distribution, etc., from which a suitably defined, intensity-weighted centroid is computed. In most cases, the sky background intensity level is determined locally, e.g., in a large aperture surrounding each source; crowding and contamination by other nearby sources can present problems and create detection and measurement biases. Another difficult problem is the deblending, or splitting, of adjacent sources, typically defined as a number of distinct, adjacent intensity peaks connected above the detection surface brightness threshold. A proper approach keeps track of the hierarchy of split objects, usually called the parent object (the blended composite), the child objects (the first-level splits), etc. Dividing the total flux between them and assigning other structural parameters to them are nontrivial issues, and depend on the nature of the data and the intended scientific applications.
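
The sketch below illustrates the simplest versions of these measurements for a single source: a local sky level from the median in an annulus, a sky-subtracted aperture flux, and an intensity-weighted centroid. The aperture radii and the function name are illustrative assumptions; real pipelines must also handle crowding and deblending.

```python
import numpy as np

def measure_source(image, x0, y0, r_ap=5.0, r_in=10.0, r_out=15.0):
    """Local sky from the median in an annulus [r_in, r_out), a
    sky-subtracted flux in a circular aperture of radius r_ap, and an
    intensity-weighted centroid, all around an approximate position."""
    ny, nx = image.shape
    y, x = np.mgrid[0:ny, 0:nx]
    r = np.hypot(x - x0, y - y0)
    sky = np.median(image[(r >= r_in) & (r < r_out)])   # local background level
    in_ap = r < r_ap
    data = image[in_ap] - sky
    flux = data.sum()                                    # aperture flux above sky
    xc = np.sum(data * x[in_ap]) / flux                  # intensity-weighted centroid
    yc = np.sum(data * y[in_ap]) / flux
    return flux, xc, yc, sky
```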

Object detection and parameter measurement modules in survey processing systems often use (or are based on) some standard astronomical program intended for such applications, e.g., FOCAS, SExtractor, or DAOPHOT, to mention just a few of the programs often used in the 1990s. Such programs are well documented in the literature. Even if custom software is developed for these tasks, the technical issues are very similar. It is generally true that all such systems are built with certain assumptions about the properties of the sources to be detected and measured, and are optimized for a particular purpose, e.g., detection of faint galaxies, or accurate stellar photometry. Such data may serve most users well, but there is always a possibility that custom reprocessing for a given scientific purpose may be needed.
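
As an illustration of how such a package is typically driven, the sketch below uses SEP, a Python library that exposes the core SExtractor background-estimation and extraction algorithms. The threshold, minimum area, and data handling shown here are assumed defaults for the example, not settings taken from the text.

```python
import numpy as np
import sep   # Python bindings to the core SExtractor algorithms

def extract_catalog(image, nsigma=1.5, min_area=5):
    """Background-subtract an image and extract a source catalog."""
    data = np.ascontiguousarray(image, dtype=np.float32)  # SEP expects native-endian floats
    bkg = sep.Background(data)              # spatially varying background model
    data_sub = data - bkg                   # background-subtracted image
    # Detect groups of at least min_area connected pixels that lie
    # nsigma times above the global background RMS.
    objects = sep.extract(data_sub, nsigma, err=bkg.globalrms, minarea=min_area)
    return objects   # structured array with x, y, flux, shape parameters, ...
```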

At this point (or further down the line) astrometric and flux calibrations are applied to the data, using the measured source positions and instrumental fluxes. Most surveys are designed so that improved calibrations can be reapplied at any stage. In some cases, it is better to apply such calibration after the object classification (see below), as the transformations may be different for the unresolved and the resolved sources. Once the astrometric solutions are applied, catalogs from adjacent or overlapping survey images can be stitched together.
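
In the simplest case, these two calibrations amount to fitting a photometric zero point from standard stars and a linear (affine) plate solution from astrometric reference stars, as in the sketch below; real pipelines add color terms, distortion terms, and a proper tangent-plane projection, and the function names here are hypothetical.

```python
import numpy as np

def photometric_zeropoint(inst_flux, standard_mag):
    """Zero point ZP from calibration stars, assuming m = -2.5*log10(flux) + ZP."""
    return np.median(standard_mag + 2.5 * np.log10(inst_flux))

def linear_plate_solution(xy_pix, xi_eta_ref):
    """Least-squares affine fit mapping pixel coordinates to tangent-plane
    (xi, eta) coordinates of astrometric reference stars."""
    n = len(xy_pix)
    A = np.column_stack([xy_pix, np.ones(n)])          # design matrix [x, y, 1]
    coeffs, *_ = np.linalg.lstsq(A, xi_eta_ref, rcond=None)
    return coeffs                                      # 3x2 array of plate constants
```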

Object classification, e.g., as stars or galaxies in the visible and near-IR surveys, but more generally as resolved and unresolved sources, is one of the key issues. Classification is an important aspect of characterizing the astrophysical content of a given sky survey, and for many scientific applications one wants either stars (i.e., unresolved objects) or galaxies; consider, for example, studies of Galactic structure (which need stars) and studies of the large-scale structure in the universe (which need galaxies). More detailed morphological classification, e.g., Hubble types of the detected galaxies, may also be performed if the data contain sufficient discriminating information to enable it. Given the large data volumes involved in digital sky surveys, object classification must be automated, and in order to make it really useful, it has to be as reliable and objective as possible, and homogeneous over the entire survey.

In most cases, object classification is based on some quantitative measures of the image morphology of the detected sources. For example, star-galaxy separation in optical and near-IR surveys uses the fact that all stars (and also quasars) appear as unresolved point sources, so that the observed shape of their light distribution is given by the point-spread function, whereas galaxies are more extended. This may be quantified through various measures of the object's radial shape, e.g., moments of the light distribution in various combinations. The problem of star-galaxy separation thus becomes a problem of defining a boundary in some parameter space of observed object properties which divides the two classes. In the simplest approaches such a dividing line or surface is set empirically, but more sophisticated techniques use artificial intelligence methods, such as artificial neural networks or decision trees (artificial induction software). These require a training data set of objects for which the classification is known accurately from some independent observations. Because of this additional information input, such techniques can outperform methods in which the survey data alone are used to decide on the correct object classifications.
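
A minimal sketch of this supervised approach, assuming a training set with independently known labels and a few morphological parameters per object (the particular features named below are illustrative choices), might look as follows using a decision tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_star_galaxy_classifier(train_features, train_labels, max_depth=6):
    """Fit a decision tree that separates stars from galaxies.
    train_features: (N, M) array of morphological parameters per object,
      e.g. a concentration index, a PSF-normalized size, peak surface brightness.
    train_labels:   (N,) array, 0 = star, 1 = galaxy, taken from independent
      deeper or higher-resolution observations."""
    clf = DecisionTreeClassifier(max_depth=max_depth)
    clf.fit(train_features, train_labels)
    return clf

# Applying the trained classifier to the survey catalog:
#   classes = clf.predict(survey_features)
#   probs   = clf.predict_proba(survey_features)   # per-class probabilities
```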

There are several practical problems in this task. First, fainter galaxies are smaller in angular extent, and thus approach stars in their appearance. At the fainter flux levels the measurements are also noisier, so that the two types of objects eventually become indistinguishable. This sets a classification limit for most optical and near-IR surveys, which is typically at a flux level a few times higher than the detection limit. Second, the shape of the point-spread function may vary over the span of the survey, e.g., due to the inevitable seeing variations. This may be partly overcome by defining the point-spread function locally, and normalizing the structural parameters of objects so that unresolved sources look the same over the entire survey. In other words, one must define an unresolved-source template which is valid locally, but may (and usually does) vary globally. Furthermore, this has to be done automatically and reliably over the entire survey data domain, which may be very heterogeneous in depth and intrinsic resolution. Additional problems include object blending, saturation of the signal at bright flux levels, detector nonlinearities, etc.
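
One simple way to implement such a local normalization, sketched below under the assumption that bright, unsaturated point sources have already been identified in each region, is to bin the survey area into coarse sky cells, take the median size of those point sources per cell, and divide every object's measured size by the local point-source value.

```python
import numpy as np
from scipy.stats import binned_statistic_2d

def normalize_to_local_psf(x, y, size, is_psf_star, nbins=8):
    """Divide a structural parameter (here an object 'size', e.g. FWHM)
    by the local point-source value, estimated as the median over bright
    unresolved sources in coarse sky cells. Cells without PSF stars
    would need special handling (left as NaN here)."""
    stat, xedges, yedges, _ = binned_statistic_2d(
        x[is_psf_star], y[is_psf_star], size[is_psf_star],
        statistic="median", bins=nbins)
    ix = np.clip(np.digitize(x, xedges) - 1, 0, nbins - 1)
    iy = np.clip(np.digitize(y, yedges) - 1, 0, nbins - 1)
    local_psf = stat[ix, iy]
    return size / local_psf     # ~1 for unresolved sources across the survey
```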

The net result is that automated object classification is inevitably statistical in nature. Accuracies better than 90% are usually required, but accuracies much above about 95% are generally hard to achieve, especially at faint flux levels.

In other situations, e.g., where the angular resolution of the data is poor, or where nonthermal processes dominate the observed flux, the morphology of the objects may carry little useful information, and other approaches are necessary. Flux ratios in different bandpasses, i.e., the shape of the spectrum, may then be useful in separating different physical classes of objects.
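
At its crudest, this amounts to a cut in a color (a logarithmic flux ratio), as in the brief sketch below; the bands, the cut value, and the class names are placeholders, and real classifiers combine several colors or fit template spectra.

```python
import numpy as np

def color_classify(flux_band1, flux_band2, color_cut=0.5):
    """Assign objects to two broad classes by a single color cut,
    where color = -2.5*log10(f1/f2)."""
    color = -2.5 * np.log10(flux_band1 / flux_band2)
    return np.where(color > color_cut, "class_A", "class_B")
```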
