
5. FROM THE RAW DATA TO SCIENCE-READY ARCHIVES

Surveys, being highly data-intensive ventures where uniformity of data products is very important, pose a number of data processing and analysis challenges (Djorgovski & Brunner 2001). Broadly speaking, the steps along the way include: obtaining the data (telescope, observatory, and instrument control software); on-site data processing, if any; detection of discrete sources and measurement of their parameters in imaging surveys, or extraction of 1-dimensional spectra and measurement of the pertinent features in spectroscopic ones; data calibrations; archiving and dissemination of the results; and finally the scientific analysis and exploration. Typically, there is a hierarchy of ever more distilled and value-added data products, starting from the raw instrument output and ending with increasingly sophisticated descriptions of the detected objects. The great diversity of astronomical instruments and types of data, with their specific processing requirements, is addressed elsewhere in these volumes. Likewise, the data archives, virtual observatory and astroinformatics issues, data mining, and the problem-specific scientific analysis are beyond the scope of this review. Here we address the intermediate steps that are particularly relevant for the processing and dissemination of survey data.

Many relevant papers for this subject can be found in the Astronomical Data Analysis and Software Systems (ADASS) and Astronomical Data Analysis (ADA) conference series, and in the SPIE volumes that cover astronomical instruments and software. Another useful reference is the volume edited by Graham, Fitzpatrick, & McGlynn (2008).

5.1. Data Processing Pipelines

The actual gathering and processing of the raw survey data encompass many steps, which can often be performed using a dedicated software pipeline that is usually optimized for the particular instrument and for the desired data output. That by itself may introduce some built-in biases, but if the original raw data are kept, they can always be reprocessed with improved or alternative pipelines.

Increasingly, we see surveys testing their pipelines extensively with simulated data, well before the actual hardware is built. This may reflect a cultural influence of high-energy physics, whose practitioners are increasingly participating in the major survey projects and for whom data simulations are an essential tool. However, one cannot simulate the problems that are discovered only when the real data are flowing.

The first step involves hardware-specific data acquisition software, used to operate the telescopes and the instruments themselves. In principle, this is not very different from the general astronomical software used for such purposes, except that sky surveying generally requires a larger data throughput and a very stable and reliable operation over long stretches of time, considerably more so than is the case for most astronomical observing. In most cases, additional flux calibration data are taken, possibly with separate instruments or at different times. Because a survey often takes several years to complete, a great deal of care must be exercised in monitoring its overall performance in order to ensure a uniform data quality.

Once the raw images, spectra, data cubes, or time series are converted into a form that has the instrumental signatures removed, and the data are represented as a linear intensity as a function of the spatial coordinates, wavelength, time, etc., the process of source detection and characterization starts. This requires a good understanding of the instrumental noise properties, which determines some kind of detection significance threshold: one wants to go as deep as possible, but not count the noise peaks. In other words, we always try to maximize the completeness (the fraction of the real sources detected) while minimizing the contamination (the fraction of the noise peaks mistaken for real sources). In the linear regime of a given detector, the former should be as close to unity, and the latter as close to zero, as possible. Both deteriorate at the fainter flux levels as the S/N drops. Typically, a detection limit is taken as the flux level where the completeness falls below 90% or so, and the contamination increases above 10% or so. However, truly significant detections actually occur at somewhat brighter flux levels.
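
In practice, completeness and contamination are often estimated by injecting artificial sources into the images and counting how many are recovered. A minimal sketch of such a calculation (in Python) is given below; the function name and the binning are purely illustrative, and the matching of recovered detections to the injected sources is assumed to have been done already.

    import numpy as np

    def completeness_contamination(injected, recovered, spurious, bins):
        """Per-bin completeness and contamination from an artificial-source test.

        injected  : fluxes of all artificial sources added to the images
        recovered : fluxes of the injected sources that the pipeline detected
        spurious  : fluxes of detections with no real counterpart (noise peaks)
        bins      : flux bin edges
        """
        n_inj, _ = np.histogram(injected, bins)
        n_rec, _ = np.histogram(recovered, bins)
        n_spur, _ = np.histogram(spurious, bins)
        completeness = n_rec / np.maximum(n_inj, 1)
        contamination = n_spur / np.maximum(n_rec + n_spur, 1)
        return completeness, contamination

The detection limit is then quoted as the faintest flux bin that still satisfies the adopted criteria (e.g., completeness above 90% and contamination below 10%).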

Most source detection algorithms require a certain minimum number of adjacent or connected pixels above some signal-to-noise thresholds for detection. The optimal choice of these thresholds depends on the power spectrum of the noise. In many cases, the detection process involves some type of smoothing or optimal filtering, e.g., with a Gaussian whose width approximates that of an unresolved point source. Unfortunately, this also builds in a preferred scale for source detection, usually optimized for the unresolved sources (e.g., stars) or the barely resolved ones (e.g., faint galaxies), which are the majority. This is a practical solution, but it carries obvious selection biases: the detection of sources depends not only on their flux, but also on their shape or contrast, so there is almost always a limiting surface brightness (averaged over some specific angular scale) in addition to the limiting flux. A surface brightness bias is always present at some level, whether or not it is actually important for a given scientific goal. Novel approaches to source, or, more accurately, structure detection involve so-called multi-scale techniques (e.g., Aragon-Calvo et al. 2007).
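
A minimal sketch of this standard approach (a Gaussian matched filter followed by a connected-pixel criterion) is shown below; it assumes a background-subtracted image with approximately uniform Gaussian noise, and the parameter values are only illustrative.

    import numpy as np
    from scipy import ndimage

    def detect_sources(image, sigma_noise, psf_fwhm_pix=2.5,
                       threshold_sigma=3.0, min_pixels=5):
        """Matched-filter detection with a connected-pixel requirement."""
        # Smooth with a Gaussian whose width approximates the point-spread function.
        sigma_psf = psf_fwhm_pix / 2.355
        smoothed = ndimage.gaussian_filter(image, sigma_psf)
        # For white noise, smoothing with a normalized 2-D Gaussian reduces the
        # rms by a factor 2*sqrt(pi)*sigma_psf (adequate when sigma_psf >~ 1 pixel).
        noise_smoothed = sigma_noise / (2.0 * np.sqrt(np.pi) * sigma_psf)
        mask = smoothed > threshold_sigma * noise_smoothed
        # Require a minimum number of connected pixels above the threshold.
        labels, nlab = ndimage.label(mask)
        sizes = ndimage.sum(mask, labels, index=np.arange(1, nlab + 1))
        accepted = np.where(sizes >= min_pixels)[0] + 1
        return labels, accepted   # label image and IDs of the accepted detections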

Once individual sources are detected, a number of photometric and structural parameters are measured for them, including fluxes in a range of apertures, various diameters, radial moments of the light distribution, etc., from which a suitably defined, intensity-weighted centroid is computed. In most cases, the sky background intensity level is determined locally, e.g., in a large aperture surrounding each source; crowding and contamination by other nearby sources can present problems and create detection and measurement biases. Another difficult problem is deblending or splitting of adjacent sources, typically defined as a number of distinct, adjacent intensity peaks connected above the detection surface brightness threshold. A proper approach keeps track of the hierarchy of split objects, usually called the parent object (the blended composite), the child objects (the first-level splits), and so on. Dividing the total flux between them and assigning other structural parameters to them are nontrivial issues, which depend on the nature of the data and the intended scientific applications.
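
The sketch below illustrates the simplest version of these measurements: an aperture flux with a local sky estimate from a surrounding annulus, and an intensity-weighted centroid. It ignores the complications of crowding and deblending discussed above, and the aperture radii are arbitrary.

    import numpy as np

    def aperture_photometry(image, x0, y0, r_ap=5.0, r_in=8.0, r_out=12.0):
        """Aperture flux, local sky, and intensity-weighted centroid for one source."""
        yy, xx = np.indices(image.shape)
        r = np.hypot(xx - x0, yy - y0)
        # Local sky background from the median in an annulus around the source;
        # crowded fields need a more robust (e.g., sigma-clipped) estimate.
        sky = np.median(image[(r >= r_in) & (r < r_out)])
        in_ap = r < r_ap
        signal = image[in_ap] - sky
        flux = signal.sum()
        # Intensity-weighted centroid within the aperture.
        xc = np.sum(xx[in_ap] * signal) / flux
        yc = np.sum(yy[in_ap] * signal) / flux
        return flux, xc, yc, sky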

Object detection and parameter measurement modules in survey processing systems often use (or are based on) some standard astronomical program intended for such applications, e.g., FOCAS (Jarvis & Tyson 1981), SExtractor (Bertin & Arnouts 1996), or DAOPHOT (Stetson 1987), to mention just a few popular ones. Such programs are well documented in the literature. Many surveys have adopted modified versions of these programs, optimized for their own data and scientific goals.
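
As an illustration, the core SExtractor algorithms are also available through the Python package sep; a minimal use of it might look as follows (the file name is a placeholder, and the detection threshold is arbitrary).

    import numpy as np
    import sep
    from astropy.io import fits

    data = fits.getdata("survey_image.fits").astype(np.float64)  # placeholder file
    bkg = sep.Background(data)                # spatially varying background model
    data_sub = data - bkg                     # background-subtracted image
    # Extract sources at 1.5 times the global background rms.
    objects = sep.extract(data_sub, 1.5, err=bkg.globalrms)
    # 'objects' is a structured array of positions, fluxes, and shape parameters.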

Even if custom software is developed for these tasks, the technical issues are very similar. It is generally true that all such systems are built with certain assumptions about the properties of sources to be detected and measured, and optimized for a particular purpose, e.g., detection of faint galaxies, or accurate stellar photometry. Such data may serve most users well, but there is always a possibility that a custom reprocessing for a given scientific purpose may be needed.

At this point (or further down the line) astrometric and flux calibrations are applied to the data, using the measured source positions and instrumental fluxes. Most surveys are designed so that improved calibrations can be reapplied at any stage. In some cases, it is better to apply such calibration after the object classification (see below), as the transformations may be different for the unresolved and the resolved sources. Once the astrometric solutions are applied, catalogs from adjacent or overlapping survey images can be stitched together.
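
As a sketch of the flux-calibration step, the instrumental magnitudes of detected sources can be tied to a standard system through a photometric zero point determined from stars with known catalog magnitudes; the simple version below ignores color terms and atmospheric extinction.

    import numpy as np

    def fit_zero_point(instrumental_flux, reference_mag):
        """Zero point ZP such that m = ZP - 2.5*log10(instrumental_flux)."""
        inst_mag = -2.5 * np.log10(instrumental_flux)
        # A robust median over the matched reference stars.
        return np.median(reference_mag - inst_mag)

    def calibrate(instrumental_flux, zp):
        """Calibrated magnitudes from instrumental fluxes and the fitted zero point."""
        return zp - 2.5 * np.log10(instrumental_flux)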

In the mid-1990's, the rise of the TB-scale surveys brought the necessity of dedicated, optimized, highly automated pipelines, and databases to store, organize and access the data. One example is DPOSS, initially processed using SKICAT (Weir et al. 1995c), a system that incorporated databases and machine learning, which was still a novelty at that time. SDSS developed a set of pipelines for the processing and cataloguing of images, their astrometric and photometric calibration, and for the processing of spectra; additional, specialized pipelines were added later, to respond to particular scientific needs. For more details, see, e.g., York et al. (2000), Lupton et al. (2001), Stoughton et al. (2002), and the documentation available at the SDSS website.

A major innovation of SDSS (at least for the ground-based data; NASA missions and data centers were also pioneering such practices in astronomy) was the effective use of databases for data archiving and the Web-based interfaces for the data access, and in particular the SkyServer (Szalay, Gray, et al. 2001, 2002). Multiple public data releases were made using this approach, with the last one (DR8) in 2011, covering the extensions (SDSS-II and SDSS-III) of the original survey. By the early 2000's, similar practices were established as standard for most other surveys; for example, the UKIDSS data processing is described by Dye et al. (2006) and Hodgkin et al. (2009).

Synoptic sky surveys added the requirement of data processing in real time, or as close to it as possible, so that transient events can be identified and followed up in a timely fashion. For example, the PQ survey (2003-2008) had three independent pipelines: a traditional one at Yale University (Andrews et al. 2008), an image subtraction pipeline optimized for SN discovery at the LBNL Nearby Supernova Factory (Aldering et al. 2002), and the Palomar-Quest Event Factory (Djorgovski et al. 2008) pipeline at Caltech (2005-2008), optimized for real-time discovery of transient events. The latter served as a basis for the CRTS survey pipeline (Drake et al. 2009). Following in the footsteps of the PQ survey, the PTF survey operates in a very similar manner, with an updated version of the NSNF near-real-time image subtraction pipeline for the discovery of transients, and a non-time-critical pipeline for additional processing at IPAC.
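
The core of such an image-subtraction pipeline can be caricatured as follows; the sketch assumes that the new and reference images have already been astrometrically registered and PSF-matched (in practice the difficult part, solved by fitting a convolution kernel during the subtraction), and the thresholds are illustrative.

    import numpy as np
    from scipy import ndimage

    def find_transient_candidates(new_image, ref_image, sigma_noise,
                                  threshold_sigma=5.0, min_pixels=4):
        """Candidate transients as significant positive residuals in a difference image."""
        diff = new_image - ref_image
        mask = diff > threshold_sigma * sigma_noise
        labels, nlab = ndimage.label(mask)
        candidates = []
        for i in range(1, nlab + 1):
            ys, xs = np.nonzero(labels == i)
            if xs.size >= min_pixels:
                candidates.append((xs.mean(), ys.mean(), diff[ys, xs].sum()))
        return candidates   # (x, y, flux) of each candidate, to be vetted further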

An additional requirement for the synoptic sky surveys is a timely and efficient dissemination of transient events, now accomplished through a variety of electronic publishing mechanisms. Perhaps the first modern example was the Gamma-Ray Coordinates Network (GCN; Barthelmy et al. 2000; http://gcn.gsfc.nasa.gov), which played a key role in cracking the puzzle of the GRBs. For the ground-based surveys, the key effort was the VOEventNet (VOEN; Williams & Seaman 2006; http://voeventnet.caltech.edu), which developed VOEvent, now an adopted standard protocol for the electronic publishing and communication of astronomical events, and deployed it in an experimental robotic telescope network with feedback, using the PQEF as a primary testbed. This effort currently continues through the SkyAlert facility (http://skyalert.org; Williams et al. 2009), which uses the CRTS survey as its primary testbed. A variety of specific event dissemination mechanisms have been deployed, using standard web pages, RSS feeds, and even mobile computing and social media.

5.2. Source and Event Classification

Object classification, e.g., as stars or galaxies in the visible and near-IR surveys, or more generally as resolved and unresolved sources, is one of the key issues. It is an important aspect of characterizing the astrophysical content of a given sky survey, and for many scientific applications one wants either stars (i.e., unresolved objects) or galaxies; consider, for example, studies of the Galactic structure and studies of the large-scale structure in the universe. More detailed morphological classification, e.g., Hubble types of the detected galaxies, may also be performed if the data contain sufficient discriminating information to enable it. Given the large data volumes involved in digital sky surveys, object classification must be automated, and in order to make it really useful, it has to be as reliable and objective as possible, and homogeneous over the entire survey. Often, the classification limit is more relevant than the detection limit for the definition of statistical samples of sources (e.g., stars, galaxies, quasars).

In most cases, object classification is based on some quantitative measurements of the image morphology of the detected sources. For example, star-galaxy separation in optical and near-IR surveys uses the fact that all stars (and also quasars) are unresolved point sources, so that the observed shape of their light distribution is given by the point-spread function, whereas galaxies are more extended. This may be quantified through various measures of the object radial shape or concentration, e.g., moments of the light distribution in various combinations. The problem of star-galaxy separation thus becomes a problem of defining a boundary in some parameter space of observed object properties that would divide the two classes. In the simplest approaches such a dividing line or surface is set empirically, but more sophisticated techniques use artificial intelligence methods, such as Artificial Neural Nets or Decision Trees (e.g., Weir et al. 1995a, Odewahn et al. 2004, Ball et al. 2006, Donalek et al. 2008). They require a training data set of objects for which the classification is known accurately from some independent observations. Because of this additional information input, such techniques can outperform the methods where the survey data alone are used to decide on the correct object classifications.
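
A schematic example of such a supervised classifier, here a Decision Tree operating on illustrative morphological features (e.g., a magnitude and a concentration index), might look as follows; the choice of features, hyperparameters, and labels is not that of any particular survey.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def train_star_galaxy_classifier(train_features, train_is_galaxy):
        """Train on objects whose nature is known from deeper or sharper data.

        train_features : (N, M) array of morphological parameters,
                         e.g., magnitude and a concentration index
        train_is_galaxy: (N,) array of labels, 0 = unresolved, 1 = resolved
        """
        clf = DecisionTreeClassifier(max_depth=5, min_samples_leaf=50)
        clf.fit(train_features, train_is_galaxy)
        return clf

    def classify(clf, survey_features):
        # Probability of being resolved; the threshold applied to it can be
        # chosen differently for different scientific applications.
        return clf.predict_proba(survey_features)[:, 1]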

There are several practical problems in this task. First, fainter galaxies are smaller in angular extent, thus approaching stars in their appearance. At the fainter flux levels the measurements are noisier, and the two types of objects eventually become indistinguishable. This sets a classification limit for most optical and near-IR surveys, typically at a flux level a few times brighter than the detection limit. Second, the shape of the point-spread function may vary over the span of the survey, e.g., due to the inevitable seeing variations. This may be partly overcome by defining the point-spread function locally, and normalizing the structural parameters of objects so that the unresolved sources look the same over the entire survey. In other words, one must define an unresolved-source template that is valid locally, but may (and usually does) vary globally. Furthermore, this has to be done automatically and reliably over the entire survey data domain, which may be very heterogeneous in depth and intrinsic resolution. Additional problems include object blending, saturation of the signal at bright flux levels, detector nonlinearities, etc., all of which modify the source morphology and thus affect the classification.

The net result is that the automated object classification process is always stochastic in nature. Classification accuracies better than 90% are usually required, but accuracies higher than about 95% are generally hard to achieve, especially at faint flux levels.

In other situations, e.g., where the angular resolution of the data is poor, or where nonthermal processes are dominant generators of the observed flux, morphology of the objects may have little meaning, and other approaches are necessary. Flux ratios in different bandpasses, i.e., the spectrum shape, may be useful in separating different physical classes of objects.
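
A schematic example of such a color-based selection is given below; the bandpasses and cut values are placeholders and would in practice be calibrated on objects of known type.

    import numpy as np

    def color_index(flux1, flux2):
        """A 'color' expressed as a magnitude difference between two bandpasses."""
        return -2.5 * np.log10(flux1 / flux2)

    def select_by_colors(f1, f2, f3, c12_range=(-0.5, 0.5), c23_range=(-0.5, 0.5)):
        """Select objects whose spectrum shape falls in a region of color-color space."""
        c12 = color_index(f1, f2)
        c23 = color_index(f2, f3)
        return ((c12 > c12_range[0]) & (c12 < c12_range[1]) &
                (c23 > c23_range[0]) & (c23 < c23_range[1]))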

A much more challenging task is the automated classification of transient events discovered in synoptic sky surveys (Djorgovski et al. 2006, 2011b, Mahabal et al. 2005, 2008a, b, Bloom et al. 2012). Physical classification of the transient sources is the key to their interpretation and scientific uses, and in many cases the scientific returns come from follow-up observations that depend on scarce or costly resources (e.g., observing time at larger telescopes). Since the transients change rapidly, a rapid (as close to real time as possible) classification, prioritization, and follow-up are essential, with the time scale depending on the nature of the source, which is initially unknown. In some cases the initial classification may remove the rapid-response requirement, but even an archival (i.e., not time-critical) classification of transients poses some interesting challenges.

This entails some special challenges beyond the traditional automated classification approaches, which are usually applied in some feature vector space, with an abundance of self-contained data derived from homogeneous measurements. Here, the input information is generally sparse and heterogeneous: there are only a few initial measurements, whose types differ from case to case and whose values have differing variances; the contextual information is often essential, and yet difficult to capture and incorporate in the classification process; many sources of noise, instrumental glitches, etc., can masquerade as transient events in the data stream; and as new, heterogeneous data arrive, the classification must be iterated dynamically. The requirements of a high completeness and a low contamination, and the need to complete the classification process and make an optimal decision about expending valuable follow-up resources (e.g., obtaining additional measurements using a more powerful instrument at a certain cost) in real time, call for some novel approaches.

The first challenge is to assign probabilities that any given event belongs to each of a variety of known classes of variable astrophysical objects, and to update such classifications as more data come in, until a scientifically justified convergence is reached. Perhaps an even more interesting possibility is that a given transient represents a previously unknown class of objects or phenomena, which may register as a low probability of belonging to any of the known data models. The process has to be as automated as possible, robust, and reliable; it has to operate from sparse and heterogeneous data; it has to maintain a high completeness (not miss any interesting events) yet a low false alarm rate; and it has to learn from past experience for an ever improving, evolving performance.
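
One way to formalize this is as an iterative Bayesian update of the class probabilities as each new, independent measurement arrives. The sketch below assumes that per-class likelihoods of the new datum are supplied by some external models (of light curves, colors, context, etc.), and the numbers in the usage example are made up.

    def update_class_probabilities(prior, likelihoods):
        """One Bayesian update step for the classification of a transient.

        prior       : {class_name: P(class)} from the data so far (or from
                      contextual priors, e.g., proximity to a galaxy)
        likelihoods : {class_name: P(new datum | class)} from per-class models
        Returns the posterior probabilities and the evidence; a very low
        evidence suggests that no known class explains the data well,
        i.e., a candidate for a new type of object.
        """
        evidence = sum(prior[c] * likelihoods.get(c, 0.0) for c in prior)
        if evidence <= 0.0:
            return prior, 0.0
        posterior = {c: prior[c] * likelihoods.get(c, 0.0) / evidence
                     for c in prior}
        return posterior, evidence

    # Illustrative usage with invented numbers:
    prior = {"SN": 0.3, "CV": 0.3, "blazar": 0.4}
    likes = {"SN": 0.05, "CV": 0.20, "blazar": 0.02}
    posterior, evidence = update_class_probabilities(prior, likes)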

Much of the initial information that may be used for the event classification is archival, implying a need for a good VO-style infrastructure. Much of the relevant information is also contextual: for example, the light curve and observed properties of a transient might be consistent with it being a cataclysmic variable star, a blazar, or a supernova. If it is subsequently known that there is a galaxy in close proximity, the supernova interpretation becomes much more plausible. Such information, however, can be highly uncertain or missing, and can have a rich structure: if there were two candidate host galaxies, their morphologies, distances, and luminosities become important, e.g., is this type of supernova more consistent with being in the extended halo of a large spiral galaxy, or in close proximity to a faint dwarf galaxy? The ability to incorporate such contextual information in a quantifiable fashion is essential. There is a need to find a means of harvesting human pattern recognition skills, especially in the context of capturing the relevant contextual information, and turning them into machine-processible algorithms.

These challenges are still very much a subject of an ongoing research. Some of the relevant papers and reviews include Mahabal et al. (2010a, b, c), Djorgovski et al. (2011b), Richards et al. (2011), and Bloom & Richards (2012), among others.

5.3. Data Archives, Analysis and Exploration

In general, the data processing flow is from the pixel (image) domain to the catalog domain (detected sources with measured parameters). This usually results in a reduction of the data volume by about an order of magnitude (this factor varies considerably, depending on the survey or the data set), since most pixels do not contain statistically significant signal from resolved sources. However, the ability to store large amounts of digital image information on-line opens up interesting new possibilities, whereby one may want to go back to the pixels and remeasure fluxes or other parameters, on the basis of the catalog information. For example, if a source was detected (i.e., cataloged) in one bandpass, but not in another, it is worth checking if a marginal detection is present even if it did not make it past the statistical significance cut the first time; even the absence of flux is sometimes useful information.
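
Such a "forced" remeasurement at a known catalog position can be sketched as follows; the aperture sizes are illustrative, and a real implementation would use the local point-spread function and noise model.

    import numpy as np

    def forced_photometry(image, sigma_noise, x0, y0,
                          r_ap=4.0, r_in=8.0, r_out=12.0):
        """Flux at a catalog position in another bandpass, detected there or not.

        Returns the background-subtracted aperture flux and its approximate
        signal-to-noise ratio; a low S/N value can still be used as an upper
        limit, e.g., in fitting spectral energy distributions.
        """
        yy, xx = np.indices(image.shape)
        r = np.hypot(xx - x0, yy - y0)
        sky = np.median(image[(r >= r_in) & (r < r_out)])
        in_ap = r < r_ap
        flux = np.sum(image[in_ap] - sky)
        flux_err = sigma_noise * np.sqrt(in_ap.sum())
        return flux, flux / flux_err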

Once all of the data have been extracted from the image pixels by the survey pipeline software, they must be stored in some accessible way in order to facilitate scientific exploration. Simple user file systems and directories are not suitable for the really large data volumes produced by sky surveys. The transition to Terascale data sets in the 1990's necessitated the use of dedicated database software. Using a database system provides significant advantages (e.g., powerful and complex query expressions) combined with rapid data access. Fortunately, commercially available database systems can be adapted for astronomical uses. Relational databases accessed using the Structured Query Language (SQL) tend to dominate at this time, but different architectures may scale better to the much larger data volumes expected in the future.
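
The kind of query expressiveness that a database provides is illustrated below with a deliberately generic example (using the SQLite engine bundled with Python); the table and column names are purely illustrative and do not correspond to any particular survey schema.

    import sqlite3

    con = sqlite3.connect("survey.db")   # a local catalog database (placeholder)
    # Illustrative schema: a 'sources' table with positions, magnitudes in two
    # bands, and a star/galaxy flag produced by the processing pipeline.
    query = """
        SELECT id, ra, dec, mag_r, mag_g - mag_r AS color
        FROM sources
        WHERE is_galaxy = 1
          AND mag_r BETWEEN 17.0 AND 20.0
          AND mag_g - mag_r > 1.2
        ORDER BY mag_r
    """
    for row in con.execute(query):
        print(row)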

A good example of a survey archive is the SkyServer (Szalay, Gray, et al. 2001, 2002; http://skyserver.org/), which provides access to the data (photometry and spectra) for objects detected in the different SDSS data sets. It supports more than just positional searching: it offers the ability to pose arbitrary queries (expressed in SQL) against the data, so that, for example, one can find all merging galaxy pairs, or quasars with a broad absorption line and a nearby galaxy within 10 arcsec. Users can get their own work areas, so that query results can be saved, and files can be uploaded for use in queries (as user-supplied tables) against the SDSS data.

Currently, most significant surveys are stored in archives that are accessible through the Internet, using a variety of web service interfaces. Their interoperability is established through the Virtual Observatory framework. Enabling access to such survey archives via web services, and not just web pages, means that programs can be written to automatically analyze and explore vast amounts of data. Whole pipelines can be launched to coordinate and federate multiple queries against different archives, potentially taking hundreds of hours to automatically find the rarest species of objects. Of course, the utility of any such archive is only as good as the metadata provided, and the hardest task is often figuring out exactly how the same concept is represented in different archives, and reconciling those representations; for example, one archive might report a flux in a particular passband where another reports a magnitude.
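
As a sketch of such programmatic access, the VO simple cone search protocol lets a client retrieve all catalog entries within a given radius of a position by passing RA, DEC, and SR (all in decimal degrees) to a service URL and parsing the returned VOTable; the service endpoint in the example is a placeholder, and real ones are listed in VO registries.

    import io
    import requests
    from astropy.io.votable import parse_single_table

    def cone_search(service_url, ra_deg, dec_deg, radius_deg):
        """Query a VO simple cone search service and return an astropy Table."""
        params = {"RA": ra_deg, "DEC": dec_deg, "SR": radius_deg}
        response = requests.get(service_url, params=params, timeout=60)
        response.raise_for_status()
        votable = parse_single_table(io.BytesIO(response.content))
        return votable.to_table()

    # Example (placeholder endpoint):
    # table = cone_search("https://example.org/scs", 180.0, 2.5, 0.1)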

The Semantic Web is an emerging technology that can help solve these challenges (Antoniou & van Harmelen 2004). It is based on machine-processible descriptions of concepts, and it goes beyond simple term matching to express concept hierarchies, properties, and relationships that allow knowledge discovery. It is a way of encoding domain expertise (e.g., in astronomy) in a form that can be used by a machine. Ultimately, it may lead to data inferencing by artificial intelligence (AI) engines. For example, on discovering that a transient detection has no previous outburst history, is near a galaxy, and has a spectrum with silicon absorption but no hydrogen, a system could reason that it is likely to be a Type Ia supernova, that its progenitor was therefore a white dwarf, and so perform an appropriate archival search to find it.
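
The inference chain in this example can be caricatured as a simple rule, written here as plain Python; a real semantic-web system would instead encode the concepts and relationships in an ontology and let an inference engine derive the conclusion.

    def reason_about_transient(has_outburst_history, near_galaxy,
                               has_silicon_absorption, has_hydrogen):
        """A caricature of the Type Ia supernova inference described in the text."""
        if (not has_outburst_history and near_galaxy
                and has_silicon_absorption and not has_hydrogen):
            # Consistent with a Type Ia supernova, hence a white dwarf progenitor:
            # trigger an archival search for the progenitor system.
            return "likely SN Ia: search archives for the progenitor"
        return "inconclusive: gather more data"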

Cloud computing is an emerging paradigm that may well change the ways we approach data persistence, access, and exploration. Commodity computing brings economies of scale, and it effectively outsources a number of tedious tasks that characterize data-intensive science. It is possible that in the future most of our data, survey archives included, and data mining and exploration services for knowledge discovery, will reside in the Cloud.

Most of the modern survey data sets are so information-rich that a wide variety of different scientific studies can be done with the same data. Therein lies their scientific potential (Djorgovski et al. 1997b, 2001a, b, c, 2002, Babu & Djorgovski 2004, and many others). However, this requires some powerful, general tools for the exploration, visualization, and analysis of large survey data sets. Reviewing them is beyond the scope of this Chapter, but one recent example is the Data Mining and Exploration system (DAME; Brescia et al. 2010, 2012; http://dame.dsf.unina.it); see also the review by Ball & Brunner (2010). The newly emerging discipline of Astroinformatics may provide a research framework and environment that would foster the development of such tools.
