Digital Sky Surveys: Software Tools and Technologies

1.2. Survey Archive Software

Once all of the data has been extracted from the image pixels by the survey pipeline software, it must be persisted in order to facilitate scientific exploration. This archiving can take one of two forms: raw data access through the web or some suitable storage and distribution medium (e.g., CD-ROMs or data tapes), or a database system designed specifically to handle a given digital sky survey archive. The former technique, while considerably less expensive, is not well suited to any real analysis, especially for very large data volumes. A database system, on the other hand, provides significant advantages (e.g., powerful query expressions), limited primarily by how much a survey can afford. Currently, most surveys are accessible from the Internet, whether through a simple FTP site or a full-featured, web-accessible query engine.

While certainly flexible and generally easy to establish, web-accessible data is subject to limitations on bandwidth. Despite technological advances which promise to ease such concerns, the rapid growth in available data will continue to swamp available resources. To illustrate this bandwidth problem, for a typical building network (e.g., shared Ethernet at 10 Mbit/s), it would take more than nine days to sift through a terabyte of data (which is the current benchmark for large sky surveys), and this assumes that the full bandwidth is dedicated to the transfer. Even at fast SCSI speeds (e.g., 100 Mbit/s), it would take approximately one day merely to move the data of interest.
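The transfer-time estimates above follow from simple arithmetic, which the short Python sketch below reproduces. It assumes a decimal terabyte (10^12 bytes) and a link that is fully dedicated to the transfer with no protocol overhead; the function name transfer_time_days is introduced here purely for illustration.

```python
SECONDS_PER_DAY = 86_400
TERABYTE = 1e12  # bytes; decimal convention, an assumption of this sketch


def transfer_time_days(data_bytes: float, link_bits_per_s: float) -> float:
    """Days needed to move data_bytes over a fully dedicated link."""
    seconds = data_bytes * 8 / link_bits_per_s  # 8 bits per byte
    return seconds / SECONDS_PER_DAY


for label, rate in [("10 Mbit/s", 10e6), ("100 Mbit/s", 100e6)]:
    print(f"{label}: {transfer_time_days(TERABYTE, rate):.1f} days")
# 10 Mbit/s: 9.3 days
# 100 Mbit/s: 0.9 days
```

Real transfers would be slower still, since shared links and protocol overhead reduce the usable rate.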

Each archival center is presented with the challenging problem of storing and serving vast amounts of complex data. Currently, most of the software written for these applications is in either C++ or Java, while the data itself is transferred as ASCII, FITS, or XML. Recently, many of the major data centers have begun to share knowledge and expertise in order to simplify the development process, as well as to improve the overall efficacy of astronomical archives. This work is leading to standards which dictate how archives can communicate with each other, how archives can describe themselves (i.e., their metadata, or data that describes the data), how archives can transfer large amounts of dynamic information, and how sources in different archives can be cross-identified.
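As an illustration of what cross-identification involves, the following Python sketch matches sources from two catalogs purely by position. It is not a description of any standard named in the text: the nearest-neighbor rule, the 2 arcsecond match radius, and the (id, ra, dec) tuple layout are all assumptions made for this example, and the brute-force loop is only practical for small catalogs.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cross_match(cat_a, cat_b, radius_arcsec=2.0):
    """Map each id in cat_a to the nearest cat_b id within the radius, or None."""
    radius_deg = radius_arcsec / 3600.0
    matches = {}
    for id_a, ra_a, dec_a in cat_a:
        best_id, best_sep = None, radius_deg
        for id_b, ra_b, dec_b in cat_b:
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if sep <= best_sep:
                best_id, best_sep = id_b, sep
        matches[id_a] = best_id
    return matches


# Hypothetical catalogs: (identifier, RA in degrees, Dec in degrees)
catalog_a = [("a1", 10.0, 20.0), ("a2", 50.0, -30.0)]
catalog_b = [("b1", 10.0002, 20.0001), ("b2", 120.0, 5.0)]
matches = cross_match(catalog_a, catalog_b)
print(matches)  # {'a1': 'b1', 'a2': None}
```

A production archive would need spatial indexing and would also have to account for positional uncertainties, which this sketch ignores.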

In general, the data processing flow is from the pixel (image) domain to the catalog domain (detected sources with measured parameters). This usually results in a reduction of the data volume by about an order of magnitude (this factor varies considerably, depending on the survey or the data set), since most pixels do not contain statistically significant signal from resolved sources. However, the ability to store large amounts of digital image information on-line opens up interesting new possibilities, whereby one may want to go back to the pixels and remeasure fluxes or other parameters, on the basis of the catalog information. For example, if a source was detected (i.e., cataloged) in one bandpass, but not in another, it is worth checking if a marginal detection is present even if it did not make it past the statistical significance cut the first time; even the absence of flux is sometimes useful information!
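Going back to the pixels at a cataloged position can be sketched as a simple forced aperture measurement. The Python example below is a minimal illustration, not a survey pipeline algorithm: the function name, the aperture and sky-annulus radii, the median sky estimate, and the synthetic image are all assumptions chosen for this example.

```python
import math
from statistics import median


def forced_aperture_flux(image, x0, y0, r_ap=3.0, r_in=5.0, r_out=8.0):
    """Sky-subtracted flux summed in a circular aperture at a fixed position.

    image is a list of rows, indexed as image[y][x]. The sky level is the
    median of the pixels in the annulus r_in <= r <= r_out.
    """
    aperture_sum, n_aperture, sky_pixels = 0.0, 0, []
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            r = math.hypot(x - x0, y - y0)
            if r <= r_ap:
                aperture_sum += value
                n_aperture += 1
            elif r_in <= r <= r_out:
                sky_pixels.append(value)
    if not sky_pixels:
        raise ValueError("sky annulus contains no pixels")
    return aperture_sum - n_aperture * median(sky_pixels)


# Synthetic 21x21 image: flat sky of 10 counts plus a faint 2-count source.
image = [[10.0] * 21 for _ in range(21)]
image[10][10] += 2.0

flux = forced_aperture_flux(image, x0=10, y0=10)
print(flux)  # 2.0
```

A result consistent with zero is still worth recording, since it can serve as an upper limit in the bandpass where the source was not originally cataloged.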

The two most common types of database management systems used within astronomy are relational (where data is manipulated as tables) and object-based (where data is manipulated individually as objects). Each of the two approaches has working sites which demonstrate the technology in action (e.g., the 2MASS project uses Informix, while the GSCII project uses Objectivity). In general, relational systems offer more features, including third-party add-ons and powerful query mechanisms, owing to their dominant position in the business world. On the other hand, object-based systems have shown higher performance and better potential for scaling to extremely large data sets (e.g., CERN is developing an object-based persistence solution for multi-Petabyte archives).
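To make the idea of a query expression over tabular data concrete, the sketch below runs a positional and magnitude selection against a small in-memory table. SQLite (through Python's standard sqlite3 module) is used only as a convenient stand-in for a relational system; it is not one of the products named above, and the table name sources, its columns, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sources ("
    "id INTEGER PRIMARY KEY, ra REAL, dec REAL, mag REAL)"
)
# Hypothetical catalog rows: RA and Dec in degrees, magnitude in one bandpass.
conn.executemany(
    "INSERT INTO sources (ra, dec, mag) VALUES (?, ?, ?)",
    [
        (10.01, -5.02, 14.2),
        (10.05, -5.00, 17.9),
        (10.30, -4.70, 13.1),
        (45.00, 20.00, 12.0),
    ],
)

# Select sources brighter than magnitude 15 inside a small box on the sky.
rows = conn.execute(
    "SELECT id, ra, dec, mag FROM sources "
    "WHERE ra BETWEEN ? AND ? AND dec BETWEEN ? AND ? AND mag < ? "
    "ORDER BY mag",
    (10.0, 10.5, -5.1, -4.6, 15.0),
).fetchall()

for row in rows:
    print(row)
# (3, 10.3, -4.7, 13.1)
# (1, 10.01, -5.02, 14.2)
```

The same selection against flat files would require reading every record, which is the practical advantage a query engine offers over raw data access.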

Regardless of how the data is actually persisted, with the advent of the Internet, essentially all archives are Web accessible. This trend away from tabular or media-based archiving produces datasets which are distributed in nature, giving users and tools equal access to vast amounts of data which in the past were nearly impossible to query efficiently. As a result, astronomical data can now be used in a more democratic fashion, leading to uses which the survey institutions did not even imagine.