

A major new trend is emerging in observational astronomy with the production of huge, uniform, multivariate databases from specialized survey projects and telescopes (4). But these databases are heterogeneous in character, reside at widely dispersed locations, and are accessed through different database systems. Examples include:

  1. 10^8 - 10^9-object catalogs of stars and stellar extragalactic objects (i.e., quasars). These include the all-sky photographic optical USNO-B1 catalog, the all-sky near-infrared 2MASS catalog, and the wide-field Sloan Digital Sky Survey (SDSS). Five to ten photometric values, each with a heteroscedastic measurement error (i.e., different for each data point), are available for each object.

  2. 10^5 - 10^6-galaxy redshift catalogs from the 2-degree Field (2dF) and SDSS spectroscopic surveys. The main goal is characterization of the hierarchical, nonlinear and anisotropic clustering of galaxies in a 3-dimensional space. But the datasets also include spectra for each galaxy, each with 10^3 independent measurements.

  3. 10^5 - 10^6-source catalogs from various multiwavelength wide-field surveys such as the NRAO Very Large Array Sky Survey in one radio band, the InfraRed Astronomical Satellite Faint Source catalog in four infrared bands, the Hipparcos and Tycho catalogs of star distances and motions, and the X-ray Multimirror Mission Serendipitous Source Catalogue in several X-ray bands, now in progress. These catalogs are typically accompanied by large image libraries.

  4. 10^2 - 10^4-object samples of well-characterized pre-main sequence stars, binary stars, variable stars, pulsars, interstellar clouds and nebulae, nearby galaxies, active galactic nuclei, gamma-ray bursts and so forth. There are dozens of such samples with typically 10 - 20 catalogued properties and often with accompanying 1-, 2- or 3-dimensional images or spectra.

  5. Perhaps the most ambitious of such surveys is the planned Large-aperture Synoptic Survey Telescope (LSST), which will survey much of the entire optical sky every few nights. It is expected to generate raw databases in excess of 10 PB (petabytes) and catalogs with 10^10 entries.
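To make the heteroscedastic errors mentioned in item 1 concrete, the following minimal sketch combines repeated measurements of one quantity, each with its own error bar, using an inverse-variance weighted mean. This is a standard textbook estimator chosen here for illustration, not a method prescribed by any of the surveys above. The function name `weighted_mean` and the sample values are hypothetical.

```python
import math


def weighted_mean(values, sigmas):
    """Inverse-variance weighted mean and its standard error.

    values: measurements of the same quantity
    sigmas: 1-sigma error of each measurement (may differ per point)
    """
    if not values or len(values) != len(sigmas):
        raise ValueError("values and sigmas must be non-empty and equal in length")
    # Each point is weighted by 1 / sigma^2, so precise points count more.
    weights = [1.0 / (s * s) for s in sigmas]
    total = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / total
    # Standard error of the weighted mean, assuming independent Gaussian errors.
    return mean, math.sqrt(1.0 / total)


# Hypothetical repeated magnitude measurements with unequal errors.
mags = [15.2, 15.4, 15.0]
errs = [0.1, 0.2, 0.1]
mean, err = weighted_mean(mags, errs)
print(f"weighted mean = {mean:.3f} +/- {err:.3f}")
```

An unweighted mean would treat the noisier middle measurement as equal to the other two; the weighting reduces its influence by a factor of four.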

An international effort known as the Virtual Observatory (VO) is now underway to coordinate and federate these diverse databases, making them readily accessible to the scientific user [6, 29]. Considerable progress is being made in the establishment of the necessary data and metadata infrastructure and standards, interoperability issues, data mining, and technology demonstration prototype services (5). But scientific discovery requires more than effective recovery and distribution of information. After the astronomer obtains the data of interest, tools are needed to explore the datasets. How do we identify correlations and anomalies within the datasets? How do we classify the sources to isolate subpopulations of astrophysical interest? How do we use the data to constrain astrophysical interpretations, which often involve highly non-linear parametric functions derived from fields such as physical cosmology, stellar structure or atomic physics? These questions lie under the aegis of statistics.

A particular problem relevant to statistical computing is that, while the speed of CPUs and the capacity of inexpensive hard disks rise rapidly, computer memory capacities grow at a slower pace. Combining the largest optical/near-infrared object catalogs today produces a table with > 1 billion objects and up to a dozen columns of photometric data. Such large datasets effectively preclude use of all standard multivariate statistical packages and visualization tools (e.g., R and GGobi), which are generally designed to place the entire database into computer memory. Even sorting the data to produce quantiles may be computationally infeasible.
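As a rough illustration of the scale, 10^9 rows with 12 columns stored as 8-byte floats (an assumed storage format) would occupy about 96 GB. One common way around the sorting problem is to estimate quantiles from a fixed-size histogram built in a single pass over the data, so that memory use depends on the number of bins and not on the number of objects. The sketch below assumes the value range is known in advance and that the data arrive in chunks; the function names and the synthetic data are hypothetical, and this is only one of several possible streaming approaches.

```python
def streaming_histogram(chunks, lo, hi, nbins=1000):
    """Accumulate bin counts over an iterable of chunks in one pass."""
    counts = [0] * nbins
    width = (hi - lo) / nbins
    for chunk in chunks:
        for x in chunk:
            if lo <= x <= hi:  # values outside [lo, hi] are ignored
                i = min(int((x - lo) / width), nbins - 1)
                counts[i] += 1
    return counts


def approx_quantile(counts, lo, hi, q):
    """Approximate the q-th quantile (0 <= q <= 1) from bin counts.

    The result is accurate to roughly one bin width.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram is empty")
    width = (hi - lo) / len(counts)
    target = q * total
    cum = 0
    for i, c in enumerate(counts):
        if c > 0 and cum + c >= target:
            # Interpolate linearly inside the bin that crosses the target.
            frac = (target - cum) / c
            return lo + (i + frac) * width
        cum += c
    return hi


def demo_chunks(n=100_000, chunk_size=1_000):
    """Synthetic stand-in for reading a large catalog column in pieces."""
    for start in range(0, n, chunk_size):
        yield [i / 1000.0 for i in range(start, start + chunk_size)]


counts = streaming_histogram(demo_chunks(), 0.0, 100.0)
median = approx_quantile(counts, 0.0, 100.0, 0.5)
print(f"approximate median = {median:.2f}")
```

The trade-off is precision: the estimate cannot resolve structure finer than the bin width, so the number of bins should be chosen with the required accuracy in mind.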

The Virtual Observatory of the 21st century thus presents new challenges to statistical capability in two ways. First, some new methodological developments are needed (Section 5). Second, efficient access to both new and well-established statistical methods is needed. No single existing software package can provide the vast range of needed methods. We are now involved in developing a prototype system called VOStat to provide statistical capabilities to the VO astronomer. It is based on concepts of Web services and distributed Grid computing. Here, the statistical software and computational resources, as well as the underlying empirical databases, may have heterogeneous structures and can reside at distant locations.

4 An enormous collection of catalogs, and some of the underlying imaging and spectral databases, is already available on-line. Access to many catalogs is provided by Vizier. The NASA Extragalactic Database (NED), the SIMBAD stellar database, and ADS (footnote 2) give integrated access to many catalogs and bibliographic information. Raw data are available from all U.S. space-based observatories; see, for example, the Multi-mission Archive at Space Telescope (MAST) and the High Energy Astrophysics Science Archive Research Center (HEASARC).

5 See and / for entry into Virtual Observatory projects.
