Knowledge discovery in databases (KDD) refers to the complex process of applying data mining (e.g., pattern finding) and modern statistical analysis techniques to extract knowledge from large databases and image archives. The first wave of results in a number of areas of scientific research and business has made it increasingly clear that in order to discover something fundamentally new, and to adequately handle encounters with various "mine fields," it is essential to incorporate prior domain knowledge into the KDD process [8]. As outlined above, new and emerging capabilities of NED provide a valuable resource for incorporating prior knowledge into future KDD applications in astronomy. For example, using the present HTML output or the future XML server, one could write a client program that extracts information from NED's panchromatic SEDs to fold into automated algorithms that search for different known types of extragalactic objects in new survey data. Likewise, a thorough comparison of new observations with images, SEDs, high-resolution spectra, or literature resources available in NED (or distributed archive entries linked with NED by object names and coordinates) can prevent pitfalls such as a false claim of a new class of extragalactic object. As a third example, the NED image archive presents an unprecedented resource for developing and testing advanced algorithms to tackle the differing resolutions, pixel scales, and calibration uncertainties in multi-wavelength images that must be confronted to make novel discoveries. NED's image archive already provides hundreds of objects that have images spanning ultraviolet, visual, near-infrared, and radio wavelengths. In addition, construction of intelligent Web "agent" programs that can traverse the External Links from NED object queries and extract relevant information in an automated fashion could expand the possibilities for discovery in innumerable ways.
5.1. Fusion and Classification Using Large Astronomical Databases
Since NED currently serves a large fraction of the world-wide extragalactic research community with fused, multi-wavelength data for millions of objects, it provides a unique opportunity to introduce many astronomers to new analysis tools and protocols developed by the VO community. The planned upgrades discussed above will enable the use of NED data streams containing multi-wavelength, multi-dimensional data such as SEDs and object classifications (with pointers to additional, distributed data) in extragalactic data mining applications. NED can effectively serve this role because the database contains 10-50 attributes (positions, redshifts, multi-wavelength photometric measurements, and object classifications) for millions of extragalactic objects. Using NED as a test-bed for data mining algorithms is a logical initial step for more ambitious VO efforts planning to eventually handle hundreds of attributes in catalogs containing 10^8 - 10^9 objects or more, which is common when all types of astrophysical sources are blended together. (Most survey catalogs initially contain stars, Galactic nebulae, galaxies, QSOs, asteroids, etc., until they are classified and extracted into specialized lists.)
High dimensionality presents great challenges to effectively visualizing, summarizing, and extracting new information from large databases. Data fusion across large databases is fraught with practical problems: compiling a complete sample across many wavelengths is difficult; source cross-identifications are non-trivial to get right; there are problems of duplicate observations, contamination, confusion, etc.; different coding schemes for missing data must be made uniform; effects of sampling biases must be considered; mixed data types include quantitative (continuous), categorical (nominal), and binary (e.g., quality flags, codes); ignoring or improperly treating non-detections, upper limits, and different flux limits (censored data) in combined surveys can lead to biases; observed scatter can arise from intrinsic variation among astrophysical sources, from measurement uncertainties, or from both; most classical multivariate statistical methods do not handle explicit data uncertainties. Nevertheless, if we are to live up to the challenges of making discoveries from fused VO archives, these and the problems of scale will have to be confronted and solved.
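Two of the problems listed above, non-uniform missing-data codes and upper limits, can be illustrated with a minimal Python sketch. The sentinel strings in MISSING_CODES and the convention that a leading "<" marks an upper limit are assumptions made purely for illustration; they are not taken from NED or from any particular survey catalog.

```python
# Hypothetical sentinels that different catalogs might use for "no measurement".
MISSING_CODES = {"", "-99", "-999", "99.99", "null", "nan", "..."}


def parse_flux(raw):
    """Parse one catalog flux entry into (value, is_upper_limit).

    value is None when the entry is one of the assumed missing-data codes.
    is_upper_limit is True when the entry carries the assumed "<" prefix,
    so that censored values stay flagged instead of being dropped or
    silently treated as detections.
    """
    text = raw.strip()
    if text.lower() in MISSING_CODES:
        return None, False
    is_limit = text.startswith("<")
    if is_limit:
        text = text[1:].strip()
    return float(text), is_limit


# Entries as they might appear in three differently coded catalogs.
for entry in ["12.3", " -99 ", "<0.5"]:
    print(repr(entry), "->", parse_flux(entry))
```

Keeping the upper-limit flag alongside the value, rather than discarding the row, is what allows a later analysis to apply methods designed for censored data.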
Here we briefly summarize an ongoing project to exploit NED and its interconnected archive resources to aid the process of automated, large-scale galaxy classification. A representative problem is that among approximately a half-million objects in the Second Incremental Release of the 2MASS Extended Source Catalog, only about 10% were previously known objects (established using NED, prior to loading the 2MASS sources into the database), and only a small number of known NED objects with 2MASS cross-identifications have available morphological or spectral classifications from the historical catalogs and literature. A common goal of many "data miners" in astronomy is to discover large numbers of new cases of previously known types of objects, in order to perform statistical analyses that in previous investigations have typically suffered from small-number statistics and severe selection biases. Another "Holy Grail" is the potential to discover a new class of objects, those rare nuggets that may teach us something fundamentally new about the contents, structure, or evolution of the Universe. Clearly, a classical approach, even one involving manual, interactive queries of NED and connected online archives, is impractical for classifying all the previously known objects in 2MASS (or the SDSS). A more automated approach is needed.
An outline of the steps required for this pilot project is as follows: (1) construct cross-identifications between near-infrared sources in 2MASS, visual sources in SDSS, and radio sources in the NVSS and FIRST surveys; (2) fuse the cross-correlated survey data with source classifications and other available data in NED: morphological types, nuclear activity types (starburst/HII, LINER, Sy 2, Sy 1, QSO, etc.), magnitudes, redshifts; (3) comprehensively summarize correlations, dominant variables, and clusters in the resulting high-dimensionality (N > 20), multi-wavelength data matrix; (4) produce a training set for machine learning classifiers (decision trees and neural nets); (5) derive provisional classifications (predictors) for the sources that lack any historical data. These results will guide follow-up observations by providing candidate lists of known classes of extragalactic objects, candidates for possible previously unknown classes of objects, and rare objects (outliers) revealed in the multivariate analysis. The results will be published in summary form and be made available in bulk on the Internet as a resource for other investigators.
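Step (1) can be sketched as a simple positional match. The following Python example is only an illustration, not the project's actual procedure: the 3 arcsecond match radius, the catalog layout (a dictionary mapping source name to RA and Dec in degrees), and the source names are all assumptions. It uses a brute-force nearest-neighbour search with the haversine formula for angular separation.

```python
import math


def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Angular separation in degrees between two sky positions (haversine)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a))))


def cross_match(cat_a, cat_b, radius_arcsec=3.0):
    """For each source in cat_a, return the nearest cat_b source within the
    match radius, or None if there is no counterpart that close.

    The 3 arcsec default is an illustrative assumption, not a survey value.
    """
    radius_deg = radius_arcsec / 3600.0
    matches = {}
    for name_a, (ra_a, dec_a) in cat_a.items():
        best_name, best_sep = None, radius_deg
        for name_b, (ra_b, dec_b) in cat_b.items():
            sep = angular_sep_deg(ra_a, dec_a, ra_b, dec_b)
            if sep <= best_sep:
                best_name, best_sep = name_b, sep
        matches[name_a] = best_name
    return matches


# Made-up positions standing in for a near-infrared and a radio catalog.
nir = {"NIR-1": (10.0, 20.0), "NIR-2": (150.0, -5.0)}
radio = {"RAD-1": (10.0002, 20.0001), "RAD-2": (200.0, 45.0)}
print(cross_match(nir, radio))
```

The double loop scales as the product of the catalog sizes, so a search over catalogs of the size discussed here would need a spatial index; a purely positional match also ignores the confusion and differing-resolution problems noted earlier, which is why cross-identification is non-trivial in practice.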