In this review, we have introduced data mining in astronomy, given an overview of its implementation in the form of knowledge discovery in databases, reviewed its application to various science problems, and discussed its future. Throughout, we have tried to emphasize data mining as a tool to enable improved science, not as an end in itself, and to highlight areas where improvements have been made over previous analyses, areas where they might yet be made, and the limitations of this approach.
An astronomer is not a cutting-edge expert in data mining algorithms any more than they are in statistics, databases, hardware, software, etc., but they will need to know enough to usefully apply such approaches to the science problem they wish to address. It is likely that such progress will be made via collaboration with people who are experts in these areas, particularly within large projects, which will employ specialists and have working groups dedicated to data mining. Fully implemented, commercial-level databases will be required, since the data will be too large to organize, download, or analyze in any other way.
The available infrastructure should, therefore, be designed so that this data mining approach to research is maximally enabled. The raw or minimally processed data should be made available in such a way that one can apply user-specific codes, either locally or, if data size necessitates it, using computational resources local to the data. It is unlikely that most researchers will either require or trust the exact resources made available by higher-level tools. Such tools will be useful for exploratory work, but ultimately one must be able to run personal or trusted code on the data, from the level of re-reduction upwards.
A problem arises when one wishes to utilize multiple or distributed datasets, for example in cross-matching data for multi-wavelength studies. This problem is particularly acute when large datasets are held at widely separated sites, because transfer of such data across the network is currently impractical. Datasets should therefore be made available in a form that can easily be made interoperable via a standard storage schema, so that a user can bring computing power and algorithms to bear on their particular science question. A great deal of science is done on small subsets of the full data, so data will still be frequently downloaded and analyzed locally, but the paradigm of downloading entire datasets is not sustainable.
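As a concrete illustration of the cross-matching mentioned above, the following is a minimal sketch of a positional cross-match between two small catalogs, assuming each catalog is a simple mapping from a source identifier to (RA, Dec) in degrees. The function names, the catalog layout, and the 1 arcsecond default match radius are all assumptions made for illustration, not a description of any particular survey's tools. The brute-force loop compares every pair of sources, which is only practical for the small subsets discussed here; catalogs at survey scale would need spatial indexing and computation close to the data.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula).

    All inputs are in degrees.
    """
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    # Clamp to 1.0 to guard against rounding just above the asin domain.
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a))))


def cross_match(catalog_a, catalog_b, radius_arcsec=1.0):
    """Nearest-neighbor positional cross-match (hypothetical helper).

    For each source in catalog_a, return the identifier of the closest
    source in catalog_b within radius_arcsec, or None if there is none.
    Catalogs are dicts of the form {source_id: (ra_deg, dec_deg)}.
    """
    radius_deg = radius_arcsec / 3600.0
    matches = {}
    for id_a, (ra_a, dec_a) in catalog_a.items():
        best_id, best_sep = None, radius_deg
        for id_b, (ra_b, dec_b) in catalog_b.items():
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if sep <= best_sep:
                best_id, best_sep = id_b, sep
        matches[id_a] = best_id
    return matches


# Illustrative, made-up positions for two small catalogs.
optical = {"opt1": (150.0, 2.0), "opt2": (150.1, 2.1)}
infrared = {"ir1": (150.0001, 2.0001), "ir2": (151.0, 3.0)}
matches = cross_match(optical, infrared, radius_arcsec=1.0)
```

In this sketch, a source with no counterpart inside the match radius maps to None, so unmatched objects stay visible in the result rather than being silently dropped.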
We thank the referee for a useful and comprehensive report.
The authors acknowledge support from NASA through grants NN6066H156 and NNG06GF89G, from Microsoft Research, and from the University of Illinois.
The authors made extensive use of the storage and computing facilities at the National Center for Supercomputing Applications and thank the technical staff for their assistance in enabling this work.
This research has made use of the SAO/NASA Astrophysics Data System.