Published in the International Journal of Modern Physics D, Volume 19, Issue 07, pp. 1049-1106 (2010).
astro-ph/0906.2173

DATA MINING AND MACHINE LEARNING IN ASTRONOMY

Nicholas M. Ball


Herzberg Institute of Astrophysics, National Research Council, 5017 West Saanich Road, Victoria, BC V9E 2E7, Canada

Robert J. Brunner


Department of Astronomy, University of Illinois at Urbana-Champaign, 1002 West Green Street, Urbana, IL 61801, USA


Abstract: We review the current state of data mining and machine learning in astronomy. Data mining can have a somewhat mixed connotation for a researcher in this field. Used correctly, it can be a powerful approach, with the potential to fully exploit the exponentially increasing amount of available data and to yield great scientific advance. Misused, however, it can be little more than the black-box application of complex computing algorithms that may give little physical insight and provide questionable results. Here, we give an overview of the entire data mining process, from data collection through to the interpretation of results. We cover common machine learning algorithms, such as artificial neural networks and support vector machines; applications from a broad range of astronomy, emphasizing those where data mining techniques directly resulted in improved science; and important current and future directions, including probability density functions, parallel algorithms, petascale computing, and the time domain. We conclude that, so long as one carefully selects an appropriate algorithm and is guided by the astronomical problem at hand, data mining can be very much the powerful tool and not the questionable black box.


Table of Contents

INTRODUCTION
Why Data Mining?

OVERVIEW OF DATA MINING AND MACHINE LEARNING METHODS
Data Collection
Preprocessing of Data
Attribute Selection
Selection and Use of Machine Learning Algorithms
Improving Results
Application of Algorithms and Some Limitations

USES IN ASTRONOMY
Object Classification
Photometric Redshifts
Other Astrophysical Applications

THE FUTURE
Probability Density Functions
Real-Time Processing and the Time Domain
Petascale Computing
Parallel and Distributed Data Mining
The Virtual Observatory
Visualization
Novel Supercomputing Hardware

CONCLUSIONS

REFERENCES
