Data Mining and Machine Learning in Astronomy - Nicholas M. Ball & Robert J. Brunner

1. INTRODUCTION

In its broadest sense, data mining is simply the act of turning raw data from an observation into useful information. This information can be interpreted by hypothesis or theory, and used to make further predictions. This scientific method, where useful statements are made about the world, has been widely employed to great effect in the West since the Renaissance, and even earlier in other parts of the world. What has changed in the past few decades is the exponential rise in available computing power, and, as a related consequence, the enormous quantities of observed data, primarily in digital form. The exponential rise in the amount of available data is now creating, in addition to the natural world, a digital world, in which extracting new and useful information from the data already taken and archived is becoming a major endeavor in itself. This process of knowledge discovery in databases (KDD) is what the phrase data mining most commonly implies, and it forms the basis for our review.

Astronomy has been among the first scientific disciplines to experience this flood of data. The emergence of data mining within this and other subjects has been described [1, 2, 3] as the fourth paradigm. The first two paradigms are the well-known pair of theory and observation, while the third is another relatively recent addition, computer simulation. The sheer volume of data not only necessitates this new paradigmatic approach, but also requires that the approach be, to a large extent, automated. In more formal terms, we wish to leverage a computational machine to find patterns in digital data, and translate these patterns into useful information, hence machine learning. This learning must be returned in a useful manner to a human investigator, which hopefully results in human learning.

It is perhaps not entirely unfair to say, however, that scientists in general do not yet appreciate the full potential of this fourth paradigm. There are good reasons for this, of course: scientists are generally not experts in databases, or cutting-edge branches of statistics, or computer hardware, and so forth. What we hope to do in this review, primarily for the data mining skeptic, is to shed light on why this is a useful approach. To accomplish this goal, we emphasize both algorithms that have been, or could currently be, usefully employed, and the actual scientific results they have enabled. We also hope to give an interesting and fairly comprehensive overview to those who do already appreciate this approach, and perhaps provide inspiration for exciting new ideas and applications. However, despite referring to data mining as a whole new paradigm, we try to emphasize that it is, like theory, observation, and simulation, only a part of the broader scientific process, and should be viewed and utilized as such. The algorithms described are tools that, when applied correctly, have vast potential for the creation of useful scientific results. But, given that data mining is only part of the process, it is, of course, not the answer to everything, and we therefore enumerate some of the limitations of this new paradigm.

We start in Section 1.1 with a summary of some of the advantages of this approach. In Section 2, we summarize the process from the input of raw data to the visualization of results. This is followed in Section 3 by the actual application of data mining tools in astronomy. Section 2 is arranged algorithmically, and Section 3 is arranged astrophysically. It is likely that the expert in astronomy or data mining, respectively, could infer much of Section 3 from Section 2, and vice versa. But we hope that the combination of the two sections still offers new ideas or insights to either audience. Following these two sections, in Section 4, we combine the lessons learned to discuss the future of data mining in astronomy, pointing out likely near-term directions in both the data mining process and its physical application. We conclude with a summary of the main points in Section 5.

1.1. Why Data Mining?

Of course, what astronomers care about is not a fashionable new computational method for ever more complex data analysis, but the science. A fancy new data mining system is not worth much if all it tells you is what you could have gained by the judicious application of existing tools and a little physical insight [4]. We therefore summarize some of the advantages of this approach:

Getting anything at all: upcoming datasets will be almost overwhelmingly large. When one is faced with petabytes of data, a rigorous, automated approach that intelligently extracts pertinent scientific information will be the only one that is tractable.

Simplicity: despite the apparent plethora of methods, straightforward applications of very well-known and well-tested data mining algorithms can quickly produce a useful result. These methods can generate a model appropriate to the complexity of an input dataset, including nonlinearities, implicit prior information, systematic biases, or unexpected patterns. With this approach, a priori data sampling of the type exemplified by elaborate color cuts is not necessary. For many algorithms, new data can be trivially incorporated as they become available.

Prior information: this can be either fully incorporated, or the data can be allowed to completely 'speak for themselves'. For example, an unsupervised clustering algorithm can highlight new classes of objects within a dataset that might be missed if a prior set of classifications were imposed.

Pattern recognition: an appropriate algorithm can highlight patterns in a dataset that might not otherwise be noticed by a human investigator, perhaps due to the high dimensionality. Similarly, rare or unusual objects can be highlighted.

Complementary approach: there are numerous examples where the data mining approach demonstrably exceeds more traditional methods in terms of scientific return. But even when the approach does not produce a substantial improvement, it still acts as an important complementary method of analyzing data, because different approaches to an overall problem help to mitigate systematic errors in any one approach.
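
As a concrete illustration of the unsupervised clustering mentioned under 'Prior information' above, the following is a minimal sketch in Python, not a method taken from this review. It assumes a hypothetical catalogue of objects described by two colors, drawn from two synthetic populations, and uses plain k-means (Lloyd's algorithm) as one well-known example of such an algorithm. The function name kmeans, the catalogue values, and the choice of k = 2 are all assumptions made for illustration only.

```python
import random


def kmeans(points, k, n_iter=50, seed=0):
    """Plain k-means (Lloyd's algorithm) on a list of equal-length tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialise from k data points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centre.
        new_labels = [
            min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])),
            )
            for p in points
        ]
        # Update step: each centre moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # keep the old centre if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centers


# Hypothetical two-color catalogue: two synthetic populations, with no
# class labels supplied to the algorithm.
rng = random.Random(42)
pop_a = [(rng.gauss(0.3, 0.05), rng.gauss(0.2, 0.05)) for _ in range(100)]
pop_b = [(rng.gauss(1.2, 0.05), rng.gauss(0.9, 0.05)) for _ in range(100)]
catalogue = pop_a + pop_b

labels, centers = kmeans(catalogue, k=2)
print("cluster centres:", centers)
```

In this toy case the two populations are well separated, so the algorithm recovers them without being told that two classes exist. With real data, the number of clusters and the choice of input features are themselves decisions that require care.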