Machine Learning in Astronomy: a practical overview

4. SUMMARY

In recent years, machine learning algorithms have gained increasing popularity in Astronomy, and have been used for a wide variety of tasks. In this document I summarized some of the popular machine learning algorithms and their application to astronomical datasets. I reviewed basic topics in supervised learning, in particular the selection and preprocessing of the input dataset and the evaluation metrics of supervised algorithms, and briefly described three popular algorithms: SVM, Decision Trees and Random Forest, and shallow Artificial Neural Networks. I mainly focused on unsupervised learning techniques, which can be roughly divided into clustering analysis, dimensionality reduction, and outlier detection. The most popular application of machine learning in Astronomy is in its supervised setting, where a machine is trained to perform classification or regression according to previously acquired scientific knowledge. While less popular in Astronomy, unsupervised learning algorithms can be used to mine our datasets for novel information, and potentially enable new discoveries. In section 4.1 I list a number of open questions and issues related to the application of machine learning algorithms in Astronomy. Then, in section 4.2, I refer the reader to textbooks and online courses that give a more extensive overview of the subject.

4.1. Open Questions

The main issues in applying supervised learning algorithms to astronomical datasets include uncertainty treatment, knowledge transfer, and interpretability of the resulting models. As noted in section 3.1.1, most supervised learning algorithms are not constructed for astronomical datasets: they implicitly assume that all measured features are of the same quality, and that the provided labels can be considered as ground truth. However, astronomical datasets are noisy and have gaps, and in many cases the labels provided by human experts suffer from some level of ambiguity. As a result, supervised learning algorithms perform well mainly when applied to high signal-to-noise ratio datasets, or to datasets with uniform noise properties. Their performance depends strongly on the noise characteristics of the objects in the sample, and as such, an algorithm that was trained on a dataset with particular noise characteristics will often fail to generalize to a similar dataset with different noise characteristics. It is therefore necessary to adapt existing tools and to develop new algorithms that take into account the uncertainties in the dataset during the model construction. Furthermore, such algorithms should provide prediction uncertainties that are based on both the intrinsic properties of the objects in the sample and their measurement uncertainties.
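As an illustration of how measurement uncertainties can be turned into prediction uncertainties, the following minimal sketch uses Monte Carlo resampling: each measured feature is redrawn many times within its quoted error, and the spread of the resulting predictions is taken as the prediction uncertainty. This is only one simple, generic approach, not a method proposed in this document. The synthetic dataset, the linear least-squares model, and the names `predict` and `predict_with_uncertainty` are all assumptions made for the example, and the sketch captures only the uncertainty due to measurement errors on the input features, assuming they are Gaussian and independent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set (assumption): the target depends linearly on two features.
X_train = rng.normal(size=(500, 2))
y_train = 2.0 * X_train[:, 0] - 1.0 * X_train[:, 1] + rng.normal(scale=0.1, size=500)

# Fit an ordinary least-squares model with an intercept term.
A_train = np.column_stack([X_train, np.ones(len(X_train))])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)


def predict(X):
    """Apply the fitted linear model to an (n_objects, 2) feature array."""
    return np.column_stack([X, np.ones(len(X))]) @ coef


def predict_with_uncertainty(x, x_err, n_draws=2000):
    """Redraw the features within their (Gaussian) errors and predict each draw."""
    draws = rng.normal(loc=x, scale=x_err, size=(n_draws, len(x)))
    preds = predict(draws)
    return preds.mean(), preds.std(ddof=1)


# The same object, measured once at high and once at low signal-to-noise ratio.
x_obs = np.array([1.0, 0.5])
mean_hi_snr, std_hi_snr = predict_with_uncertainty(x_obs, np.array([0.01, 0.01]))
mean_lo_snr, std_lo_snr = predict_with_uncertainty(x_obs, np.array([0.5, 0.5]))

print(f"high S/N: {mean_hi_snr:.2f} +/- {std_hi_snr:.2f}")
print(f"low  S/N: {mean_lo_snr:.2f} +/- {std_lo_snr:.2f}")
```

The low signal-to-noise measurement yields a much wider spread of predictions for the same object, which is the kind of per-object uncertainty that a point prediction alone does not convey.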

The second challenge in applying supervised learning algorithms to astronomical datasets is related to knowledge transfer. That is, an algorithm that is trained on a particular survey, with a particular instrument, cadence, and object targeting selection, will usually fail to generalize to a different survey with different characteristics, even if the intrinsic properties of the objects observed by the two surveys are similar. As a result, machine learning algorithms are typically applied to concluded surveys, and rarely applied to ongoing surveys that have not yet collected enough labeled data. The topic of knowledge transfer is of particular importance when searching for rare phenomena, such as gravitational lenses in galaxy images, where supervised learning algorithms that are trained on simulated data cannot generalize well to real datasets. This challenge can be addressed with transfer learning techniques. While such techniques are discussed in the computer science literature, they are seldom applied in Astronomy.
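The effect of a survey-dependent shift, and one very simple way to reduce it, can be sketched with synthetic data. In the example below, two hypothetical surveys observe objects with the same intrinsic properties, but the second survey records the features with a different scale and zero-point. A nearest-centroid classifier trained on the first survey is applied to the second, with and without standardizing the features of each survey separately. Per-survey standardization is used here only as a basic form of domain adaptation. It assumes that the difference between the surveys is a simple rescaling and offset and that both surveys contain the same mix of classes, and it is not a substitute for the transfer learning techniques discussed in the computer science literature. All names and numerical values are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)


def make_survey(n_per_class, scale=1.0, offset=0.0):
    """Two classes with identical intrinsic properties in every survey.

    `scale` and `offset` mimic a hypothetical instrument-dependent calibration.
    """
    intrinsic = np.vstack([
        rng.normal(loc=0.0, scale=1.0, size=(n_per_class, 2)),  # class 0
        rng.normal(loc=3.0, scale=1.0, size=(n_per_class, 2)),  # class 1
    ])
    labels = np.repeat([0, 1], n_per_class)
    return scale * intrinsic + offset, labels


def standardize(X):
    """Shift and rescale each feature to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


def fit_centroids(X, y):
    """Nearest-centroid 'training': store the mean feature vector of each class."""
    return np.array([X[y == k].mean(axis=0) for k in (0, 1)])


def predict(centroids, X):
    """Assign each object to the class with the closest centroid."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


X_a, y_a = make_survey(1000)                        # labeled survey
X_b, y_b = make_survey(1000, scale=1.5, offset=2.0)  # labels used for scoring only

# Train on survey A and apply directly to survey B.
acc_raw = (predict(fit_centroids(X_a, y_a), X_b) == y_b).mean()

# Standardize each survey separately before training and predicting.
centroids_std = fit_centroids(standardize(X_a), y_a)
acc_aligned = (predict(centroids_std, standardize(X_b)) == y_b).mean()

print(f"accuracy on survey B, raw features:          {acc_raw:.2f}")
print(f"accuracy on survey B, standardized features: {acc_aligned:.2f}")
```

The labels of the second survey are used only to score the predictions, so the alignment step itself requires no labeled data from the new survey.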

The third challenge in applying supervised learning algorithms to astronomical datasets is related to the interpretation of the resulting models. While supervised learning algorithms offer an extremely flexible and general framework to construct complex decision functions, and can thus outperform traditional algorithms in classification and regression tasks, the resulting models are often difficult to interpret. That is, we do not always understand what the model learned, and why it makes the decisions that it makes. As scientists, we usually wish to understand the constructed model and the decision process, since this information can teach us something new about the underlying physics. This challenge is of particular importance for state-of-the-art deep learning techniques, which were shown to perform exceptionally well in a variety of tasks. As we continue to develop new complex tools to perform classification and regression, it is important to devise methods to interpret their results as well.
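One generic, model-agnostic way to probe what a trained model relies on is permutation importance: the values of a single feature are shuffled in a held-out set, and the resulting increase in the prediction error indicates how much the model depends on that feature. The minimal sketch below applies this idea to a linear least-squares model fitted to synthetic data. The dataset, the model, and all names are assumptions made for the illustration, and the technique is shown as one example of an interpretation method, not as a complete answer to the challenge described above.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data (assumption): the target depends strongly on feature 0,
# weakly on feature 1, and not at all on feature 2.
n = 2000
X = rng.normal(size=(n, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)

X_train, X_test = X[:1000], X[1000:]
y_train, y_test = y[:1000], y[1000:]


def design(X):
    """Append a column of ones so that the model includes an intercept."""
    return np.column_stack([X, np.ones(len(X))])


# Stand-in for any trained regression model.
coef, *_ = np.linalg.lstsq(design(X_train), y_train, rcond=None)


def mse(X, y):
    """Mean squared error of the fitted model on the given set."""
    return np.mean((design(X) @ coef - y) ** 2)


baseline = mse(X_test, y_test)
importance = np.empty(X.shape[1])
for j in range(X.shape[1]):
    X_perm = X_test.copy()
    # Shuffling one column breaks its relation to the target but keeps its distribution.
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    importance[j] = mse(X_perm, y_test) - baseline

for j, value in enumerate(importance):
    print(f"feature {j}: increase in MSE = {value:.3f}")
```

The ranking of the features recovers the structure that was built into the synthetic data. For real datasets, correlated features can share or mask importance, so such scores should be read with care.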

When applying unsupervised learning algorithms to astronomical datasets, the main challenges include the interpretation of the results and the comparison of different unsupervised learning algorithms. Unsupervised learning algorithms often optimize some internal cost function, which does not necessarily coincide with our scientific motivation, and since these algorithms are not trained according to some definition of "ground truth", their results might lead to erroneous interpretations of trends and patterns in our datasets. Many of the state-of-the-art algorithms are modular, thus allowing us to define a cost function that is more appropriate for the task at hand. It is therefore necessary to formulate cost functions that better match our scientific goals. To interpret the results of an unsupervised learning algorithm and to compare different algorithms, we still rely on domain knowledge, and the process cannot be completely automated. To improve the process of interpreting the results, we must improve the machine-human interface through which discoveries are made, e.g., by constructing visualization tools that incorporate the post-processing routines that are typically carried out after applying unsupervised learning algorithms. Finally, as we continue to apply unsupervised learning algorithms to astronomical datasets, it is necessary to construct evaluation metrics that can be used to compare the outputs of different algorithms.
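As a small example of what such a comparison metric can look like, the sketch below implements the adjusted Rand index, an existing measure of the agreement between two partitions of the same objects. It is invariant to a relabeling of the clusters, equals 1 for identical partitions, and is close to 0 on average for random ones. The function name `adjusted_rand_index` and the toy label arrays are assumptions made for the illustration. Note that the index only quantifies how similar two clustering outputs are; it does not indicate which of them is more useful scientifically.

```python
import numpy as np


def adjusted_rand_index(labels_a, labels_b):
    """Agreement between two cluster assignments of the same objects."""
    labels_a = np.asarray(labels_a).ravel()
    labels_b = np.asarray(labels_b).ravel()

    # Contingency table: number of objects shared by each pair of clusters.
    _, a_idx = np.unique(labels_a, return_inverse=True)
    _, b_idx = np.unique(labels_b, return_inverse=True)
    contingency = np.zeros((a_idx.max() + 1, b_idx.max() + 1), dtype=np.int64)
    np.add.at(contingency, (a_idx, b_idx), 1)

    def pairs(counts):
        """Number of object pairs that can be drawn from groups of these sizes."""
        return counts * (counts - 1) / 2.0

    sum_cells = pairs(contingency).sum()
    sum_rows = pairs(contingency.sum(axis=1)).sum()
    sum_cols = pairs(contingency.sum(axis=0)).sum()
    total = pairs(len(labels_a))

    expected = sum_rows * sum_cols / total  # agreement expected by chance
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical outputs of three clustering runs on the same six objects.
run_1 = [0, 0, 1, 1, 2, 2]
run_2 = [1, 1, 0, 0, 2, 2]  # same partition as run_1, different label names
run_3 = [0, 1, 2, 0, 1, 2]  # a different partition

ari_same = adjusted_rand_index(run_1, run_2)
ari_diff = adjusted_rand_index(run_1, run_3)
print(ari_same, ari_diff)
```

Because the index depends only on which objects are grouped together, the first two runs are treated as identical even though their cluster labels differ.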

4.2. Further Reading

To learn more about the basics of machine learning algorithms, I recommend the publicly available machine learning course on Coursera¹⁹. For in-depth reading on statistics, data mining, and machine learning in Astronomy, I recommend the book by Ivezic et al. (2014), which covers in greater depth many of the topics presented in this document, as well as many other related topics. For additional examples of machine learning in Astronomy, implemented in Python, I recommend astroML²⁰ (Vanderplas et al. 2012).

I am grateful to I. Arcavi, N. Lubelchick, D. Poznanski, I. Reis, S. Shahaf, and A. Sternberg for valuable discussions regarding the topics presented in this document and for helpful comments on the text.

¹⁹ https://www.coursera.org/learn/machine-learning

²⁰ http://www.astroml.org/