Machine Learning in Astronomy: a practical overview

1. CONTEXT

Astronomical datasets are growing rapidly in size and complexity, bringing Astronomy into the era of big data science (e.g., Ball & Brunner 2010, Pesenson et al. 2010). This growth is a result of past, ongoing, and future surveys that produce massive multi-temporal and multi-wavelength datasets, with a wealth of information to be extracted and analyzed. Such surveys include the Sloan Digital Sky Survey (SDSS; York et al. 2000), which provided the community with multi-color images of ∼1/3 of the sky, and high-resolution spectra of millions of Galactic and extra-galactic objects. Pan-STARRS (Kaiser et al. 2010) and the Zwicky Transient Facility (Bellm 2014) perform a systematic exploration of the variable sky, delivering time-series of numerous asteroids, variable stars, supernovae, active galactic nuclei, and more. Gaia (Gaia Collaboration et al. 2016) is charting a three-dimensional map of the Milky Way, and will provide accurate positional and radial velocity measurements for over a billion stars in our Galaxy and throughout the Local Group. Future surveys, e.g., DESI (Levi et al. 2013), SKA (Dewdney et al. 2009), and LSST (Ivezic et al. 2008), will increase the number of available objects and their measured properties by more than an order of magnitude.

In light of this accelerated growth, astronomers are developing automated tools to detect, characterize, and classify objects using the rich and complex datasets gathered with the different facilities. Machine learning algorithms have gained increasing popularity among astronomers, and are widely used for a variety of tasks.

Machine learning algorithms are generally divided into two groups. Supervised machine learning algorithms are used to learn a mapping from a set of features to a target variable, based on example input-output pairs provided by a human expert (see, e.g., Connolly et al. 1995; Collister & Lahav 2004; Re Fiorentin et al. 2007; Mahabal et al. 2008; Daniel et al. 2011; Laurino et al. 2011; Morales-Luis et al. 2011; Bloom et al. 2012; Brescia et al. 2012; Richards et al. 2012; Krone-Martins et al. 2014; Masci et al. 2014; Miller 2015; Wright et al. 2015; Djorgovski et al. 2016; D'Isanto et al. 2016; Lochner et al. 2016; Castro et al. 2018; Naul et al. 2018; D'Isanto & Polsterer 2018; D'Isanto et al. 2018; Krone-Martins et al. 2018; Zucker & Giryes 2018; Delli Veneri et al. 2019; Ishida et al. 2019; Mahabal et al. 2019; Norris et al. 2019; Reis et al. 2019). Unsupervised learning algorithms are used to learn complex relationships that exist in the dataset, without labels provided by an expert. These can roughly be divided into clustering, dimensionality reduction, and anomaly detection (e.g., Boroson & Green 1992; Protopapas et al. 2006; D'Abrusco et al. 2009; Vanderplas & Connolly 2009; Sánchez Almeida et al. 2010; Ascasibar & Sánchez Almeida 2011; D'Abrusco et al. 2012; Meusinger et al. 2012; Fustes et al. 2013; Krone-Martins & Moitinho 2014; Baron et al. 2015; Hocking et al. 2015; Gianniotis et al. 2016; Nun et al. 2016; Polsterer et al. 2016; Baron & Poznanski 2017; Reis et al. 2018a, b). Unsupervised algorithms are arguably more important for scientific research, since they can be used to extract new knowledge from existing datasets, and can potentially facilitate new discoveries.
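The distinction between the two groups can be illustrated with a minimal sketch. It assumes synthetic two-dimensional data and two deliberately simple stand-in algorithms, a nearest-centroid classifier and a two-cluster k-means, neither of which is taken from the works cited above. All function names (make_data, fit_supervised, fit_unsupervised, and so on) are hypothetical and serve only this illustration.

```python
import random

def make_data(seed=0, n=50):
    """Two synthetic, well-separated populations in a 2-feature space (assumed data)."""
    rng = random.Random(seed)
    pop_a = [(rng.gauss(0.0, 0.3), rng.gauss(0.0, 0.3)) for _ in range(n)]
    pop_b = [(rng.gauss(5.0, 0.3), rng.gauss(5.0, 0.3)) for _ in range(n)]
    return pop_a + pop_b, [0] * n + [1] * n

def centroid(points):
    """Component-wise mean of a non-empty list of 2D points."""
    return tuple(sum(p[i] for p in points) / len(points) for i in range(2))

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def fit_supervised(X, y):
    """Nearest-centroid classifier: one centroid per expert-provided label."""
    return {lab: centroid([x for x, t in zip(X, y) if t == lab]) for lab in set(y)}

def predict(model, x):
    """Assign x to the label whose centroid is closest."""
    return min(model, key=lambda lab: dist2(model[lab], x))

def fit_unsupervised(X, n_iter=20):
    """Two-cluster k-means: groups are found from the features alone."""
    # Deterministic start: the first object and the object farthest from it.
    centers = [X[0], max(X, key=lambda p: dist2(p, X[0]))]
    assign = []
    for _ in range(n_iter):
        # Assignment step: each object joins its nearest center.
        assign = [min((0, 1), key=lambda k: dist2(centers[k], x)) for x in X]
        # Update step: each center moves to the mean of its members.
        for k in (0, 1):
            members = [x for x, a in zip(X, assign) if a == k]
            if members:  # keep the old center if a cluster empties
                centers[k] = centroid(members)
    return assign

X, y = make_data()
model = fit_supervised(X, y)     # supervised: uses the labels y
clusters = fit_unsupervised(X)   # unsupervised: never sees y
```

The supervised model can then be applied to a new object with predict(model, x). The cluster indices returned by the unsupervised routine are arbitrary, so they can agree with the expert labels only up to a permutation.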

In view of the shift in data analysis paradigms and associated challenges, the IAC Winter School 2018 focused on big data in Astronomy. It included both lectures and hands-on tutorials, which are publicly available through the school website.¹ The school covered the following topics: (1) a general overview of the use of machine learning techniques in Astronomy: past, present and perspectives, (2) data challenges and solutions in forthcoming surveys, (3) supervised learning: classification and regression, (4) unsupervised learning and dimensionality reduction techniques, and (5) shallow and deep neural networks. In this document I summarize the topics of supervised and unsupervised learning algorithms, with special emphasis on unsupervised techniques. This document is not intended to provide a rigorous statistical background, but rather to present practical information on popular machine learning algorithms and their application to astronomical datasets. Supervised learning algorithms are discussed in section 2, with an emphasis on optimization (section 2.1), input datasets (section 2.2), and three popular algorithms: Support Vector Machine (section 2.3), Decision Trees and Random Forest (section 2.4), and shallow Artificial Neural Networks (section 2.5). Unsupervised learning algorithms are discussed in section 3, in particular distance assignment (section 3.1), clustering algorithms (section 3.2), dimensionality reduction algorithms (section 3.3), and anomaly detection algorithms (section 3.4).

¹ http://www.iac.es/winterschool/2018/