Dalya Baron, Stanford University

TAP Computation and Data Initiative Computational Workshops
Visit Dates: Dec 8 - 10
Steward Observatory, N305
Applications of Unsupervised Machine Learning Techniques for Data Exploration and Discovery in Astronomy
In this workshop, we will explore several classes of algorithms that belong to the family of unsupervised machine learning. These algorithms are called unsupervised because they do not require “ground truth” labels or target variables for training. Instead, they operate directly on the data and are used for clustering, component separation, dimensionality reduction, data visualization, and outlier detection. These methods are particularly useful for exploring the complex and heterogeneous datasets common in astronomy and can facilitate new discoveries.
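As a rough illustration of what "operating directly on the data" can look like, the sketch below scores each object by its mean distance to its nearest neighbors and flags the most isolated one, without using any labels. The data points and the function name `knn_outlier_scores` are invented for this example; it is one simple distance-based approach, not necessarily a method covered in the workshop.

```python
import math

# Invented 2-D feature vectors; no labels are provided anywhere.
points = [
    (1.0, 1.1), (0.9, 1.0), (1.1, 0.9), (1.0, 0.95),  # a compact group
    (5.0, 5.0),                                       # an isolated point
]

def knn_outlier_scores(rows, k=2):
    """Score each point by its mean distance to its k nearest neighbors."""
    scores = []
    for i, p in enumerate(rows):
        # Distances from point i to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(rows) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

scores = knn_outlier_scores(points)
# The index with the largest score is the most isolated object.
most_unusual = max(range(len(points)), key=scores.__getitem__)
print(most_unusual)
```

Here the isolated point at index 4 receives by far the largest score. On real data, the choice of `k` and of the distance measure would both need care.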
We will begin with a broad discussion of the motivation for using data science and machine learning techniques in the context of data exploration and discovery. From there, we will work together to understand how to apply these tools effectively to astronomical datasets. Since all of these methods rely on some notion of distance or similarity to relate astronomical objects, we will first look at different ways to represent data and consider the tradeoffs of each approach. We will then survey dimensionality reduction, clustering, and outlier detection techniques, and discuss how to interpret their outputs meaningfully. Finally, we will go through a set of guidelines for incorporating unsupervised machine learning into our own research in a safe and constructive way.
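To make the point about distance and representation concrete, here is a minimal sketch showing that an object's nearest neighbor can change depending on whether features are compared in raw units or after standardization. The toy catalog, the feature names (a color index and a flux), and the helper functions are all hypothetical and chosen only for illustration.

```python
import math
from statistics import mean, pstdev

# Hypothetical toy catalog: each row is (color index, flux in arbitrary units).
# The values are invented for illustration only.
objects = [
    (0.10, 1000.0),
    (0.90, 1010.0),
    (0.12, 1100.0),
    (0.50, 2000.0),
]

def euclidean(a, b):
    """Plain Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def standardize(rows):
    """Rescale every feature (column) to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) for c in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]

def nearest_neighbor(rows, index):
    """Index of the row closest to rows[index], excluding itself."""
    others = [i for i in range(len(rows)) if i != index]
    return min(others, key=lambda i: euclidean(rows[index], rows[i]))

raw_nn = nearest_neighbor(objects, 0)                  # flux dominates the distance
scaled_nn = nearest_neighbor(standardize(objects), 0)  # both features contribute
print(raw_nn, scaled_nn)
```

In raw units the flux column dominates, so object 0 is matched to object 1 despite their very different colors. After standardization it is matched to object 2, which has a similar color and a modestly different flux. Neither answer is "correct" in general; which representation is appropriate depends on the scientific question.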
Learning Objectives:
- Gain an understanding of unsupervised learning methods
- Use methods to develop data representations
- Apply dimensionality reduction techniques
Links to material/software:
Reading: A short review can be found here: https://arxiv.org/abs/1904.07248. For more in-depth reading, the book Statistics, Data Mining, and Machine Learning in Astronomy (link) is a great resource.
Code: It is highly recommended to work in Python, as all of the techniques have been implemented in it, and most use a fairly standard input-output structure. The easiest way to ensure that all the relevant packages are available is to install Anaconda. We recommend working in a new environment specifically defined for the school.
Below are some online resources that explain the importance of using separate Anaconda environments, along with instructions for installing Anaconda and setting up your first environment (the first link includes all the steps you need):
• https://www.geeksforgeeks.org/machine-learning/set-up-virtual-environme…;
• https://www.anaconda.com/docs/getting-started/anaconda/install
• https://www.anaconda.com/docs/tools/working-with-conda/environments
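Once the environment is set up and activated, a quick check like the sketch below can confirm that packages are importable from it. The package list here is an assumption made for illustration (this page does not list the required packages), and `check_packages` is a hypothetical helper, so adjust the names to whatever the workshop materials specify.

```python
import importlib.util

# Assumed package list for illustration; the workshop's actual requirements may differ.
ASSUMED_PACKAGES = ["numpy", "scipy", "sklearn", "matplotlib"]

def check_packages(names):
    """Return {module name: True if it can be found in the current environment}."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

if __name__ == "__main__":
    for name, available in check_packages(ASSUMED_PACKAGES).items():
        print(f"{name}: {'found' if available else 'MISSING'}")
```

Running this with the Python interpreter of the new environment reports which modules are visible there, which helps catch the common mistake of installing packages into a different environment than the one in use.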