This course is part of the doctoral program in astronomy at the Department of Physics and Astronomy at the University of Porto. The course title is "Topics on Methods and Modelling in Astrophysics". The course is divided into two parts: March 6-10 we have lectures from 10:00 to 13:00 each day, while March 13-17 is dedicated to a practical problem set provided in the lectures.
The aim of this course is to give you a good practical grasp of machine learning. I will not spend a lot of time on algorithmic details, but rather on how to use these methods in Python, and I will try to discuss which methods are useful for which type of scientific question or research goal.
- March 6 - Managing data and simple regression
  - Covering git and SQL
  - Introducing machine learning through regression techniques
- March 7 - Visualisation and inference methods
  - Visualisation of data, do's and don'ts
  - Classical inference
  - Bayesian inference
  - MCMC
- March 8 - Density estimation and model choice
  - Estimating densities, parametric & non-parametric
  - Bias-variance trade-off
  - Cross-validation
  - Classification
- March 9 - Dimensionality reduction
  - Standardising data
  - Principal Component Analysis
  - Manifold learning
- March 10 - Ensemble methods, neural networks, deep learning
  - Local regression methods
  - Random forests and other ensemble methods
  - Neural networks & deep learning
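As a small taste of the regression techniques covered on the first day, here is a minimal sketch using scikit-learn; the data and the true slope/intercept are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Generate noisy data from a known linear relation: y = 3x + 1.5 + noise
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=(50, 1))
y = 3.0 * x[:, 0] + 1.5 + rng.normal(0, 0.5, size=50)

# Fit a straight line and recover the slope and intercept
model = LinearRegression().fit(x, y)
print(f"slope = {model.coef_[0]:.2f}, intercept = {model.intercept_:.2f}")
```

The fitted coefficients should land close to the input values of 3 and 1.5; how to quantify that uncertainty is part of what the inference lectures cover.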
I expect that you have read through these two documents:
- A couple of Python & Topcat pointers. This is a very basic document and might not contain much that is new to you. It does include a couple of tasks to try out; the solutions for these can be found in the [ProblemSets/0 - Pyton and Topcat](ProblemSets/0 - Pyton and Topcat) directory.
- A reminder/intro to relevant math, which contains a summary of some basic facts from linear algebra and probability theory that are useful for this course.
Below you can find some books of use. The links from the titles take you to the Amazon page; if a free version of a book is legally available online, I include a link to that as well.
- I base myself partially on "Statistics, Data Mining, and Machine Learning in Astronomy" by Ivezic, Connolly, VanderPlas & Gray.
- I have also consulted "Deep Learning" by Goodfellow, Bengio & Courville.
- "Pattern Classification" by Duda, Hart & Stork is a classic in the field.
- "Pattern Recognition and Machine Learning" by Bishop is a very good and comprehensive book. Personally I really like this one.
- "Bayesian Data Analysis" by Gelman et al. is often the first book you are pointed to if you ask questions about Bayesian analysis.
- "Information Theory, Inference and Learning Algorithms" by MacKay is a very readable book on a lot of related topics. The book is also freely available on the web.
- "Introduction to Statistical Learning" by James et al. is a readable, fairly basic introduction to statistical techniques of relevance. It is also freely available on the web.
- "Elements of Statistical Learning" by Hastie et al. is a more advanced version of the Introduction to Statistical Learning, with much the same authors. It is also freely available on the web.
- "Bayesian Models for Astrophysical Data" by Hilbe, Souza & Ishida is a good reference book for a range of Bayesian techniques and a good way to learn about different modelling frameworks for Bayesian inference.
The course will make use of Python throughout, and for this you need a recent version of Python installed. I use Python 3 by default, and while some scripts may work with Python 2, there is really no good reason to continue using Python 2 (with some exceptions for important legacy code). For Python you will need (or at least I recommend) the following libraries installed:
- numpy - for numerical calculations
- astropy - because we are astronomers
- scipy - because we are scientists
- scikit-learn (imported as sklearn) - machine learning library
- matplotlib - plotting (you can use alternatives of course)
- pandas - nice handling of data
- seaborn - nice plots
(the last two are really "nice to have" but if you can install the others then these are easy).
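A quick way to check that everything is in place is to try importing each library and printing its version. This is just a convenience sketch; the package list matches the one above:

```python
import importlib

# Try importing each required package and report its version (or absence)
for name in ("numpy", "astropy", "scipy", "sklearn",
             "matplotlib", "pandas", "seaborn"):
    try:
        mod = importlib.import_module(name)
        print(f"{name:12s} {getattr(mod, '__version__', 'unknown')}")
    except ImportError:
        print(f"{name:12s} MISSING - please install it")
```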
Personally I use the Anaconda Python distribution to manage my Python installation and to create environments. I strongly recommend using environments (often called virtual environments) for this course. These come in two main flavours: the built-in venv virtual environments, or the ones provided by conda. See for instance this overview (which is focused on conda) or this one for a more venv-focused intro. Since I use conda, my examples will use that, but it is pretty easy to translate them to venv.
To set things up for this course, what I did (after installing Anaconda) was:

```shell
# Create an environment
> conda create -n mld2023 numpy scipy scikit-learn pandas seaborn matplotlib jupyter astropy pip
...
> conda activate mld2023
```
The first command is only done once, the second is done every time you start a new shell.
You should also get astroML, which has a nice web page at http://www.astroml.org/ and a git repository at https://github.com/astroML/astroML. This is the website associated with the "Statistics, Data Mining, and Machine Learning in Astronomy" book mentioned above. They also provide clear installation instructions. Personally I used their "From Source" instructions, but it is probably easier in general to use the "Conda" instructions if you use Anaconda, and the "Python Package Index" instructions otherwise.
In this case you will want to fork the repository rather than just clone it. You can follow the instructions below (credit to Alexander Mechev for a first version of this) to create a fork of the repository:
- Make a GitHub account and log in.
- Go to the MLD2023 repo.
- Click on 'Fork' at the top right. This will create a 'fork' on your own account. That means that you now have the latest commit of the repo and its history under your control. (If you had tried to `git push` to the MLD2023 repo you would have noticed that you don't have access to it.)
- Once it's forked, you can go to your GitHub profile and you'll see a MLD2023 repo. Go to it and get the .git link (green button).
- Somewhere on your machine, clone your fork:

  ```shell
  > git clone https://github.com/[YOUR_GIT_UNAME]/MLD2023.git
  ```

- Move into the directory:

  ```shell
  > cd MLD2023
  ```

- Add my repo as an upstream. That way you can get (pull) new updates:

  ```shell
  > git remote add upstream https://github.com/jbrinchmann/MLD2023.git
  ```

  After this, `git remote -v` should give:

  ```
  origin   https://github.com/[YOUR_GIT_UNAME]/MLD2023.git (fetch)
  origin   https://github.com/[YOUR_GIT_UNAME]/MLD2023.git (push)
  upstream https://github.com/jbrinchmann/MLD2023.git (fetch)
  upstream https://github.com/jbrinchmann/MLD2023.git (push)
  ```

- Now you're ready to add files and folders to your local fork. Use `git add`, `git commit` and `git push` to store this work online.
The slides are available in the Lectures directory. You can find some files for creating tables in the ProblemSets/MakeTables directory.
In the final problem class we will look at using deep learning in Python. In order to follow the examples, you will need to have some software installed. This is more involved than what we had above, so it might take some time to get working.
There are quite a few libraries for this around, but we will use the most commonly used one, TensorFlow, and we will use the keras Python package to interact with TensorFlow. Keras is a high-level interface (and can also use other libraries, Theano and CNTK, in addition to TensorFlow).
There are many pages that detail the installation of these packages and what you need for them. A good one, with a bias towards Windows, is this one. I will give a very brief summary here of how I set things up. This is not optimised for Graphics Processing Unit (GPU) work, so for serious future work you will need to adjust it.
I am going to assume you use Anaconda for your Python environment. If not, you need to change this section a bit: use virtualenv instead of setting up a conda environment. It is definitely better to keep your TensorFlow/keras setup out of your default Python work environment. Most of the packages are also installed with pip rather than conda, so what I use in this case is:

```shell
> conda create -n tensorflow pip
<...>
> conda activate tensorflow
```
It is important to activate the environment, otherwise you'll mess up your default conda environment! Check that your prompt says [tensorflow] before continuing (see below):

```shell
[tensorflow] > pip install matplotlib astropy pandas scikit-learn seaborn jupyter astroML
<...>
[tensorflow] >
```
TensorFlow is a large package (244 MB in my installation) and it requires a fair number of additional packages, so this can take a bit of time:

```shell
[tensorflow] > pip install --upgrade tensorflow
<...>
[tensorflow] > pip install --upgrade keras
```
That should set you up fairly well for a first dip into deep learning.
Unfortunately this might not work for you out of the box; in particular, the dependency of TensorFlow on grpcio can lead to a lot of problems on Mac OS. See this discussion for more details. What I ended up doing was installing grpcio via conda.
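Once everything is installed, a quick sanity check is to build and briefly train a tiny network on random data. The layer sizes and the toy two-class data below are arbitrary choices for illustration, not part of the course material:

```python
import numpy as np
from tensorflow import keras

# A tiny fully-connected network: 4 inputs -> 8 hidden units -> 1 output
model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Train for a couple of epochs on random data just to confirm things run
X = np.random.default_rng(0).normal(size=(32, 4))
y = (X[:, 0] > 0).astype("float32")
model.fit(X, y, epochs=2, verbose=0)
print("keras OK,", model.count_params(), "parameters")
```

If this prints without errors, your TensorFlow/keras installation is ready for the deep learning problem class.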