Machine learning and Databases at CAUP/IA in 2023

Course overview

This course is part of the doctoral program in astronomy at the Department of Physics and Astronomy at the University of Porto. The course title is "Topics on Methods and Modelling in Astrophysics". The course is divided into two parts: March 6-10 we have lectures from 10:00 to 13:00 each day, while March 13-17 is dedicated to practical problem sets provided in the lectures.

The aim of this course is to get a good practical grasp of machine learning. I will not spend a lot of time on algorithm details, but rather on how to use these methods in Python, and I will try to discuss which methods are useful for which type of scientific question or research goal.

March 6 - Managing data and simple regression
  • Covering git and SQL
  • Introducing machine learning through regression techniques.
March 7 - Visualisation and inference methods
  • Visualisation of data, dos and don'ts
  • Classical inference
  • Bayesian inference
  • MCMC
March 8 - Density estimation and model choice
  • Estimating densities, parametric & non-parametric
  • Bias-variance trade-off
  • Cross-validation
  • Classification
March 9 - Dimensional reduction
  • Standardising data.
  • Principal Component Analysis
  • Manifold learning
March 10 - Ensemble methods, neural networks, deep learning
  • Local regression methods
  • Random forests, boosting and other ensemble methods
  • Neural networks & deep learning

Literature for the course

I expect that you have read through these two documents:

  • A couple of Python & Topcat pointers. This is a very basic document and might not contain a lot of new material. It does have a couple of tasks to try out; the solutions to these can be found in the [ProblemSets/0 - Pyton and Topcat](ProblemSets/0 - Pyton and Topcat) directory.

  • A reminder/intro to relevant math contains a summary of some basic facts from linear algebra and probability theory that are useful for this course.

Below you can find some books of use. The links from the titles get you to the Amazon page. If there are free versions of the books legally available online, I include a link as well.

-"Elements of Statistical Learning - Hastie et al, is a more advanced version of the Introduction to Statistical Learning with much the same authors. This is also freely available on the web.

Software you need for the course

The course will make use of Python throughout, and for this you need a recent version of Python installed. I use Python 3 by default, and while some scripts will work with Python 2, there is really no good reason to continue using Python 2 (with some exception for important legacy code). For Python you will need (or at least I recommend) the following libraries installed:

  • numpy - for numerical calculations
  • astropy - because we are astronomers
  • scipy - because we are scientists
  • sklearn - machine learning library; its full name is scikit-learn
  • matplotlib - plotting (you can use alternatives of course)
  • pandas - nice handling of data
  • seaborn - nice plots

(the last two are really "nice to have" but if you can install the others then these are easy).
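As a quick check that the core libraries are installed and work together, you can run a short script along these lines (just a sketch - the synthetic data and the linear model are arbitrary placeholders):

# Quick check that numpy, matplotlib and scikit-learn work together.
# The data and model here are arbitrary; this only verifies the installation.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
y = 2.5 * x + 1.0 + rng.normal(scale=2.0, size=x.size)

model = LinearRegression()
model.fit(x.reshape(-1, 1), y)   # scikit-learn expects a 2D feature array
print("slope:", model.coef_[0], "intercept:", model.intercept_)

plt.scatter(x, y, s=5)
plt.plot(x, model.predict(x.reshape(-1, 1)), color="red")
plt.show()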

Personally I use the Anaconda Python distribution to manage my Python installation and to create environments. I strongly recommend using environments (often called virtual environments) for this course. These come in two main flavours: the built-in venv virtual environments, or the ones provided by conda. See for instance this overview (which is focused on conda) or this one for a more venv-focused intro. Since I use conda my examples will use that, but it is pretty easy to translate them to venv instead.

To set things up for this course, what I did (after installing anaconda) was

# Create an environment
> conda create -n mld2023 numpy scipy scikit-learn pandas seaborn matplotlib jupyter astropy pip
...
> conda activate mld2023

The first command is only done once, the second is done every time you start a new shell.

You should also get astroML, which has a nice web page at http://www.astroml.org/ and a git repository at https://github.com/astroML/astroML. This is the website associated with the "Statistics, Data Mining, and Machine Learning in Astronomy" book mentioned above. They also provide clear installation instructions. Personally I used their "From Source" instructions, but in general it is probably easier to use the "Conda" instructions if you use Anaconda and the "Python Package Index" instructions otherwise.
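Once astroML is installed, a quick way to check that it works is to import one of its dataset fetchers and load an example dataset (a sketch only - the fetcher below downloads its data from the web the first time it is called, and any of the fetchers in astroML.datasets would do):

# Quick check that astroML is installed and can fetch example data.
# fetch_dr7_quasar() downloads the SDSS DR7 quasar sample on first use.
from astroML.datasets import fetch_dr7_quasar

data = fetch_dr7_quasar()
print(data.shape)        # number of objects in the sample
print(data.dtype.names)  # columns available in the record array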

Making a copy of the repository that you can edit

In this case you will want to fork the repository rather than just clone it. You can follow the instructions below (credit to Alexander Mechev for a first version of this) to create a fork of the repository:

  • Make a github account and log in.
  • Go to the MLD2023 repo.
  • Click on the 'Fork' at the top right. This will create a 'fork' on your own account. That means that you now have the latest commit of the repo and its history in your control. If you've tried to 'git push' to the MLD2023 repo you'd have noticed that you don't have access to it.
  • Once it's forked, you can go to your github profile and you'll see a MLD2023 repo. Go to it and get the .git link (green button)
  • Somewhere on your machine, do
> git clone https://github.com/[YOUR_GIT_UNAME]/MLD2023.git

Lectures

The slides are available in the Lectures directory. You can find some files for creating tables in the ProblemSets/MakeTables directory.

Getting ready for deep learning in python

In the final problem class we will look at using deep learning in Python. In order to follow the examples, you will need to have some software installed. This is more involved than what we had above, so it might take some time to get working.

There are quite a few libraries for this around, but we will use the most commonly used one, TensorFlow, and we will use the keras Python package for interacting with it. Keras is a high-level interface (and can also use other backends, such as Theano and CNTK, in addition to TensorFlow).

There are many pages that detail the installation of these packages and what you need for them. A good one with a bias towards Windows is this one. I will give a very brief summary here of how I set things up. This is not optimised for Graphical Processing Unit (GPU) work so for serious future work you will need to adjust this.

Create an environment in anaconda

I am going to assume you use anaconda for your python environment. If not, you need to change this section a bit - use virtualenv instead of setting up a conda environment. It is definitely better to keep your TensorFlow/keras etc setup out of your default Python work environment. Most of the packages are also installed with pip rather than conda, so what I use in this case is:

> conda create -n tensorflow pip
<...>
> conda activate tensorflow

It is important to activate the environment, otherwise you'll mess up your default conda environment! Check that your prompt says [tensorflow] before continuing (see below):

[tensorflow] > pip install matplotlib astropy pandas scikit-learn seaborn jupyter astroML
<...>
[tensorflow] > 

Install tensorflow and keras

TensorFlow is a large package - 244 Mb in my installation - and it requires a fair number of additional packages, so this can take a bit of time.

[tensorflow] > pip install --upgrade tensorflow
<...>
[tensorflow] > pip install --upgrade keras

That should set you up fairly well for a first dip into deep learning.

Unfortunately this might not work for you out of the box - in particular, the dependency of TensorFlow on grpcio can lead to a lot of problems on Mac OS. See this discussion for more details. What I ended up doing was installing grpcio via conda.
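If the installation went through, a quick way to verify that TensorFlow and keras work is to build and train a tiny model on made-up data - a sketch only, where the layer sizes and the data are arbitrary placeholders:

# Minimal sanity check for the TensorFlow/keras installation.
# The network and data below are arbitrary; this only verifies that a
# model can be built, compiled, trained and used for prediction.
import numpy as np
import tensorflow as tf
from tensorflow import keras

print("TensorFlow version:", tf.__version__)

model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    keras.layers.Dense(1)
])
model.compile(optimizer="adam", loss="mse")

X = np.random.normal(size=(32, 4))    # 32 fake samples with 4 features
y = np.random.normal(size=(32,))
model.fit(X, y, epochs=1, verbose=0)  # one quick training pass
print(model.predict(X[:5]).shape)     # should print (5, 1)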
