Thesis

This repository contains all relevant code for my Master's thesis at Tilburg University: MSc in Cognitive Science & Artificial Intelligence.

The main package developed for the project is eeglearn, which lives inside the eeg-graph-learning directory.

Project overview

TL;DR

Self-supervised pre-training is the engine that powers the latest generation of language models, where the training objective is next-token prediction.

The same approach has been attempted for EEG data. Researchers have also tried to incorporate information about the placement of the EEG electrodes into the training process using graph neural networks.

Goal of the project


The goal is to test self-supervised pre-training on pretext tasks with graph neural networks [1-5] on a benchmark dataset [6].

Personal goals

  1. One of my personal goals is to make the code I wrote for my Master's thesis better than the code I wrote for my bachelor's thesis (the data extraction package and the analysis code).

    To that end, I have tried very hard to:

    1. Incorporate software engineering best practices (as far as I can learn them from disparate sources online or from asking an LLM).
    2. Make the code reproducible to the best of my ability (more info on how I tried to do that).
  2. Understand what coding LLMs can and cannot do.

    I have opted for the Cursor IDE. I usually ask questions with Claude 3.7 Sonnet. My impressions (updated regularly) are:

    1. Some context confuses it: Given how niche this project is, I have found that some of the language used in it (e.g., EEGs, power spectral density, etc.) causes the models some trouble.
    2. The "tab to complete" feature causes more poblems than it solves : So I have opted to turn it off. It is way too verbose and tries to do things that are unnecessary. However, if I need to do some quick refactoring (e.g., I want to delete the use of a variable throughout), this is quite useful.
    3. Updating documentation: Models are quite useful for creating general-purpose documentation (e.g., how to install a library) but get confused when the code becomes niche.
    4. Writing test cases: At the beginning of the project, I wrote test cases with Claude. However, as I learned more about testing, I realized it was writing tests that were redundant or too loosely constrained (i.e., too easy to pass).

I also focused on improving the work I was doing week after week, incorporating good ideas as I encountered them. This means some earlier files are perhaps not as clean and bug-free as they could be; for example, later files will likely have better test coverage. I hope to return to these once the core deliverables for my thesis have been completed.

A short timeline (updated regularly)

This project was built over several months.

  • February
    1. Initial discussion and familiarization with the project idea
    2. Familiarization with the previous code base and work done by a former student.
  • March
    1. Began work on building a feature store. The basic idea was to generate the features used in papers [1]-[5].
    2. Began to incorporate testing with PyTest
    3. Experimented with writing tests with Claude
  • April
    1. Some major improvements in code quality:

      1. Started using Ruff for linting the code (linting is exactly what it sounds like: just as you would lint a jacket for debris, Ruff lints your code for stuff you don't need and stuff you are missing).
      2. I wanted to get better at thinking ahead.
        1. After listening to some podcasts (here and here) with world-class engineers, I decided to use assert statements everywhere: at minimum, two per function, following the NASA principles for safety-critical code. I aim to adhere to at least rules 1, 2, and 5 (see the sketch after this timeline).
        2. Started adding type hints to function arguments and return values, as well as to any variables I declare. I want to get used to this, and hopefully soon get familiar with C/C++, which has even stricter requirements (and for which the original NASA rules were designed).
        3. Started measuring coverage with pytest-cov. Essentially, this checks which parts of your code are being hit by your tests and which are not (more on my experience with this here).
      3. Started writing code with the red-green-refactor method. I find this process extremely satisfying; the similarity between this method and the scientific method (i.e., conjectures and refutations) is just beautiful.
    2. Realized a simple way to reduce code repetition:

      1. One of the challenges in building the processing pipeline is dealing with different data shapes for the epoched and non-epoched cases. If a recording is epoched, it has shape (n_epochs x n_channels x ...); otherwise, it starts with (n_channels x ...). In late April, I realized that if I simply reshape all non-epoched data to have an epoch dimension of 1, I could remove a ton of code duplication and branching. For now, I am implementing this going forward, without doing a full refactor; the change applies from the get_spatial_permutations method onwards (see the sketch just below).
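
A minimal sketch of the idea, assuming NumPy arrays (the helper name ensure_epoch_dim and the example shapes are hypothetical, not taken from eeglearn):

```python
import numpy as np


def ensure_epoch_dim(data: np.ndarray) -> np.ndarray:
    """Give non-epoched recordings a leading epoch axis of size 1,
    so downstream code can always assume (n_epochs, n_channels, ...)."""
    assert isinstance(data, np.ndarray), "expected a NumPy array"
    assert data.ndim >= 2, "expected at least (n_channels, n_samples)"
    if data.ndim == 2:
        # (n_channels, n_samples) -> (1, n_channels, n_samples)
        return data[np.newaxis, ...]
    return data


# Both layouts now look identical to the rest of the pipeline.
continuous = np.zeros((26, 5120))   # non-epoched: (n_channels, n_samples)
epoched = np.zeros((10, 26, 512))   # epoched: (n_epochs, n_channels, n_samples)
assert ensure_epoch_dim(continuous).shape == (1, 26, 5120)
assert ensure_epoch_dim(epoched).shape == (10, 26, 512)
```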
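
As a companion, here is a minimal sketch of the assertion and type-hint discipline described under April. The function bandpower_ratio is a made-up example rather than code from eeglearn; it just shows the pattern of at least two precondition assertions plus a postcondition, with the arguments, return value, and locals all annotated:

```python
import numpy as np


def bandpower_ratio(psd: np.ndarray, band_mask: np.ndarray) -> float:
    """Fraction of total spectral power that falls inside a frequency-band mask."""
    # Preconditions: at least two assertions per function.
    assert psd.ndim == 1 and psd.size > 0, "psd must be a non-empty 1-D array"
    assert band_mask.shape == psd.shape and band_mask.dtype == np.bool_, \
        "band_mask must be a boolean mask with the same shape as psd"

    total: float = float(psd.sum())
    in_band: float = float(psd[band_mask].sum())
    ratio: float = in_band / total if total > 0.0 else 0.0

    # Postcondition: a power ratio must lie in [0, 1].
    assert 0.0 <= ratio <= 1.0, "ratio out of bounds"
    return ratio
```

Running the test suite with pytest-cov then shows which of these lines the tests actually hit.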

References

  1. Tang, S., Dunnmon, J. A., Saab, K., Zhang, X., Huang, Q., Dubost, F., ... & Lee-Messer, C. (2021). Automated seizure detection and seizure type classification from electroencephalography with a graph neural network and self-supervised pre-training. arXiv preprint arXiv:2104.08336.

  2. Li, Y., Chen, J., Li, F., Fu, B., Wu, H., Ji, Y., ... & Zheng, W. (2022). GMSS: Graph-based multi-task self-supervised learning for EEG emotion recognition. IEEE Transactions on Affective Computing, 14(3), 2512-2525.

  3. Qiu, L., Zhong, L., Li, J., Feng, W., Zhou, C., & Pan, J. (2024). SFT-SGAT: A semi-supervised fine-tuning self-supervised graph attention network for emotion recognition and consciousness detection. Neural Networks, 180, 106643.

  4. Zeng, Y., Lin, J., Li, Z., Xiao, Z., Wang, C., Ge, X., ... & Liu, M. (2024). Adaptive node feature extraction in graph-based neural networks for brain diseases diagnosis using self-supervised learning. NeuroImage, 297, 120750.

  5. Van Dijk, H., Van Wingen, G., Denys, D., Olbrich, S., Van Ruth, R., & Arns, M. (2022). The two decades brainclinics research archive for insights in neurophysiology (TDBRAIN) database. Scientific Data, 9(1), 333. https://doi.org/10.1038/s41597-022-01474-2
