Leveraging conformal prediction to annotate enzyme function space with limited false positives

luo-group/CPEC

CPEC

This is the official repository for the paper Leveraging conformal prediction to annotate enzyme function space with limited false positives.


Install dependencies

conda create -n cpec python=3.9
conda activate cpec

We use PyTorch 1.12.1, which can be installed by following the instructions on the official PyTorch website.

pip install -r requirements.txt

Run CPEC

(Optional) Base model implementation

PenLight2

We provide the implementation and detailed instructions of PenLight2 in base_models/PenLight2.

Training
python scripts/train.py path/to/config

The configuration files can be found in directory configs. The default training configuration is configs/train_EC.yml.

Inference
python scripts/infer_max_sep.py --model_dir path/to/model_dir --test_data path/to/test_data

--model_dir: the log directory generated by scripts/train.py.

--test_data: path to the test data file.

The inference results will be saved in model_dir/max_sep_pred.csv

CLEAN

We provide the implementation and detailed instructions of CLEAN in base_models/CLEAN.

Running in Docker (CPU version)
  1. Pull the Docker image for AMD64 Ubuntu machines from moleculemaker/clean-image-amd64:

docker pull moleculemaker/clean-image-amd64

  2. Running this library requires downloading large weight files (around 7.3GB), so it's better to pre-download the weight files and mount them when running the Docker container. You can download them with:

curl -o esm1b_t33_650M_UR50S-contact-regression.pt https://dl.fbaipublicfiles.com/fair-esm/regression/esm1b_t33_650M_UR50S-contact-regression.pt

curl -o esm1b_t33_650M_UR50S.pt https://dl.fbaipublicfiles.com/fair-esm/models/esm1b_t33_650M_UR50S.pt

  3. From the directory containing these weight files, run the Docker image. The command below mounts the downloaded weights into the Docker container, starts the container, and runs the CLEAN library on the file price.fasta (which is already packaged in the image). If you wish to run this on your own FASTA file, you can copy it into the /app/data/inputs directory.

sudo docker run -it -v ./:/root/.cache/torch/hub/checkpoints moleculemaker/clean-image-amd64 /bin/bash -c 'echo Starting Execution && python $(pwd)/CLEAN_infer_fasta.py --fasta_data price'

The output file will be generated in the results/inputs directory with the same name as the input file.

Other base models

The user can also implement customized base models for FDR-controlled EC number prediction as long as the model can output a probability for each EC class.
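As a rough illustration of the requirement above, the sketch below shows one way a custom base model's raw per-class scores could be turned into a probability for each EC class. It is not part of this repository: the function name scores_to_probabilities and the example scores are hypothetical, and the element-wise sigmoid is only one possible choice for a multi-label model.

```python
import math
from typing import List


def scores_to_probabilities(scores: List[List[float]]) -> List[List[float]]:
    """Map raw per-class scores to independent probabilities with a sigmoid.

    `scores` has shape [n_samples, n_labels]; the output has the same shape.
    This helper is hypothetical and only illustrates the expected output format.
    """

    def sigmoid(x: float) -> float:
        # Numerically stable form that avoids overflow for large |x|.
        if x >= 0:
            return 1.0 / (1.0 + math.exp(-x))
        z = math.exp(x)
        return z / (1.0 + z)

    return [[sigmoid(s) for s in row] for row in scores]


# Made-up raw scores for 2 proteins and 3 EC classes.
raw_scores = [[2.0, -1.0, 0.0], [-3.0, 0.5, 4.0]]
probs = scores_to_probabilities(raw_scores)
```

Each row of probs then holds one value per EC class, which matches the per-class probability output that a base model needs to provide.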

False Discovery Rate (FDR)-controlled EC number prediction

This repository includes an example of CPEC predicting EC numbers in the notebook demo.ipynb, and the raw output of our base model is provided in the folder example_data/. You can therefore test false discovery rate (FDR)-controlled EC number prediction entirely within this repository. Although CPEC can be applied to general machine learning prediction methods, we use PenLight2, a contrastive learning-based model, as the base model for illustration.

If you want to test other base models, you can normalize the predicted probabilities into $\left[0,1\right]$ and save the predicted probabilities and ground truths into tensors of the shape $\left[n_{samples}, n_{labels}\right]$. Then, using the function calibrate_fdr(), you can calculate the valid model parameter on calibration data and make FDR-controlled predictions on your own test data.
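To make the calibrate-then-predict idea concrete, here is a simplified sketch in plain Python. It is not the repository's calibrate_fdr() (whose signature is not shown here) and it omits the finite-sample correction used in conformal methods. It only picks the lowest probability threshold whose empirical FDR on the calibration data stays within a target level alpha. The names empirical_fdr and pick_threshold and all data values are hypothetical.

```python
from typing import List


def empirical_fdr(probs: List[List[float]], labels: List[List[int]],
                  threshold: float) -> float:
    """Fraction of predicted labels (prob >= threshold) that are false positives."""
    predicted = 0
    false_positives = 0
    for p_row, y_row in zip(probs, labels):
        for p, y in zip(p_row, y_row):
            if p >= threshold:
                predicted += 1
                if y == 0:
                    false_positives += 1
    # With no predictions there are no false discoveries.
    return false_positives / predicted if predicted else 0.0


def pick_threshold(probs: List[List[float]], labels: List[List[int]],
                   alpha: float) -> float:
    """Scan thresholds from high to low and stop at the first FDR violation."""
    candidates = sorted({p for row in probs for p in row}, reverse=True)
    chosen = float("inf")  # Predict nothing if even the strictest threshold fails.
    for t in candidates:
        if empirical_fdr(probs, labels, t) <= alpha:
            chosen = t
        else:
            break
    return chosen


# Made-up calibration data of shape [n_samples, n_labels].
cal_probs = [[0.9, 0.2, 0.1], [0.8, 0.7, 0.05], [0.3, 0.95, 0.6]]
cal_labels = [[1, 0, 0], [1, 0, 0], [0, 1, 1]]

threshold = pick_threshold(cal_probs, cal_labels, alpha=0.2)

# Apply the calibrated threshold to made-up test probabilities.
test_probs = [[0.85, 0.4, 0.1]]
test_pred = [[int(p >= threshold) for p in row] for row in test_probs]
```

In this toy example the scan stops once lowering the threshold would admit a false positive that pushes the empirical FDR above alpha. For actual experiments, use calibrate_fdr() and demo.ipynb from this repository.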

Citation

Ding, Kerr, Jiaqi Luo, and Yunan Luo. "Leveraging conformal prediction to annotate enzyme function space with limited false positives." PLOS Computational Biology 20, no. 5 (2024): e1012135.

@article{ding2024leveraging,
  title={Leveraging conformal prediction to annotate enzyme function space with limited false positives},
  author={Ding, Kerr and Luo, Jiaqi and Luo, Yunan},
  journal={PLOS Computational Biology},
  volume={20},
  number={5},
  pages={e1012135},
  year={2024},
  publisher={Public Library of Science San Francisco, CA USA}
}

Contact

Please submit GitHub issues or contact Kerr Ding (kerrding[at]gatech[dot]edu) and Yunan Luo (yunan[at]gatech[dot]edu) for any questions related to the source code.
