CPEC

This is the official repository for the paper "Leveraging conformal prediction to annotate enzyme function space with limited false positives".

Install dependencies

conda create -n cpec python=3.9
conda activate cpec

We use PyTorch 1.12.1, which can be installed by following the instructions on the official PyTorch website.

pip install -r requirements.txt
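After installation, a quick sanity check can confirm that the interpreter and PyTorch versions match the ones listed above. This is a minimal sketch: the helper name check_environment is hypothetical and not part of this repository, and other versions may well work.

```python
import importlib.util
import sys


def check_environment(expected_python=(3, 9), expected_torch="1.12.1"):
    """Return a list of warnings; an empty list means the versions match."""
    warnings = []
    if sys.version_info[:2] != tuple(expected_python):
        warnings.append(
            f"Python {sys.version_info.major}.{sys.version_info.minor} found, "
            f"expected {expected_python[0]}.{expected_python[1]}"
        )
    if importlib.util.find_spec("torch") is None:
        warnings.append("PyTorch is not installed")
    else:
        import torch

        # Builds may carry a local suffix such as "+cu113"; compare the base version.
        installed = torch.__version__.split("+")[0]
        if installed != expected_torch:
            warnings.append(f"PyTorch {installed} found, expected {expected_torch}")
    return warnings


if __name__ == "__main__":
    for message in check_environment() or ["Environment matches the README."]:
        print(message)
```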

Run CPEC

(Optional) Base model implementation

PenLight2

We provide the implementation and detailed instructions of PenLight2 in base_models/PenLight2.

Training

python scripts/train.py path/to/config

The configuration files are in the configs directory. The default training configuration is configs/train_EC.yml.

Inference

python scripts/infer_max_sep.py --model_dir path/to/model_dir --test_data path/to/test_data

--model_dir: the log directory generated by scripts/train.py.

--test_data: path to the test data file.

The inference results are saved to model_dir/max_sep_pred.csv.
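The column layout of max_sep_pred.csv is not described here, so the sketch below avoids assuming any column names. It only reads the first row as a header (an assumption) and returns the next few rows using Python's standard csv module. The function name preview_predictions is hypothetical.

```python
import csv
from pathlib import Path


def preview_predictions(model_dir, n_rows=5):
    """Return (header, first n_rows rows) of model_dir/max_sep_pred.csv."""
    csv_path = Path(model_dir) / "max_sep_pred.csv"
    with csv_path.open(newline="") as handle:
        reader = csv.reader(handle)
        # Assumption: the first row is a header; an empty file yields [].
        header = next(reader, [])
        # zip stops as soon as range(n_rows) is exhausted.
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows
```

For example, `header, rows = preview_predictions("path/to/model_dir")` gives a quick look at the predictions without loading the whole file.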

CLEAN

We provide the implementation and detailed instructions of CLEAN in base_models/CLEAN.

Running in Docker (CPU version)

Pull the Docker image for AMD64 Ubuntu machines from moleculemaker/clean-image-amd64:


docker pull moleculemaker/clean-image-amd64

Running this library requires large weight files (around 7.3 GB), so it is better to download them in advance and mount them when running the Docker container. You can download them with:


curl -o esm1b_t33_650M_UR50S-contact-regression.pt https://dl.fbaipublicfiles.com/fair-esm/regression/esm1b_t33_650M_UR50S-contact-regression.pt

curl -o esm1b_t33_650M_UR50S.pt https://dl.fbaipublicfiles.com/fair-esm/models/esm1b_t33_650M_UR50S.pt
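Before starting the container, it may help to confirm that both weight files are present in the directory that will be mounted. The sketch below uses the file names from the download commands above; the helper name missing_weights is hypothetical, and the check only looks for non-empty files rather than verifying their contents.

```python
from pathlib import Path

# File names taken from the download commands above.
WEIGHT_FILES = (
    "esm1b_t33_650M_UR50S-contact-regression.pt",
    "esm1b_t33_650M_UR50S.pt",
)


def missing_weights(directory="."):
    """Return the expected weight file names that are absent or empty."""
    root = Path(directory)
    return [
        name
        for name in WEIGHT_FILES
        if not (root / name).is_file() or (root / name).stat().st_size == 0
    ]


if __name__ == "__main__":
    missing = missing_weights()
    print("All weight files found." if not missing else f"Missing: {missing}")
```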

From the directory containing these weight files, you can now run the Docker image. The command below mounts the downloaded weights into the container, starts the container, and runs CLEAN on the file price.fasta, which is already packaged in the image. To run it on your own FASTA file, copy the file into the /app/data/inputs directory.


sudo docker run -it -v ./:/root/.cache/torch/hub/checkpoints moleculemaker/clean-image-amd64 /bin/bash -c 'echo Starting Execution && python $(pwd)/CLEAN_infer_fasta.py --fasta_data price'

The output file is written to the results/inputs directory with the same name as the input file.
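If you plan to run CLEAN on your own FASTA file, parsing it beforehand can catch formatting problems before the container starts. The following is a minimal sketch of such a check using only the standard library. The function read_fasta is hypothetical and unrelated to how CLEAN itself reads its input.

```python
def read_fasta(path):
    """Parse a FASTA file into a dict mapping record ID to sequence."""
    records = {}
    current_id = None
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # Use the first whitespace-delimited token after ">" as the ID.
                tokens = line[1:].split()
                current_id = tokens[0] if tokens else ""
                if current_id in records:
                    raise ValueError(f"duplicate record ID: {current_id!r}")
                records[current_id] = ""
            elif current_id is None:
                raise ValueError("sequence data found before the first header")
            else:
                # Sequences may be wrapped over several lines; join them.
                records[current_id] += line
    return records
```

Calling `len(read_fasta("my_proteins.fasta"))` then gives the number of sequences that would be submitted; the file name here is only a placeholder.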

Other base models

You can also implement custom base models for FDR-controlled EC number prediction, as long as the model outputs a probability for each EC class.
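As an illustration of the requirement above, a model that emits unbounded per-class scores (logits) could be mapped to per-class probabilities with an element-wise sigmoid. This is only one possible choice and an assumption on our part; the function names below are hypothetical, and plain nested lists stand in for tensors to keep the sketch dependency-free.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function for a single float."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # safe: x < 0, so exp(x) cannot overflow
    return z / (1.0 + z)


def logits_to_probabilities(logits):
    """Map a [n_samples][n_labels] nested list of logits to probabilities."""
    return [[sigmoid(x) for x in row] for row in logits]
```

Each row of the result then holds one probability per EC class, independent of the other classes, which suits the multi-label setting where an enzyme can carry more than one EC number.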

False Discovery Rate (FDR)-controlled EC number prediction

This repository includes an example of CPEC predicting EC numbers in the notebook demo.ipynb, together with the raw output of our base model in the example_data/ folder. With these, you can test false discovery rate (FDR)-controlled EC number prediction entirely within this repository. Although CPEC can be applied to general machine learning prediction methods, we use PenLight2, a contrastive learning-based model, as the illustration.

If you want to test other base models, normalize the predicted probabilities into $\left[0,1\right]$ and save the predicted probabilities and ground truths as tensors of shape $\left[n_{samples}, n_{labels}\right]$. Then use the function calibrate_fdr() to compute the valid model parameter on calibration data and make FDR-controlled predictions on your own test data.
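To make the idea of calibrating on held-out data more concrete, here is a simplified sketch. It is not the repository's calibrate_fdr(), whose signature is not shown here, and it does not carry the statistical guarantee of the conformal procedure in the paper. It only searches for the smallest probability threshold whose empirical FDR on the calibration set stays at or below a target alpha. All function names are hypothetical, global min-max scaling is just one assumed way to normalize scores, and nested lists stand in for tensors of shape [n_samples, n_labels].

```python
def min_max_normalize(scores):
    """Rescale all scores into [0, 1] using the global minimum and maximum."""
    flat = [s for row in scores for s in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    return [[(s - lo) / span for s in row] for row in scores]


def empirical_fdr(probs, labels, threshold):
    """Mean per-sample false discovery proportion at a given threshold."""
    proportions = []
    for p_row, y_row in zip(probs, labels):
        predicted = [j for j, p in enumerate(p_row) if p >= threshold]
        false_hits = sum(1 for j in predicted if y_row[j] == 0)
        # A sample with no predicted labels contributes zero.
        proportions.append(false_hits / max(1, len(predicted)))
    return sum(proportions) / len(proportions)


def pick_threshold(cal_probs, cal_labels, alpha):
    """Smallest candidate threshold whose calibration FDR is at most alpha."""
    candidates = sorted({p for row in cal_probs for p in row})
    for t in candidates:
        if empirical_fdr(cal_probs, cal_labels, t) <= alpha:
            return t
    return float("inf")  # no candidate qualifies: predict nothing


if __name__ == "__main__":
    # Toy calibration data: 3 samples, 3 labels (values are illustrative only).
    cal_probs = [[0.9, 0.2, 0.1], [0.8, 0.7, 0.3], [0.4, 0.95, 0.6]]
    cal_labels = [[1, 0, 0], [1, 0, 0], [0, 1, 1]]
    print(pick_threshold(cal_probs, cal_labels, alpha=0.2))  # prints 0.6
```

On test data, the predicted EC set for a sample would then be every class whose probability is at least the chosen threshold. A lower alpha yields a higher threshold and therefore fewer, more conservative predictions.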

Citation

Ding, Kerr, Jiaqi Luo, and Yunan Luo. "Leveraging conformal prediction to annotate enzyme function space with limited false positives." PLOS Computational Biology 20, no. 5 (2024): e1012135.

@article{ding2024leveraging,
  title={Leveraging conformal prediction to annotate enzyme function space with limited false positives},
  author={Ding, Kerr and Luo, Jiaqi and Luo, Yunan},
  journal={PLOS Computational Biology},
  volume={20},
  number={5},
  pages={e1012135},
  year={2024},
  publisher={Public Library of Science San Francisco, CA USA}
}

Contact

Please submit GitHub issues or contact Kerr Ding (kerrding[at]gatech[dot]edu) and Yunan Luo (yunan[at]gatech[dot]edu) for any questions related to the source code.
