This is the official repository for the paper "Leveraging conformal prediction to annotate enzyme function space with limited false positives".
conda create -n cpec python=3.9
conda activate cpec
We use PyTorch 1.12.1, which can be installed by following the instructions on the official PyTorch website.
pip install -r requirements.txt
We provide the implementation and detailed instructions of PenLight2 in base_models/PenLight2.
python scripts/train.py path/to/config
The configuration files are in the configs directory. The default training configuration is configs/train_EC.yml.
python scripts/infer_max_sep.py --model_dir path/to/model_dir --test_data path/to/test_data
--model_dir: the log directory generated by scripts/train.py.
--test_data: path to the test data file.
The inference results will be saved in model_dir/max_sep_pred.csv.
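To take a quick look at the saved predictions, a small helper like the one below can print the first few rows. This is only a sketch: the column layout of max_sep_pred.csv is not documented here, so the helper makes no assumptions about it and returns raw rows. The function name preview_csv is hypothetical and not part of this repository.

```python
import csv
from typing import List


def preview_csv(path: str, n_rows: int = 5) -> List[List[str]]:
    """Return the first n_rows rows of a CSV file (header row included, if any)."""
    rows: List[List[str]] = []
    with open(path, newline="") as handle:
        for i, row in enumerate(csv.reader(handle)):
            if i >= n_rows:
                break
            rows.append(row)
    return rows
```

For example, `preview_csv("path/to/model_dir/max_sep_pred.csv")` returns up to five rows as lists of strings, with path/to/model_dir replaced by your own log directory.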
We provide the implementation and detailed instructions of CLEAN in base_models/CLEAN.
- Pull the Docker image for AMD64 Ubuntu machines from moleculemaker/clean-image-amd64:
docker pull moleculemaker/clean-image-amd64
- Running this library requires large weight files (around 7.3 GB), so it is better to download them in advance and mount them when running the Docker container. You can download them with:
curl -o esm1b_t33_650M_UR50S-contact-regression.pt https://dl.fbaipublicfiles.com/fair-esm/regression/esm1b_t33_650M_UR50S-contact-regression.pt
curl -o esm1b_t33_650M_UR50S.pt https://dl.fbaipublicfiles.com/fair-esm/models/esm1b_t33_650M_UR50S.pt
- From the directory containing these weight files, run the Docker image. The command below mounts the downloaded weights into the container, starts the container, and runs CLEAN on the file price.fasta (which is already packaged in the image). To run it on your own FASTA file, copy the file into the /app/data/inputs directory.
sudo docker run -it -v ./:/root/.cache/torch/hub/checkpoints moleculemaker/clean-image-amd64 /bin/bash -c 'echo Starting Execution && python $(pwd)/CLEAN_infer_fasta.py --fasta_data price'
The output file will be generated under the results/inputs directory with the same name as the input file.
The user can also implement customized base models for FDR-controlled EC number prediction as long as the model can output a probability for each EC class.
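As an illustration of the "one probability per EC class" requirement, the sketch below maps raw, unbounded model scores to per-class probabilities with a logistic function. This is one common choice and an assumption on our part, not necessarily the normalization this repository expects. The label set EC_CLASSES and the function to_class_probabilities are hypothetical names used only for this example.

```python
import math
from typing import Dict, Sequence

# Hypothetical label set, for illustration only.
EC_CLASSES = ["1.1.1.1", "2.7.11.1", "3.4.21.4"]


def sigmoid(score: float) -> float:
    """Numerically stable logistic function mapping a real score into [0, 1]."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    z = math.exp(score)
    return z / (1.0 + z)


def to_class_probabilities(
    raw_scores: Sequence[float], classes: Sequence[str] = EC_CLASSES
) -> Dict[str, float]:
    """Pair each EC class with a probability derived from its raw score."""
    if len(raw_scores) != len(classes):
        raise ValueError("expected exactly one score per EC class")
    return {ec: sigmoid(s) for ec, s in zip(classes, raw_scores)}


# One protein, three raw scores (assumed output of a custom base model).
example = to_class_probabilities([0.0, 2.0, -2.0])
```

A model that already emits values in [0, 1] for every class would not need this step.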
This repository provides an example of CPEC predicting EC numbers in the notebook demo.ipynb, with the raw output of our base model in the folder example_data/. You can test FDR-controlled (false discovery rate) EC number prediction entirely within this repository. Although CPEC can be applied to general machine learning prediction methods, we use PenLight2, a contrastive learning-based model, as the illustration.
If you want to test other base models, normalize their predicted probabilities and pass them to calibrate_fdr(). You can then calculate the valid model parameter on calibration data and make FDR-controlled predictions on your own test data.
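The calibration idea can be sketched in plain Python as follows. This is not the repository's calibrate_fdr(), whose signature is not shown here. It is a simplified illustration that picks a probability threshold on calibration data so that the empirical false discovery rate stays at or below a target alpha, and it omits any finite-sample correction that a conformal procedure may apply. The names empirical_fdr, pick_threshold, and predict, and the toy data, are hypothetical.

```python
from typing import List, Sequence, Set


def empirical_fdr(
    probs: Sequence[Sequence[float]], labels: Sequence[Set[int]], threshold: float
) -> float:
    """Fraction of predicted (protein, class) pairs that are wrong at this threshold."""
    predicted = 0
    false_pos = 0
    for row, true_classes in zip(probs, labels):
        for cls, p in enumerate(row):
            if p >= threshold:
                predicted += 1
                if cls not in true_classes:
                    false_pos += 1
    # With no predictions there are no false discoveries.
    return false_pos / predicted if predicted else 0.0


def pick_threshold(
    probs: Sequence[Sequence[float]], labels: Sequence[Set[int]], alpha: float
) -> float:
    """Lower the threshold from the top until the empirical FDR first exceeds alpha."""
    candidates = sorted({p for row in probs for p in row}, reverse=True)
    best = float("inf")  # infinity means: make no predictions at all
    for t in candidates:
        if empirical_fdr(probs, labels, t) > alpha:
            break
        best = t
    return best


def predict(row: Sequence[float], threshold: float) -> List[int]:
    """Return the indices of the classes whose probability reaches the threshold."""
    return [cls for cls, p in enumerate(row) if p >= threshold]


# Toy calibration set: 3 proteins, 3 classes, one true class each (made-up numbers).
cal_probs = [[0.9, 0.1, 0.3], [0.2, 0.8, 0.6], [0.7, 0.4, 0.05]]
cal_labels = [{0}, {1}, {2}]
threshold = pick_threshold(cal_probs, cal_labels, alpha=0.2)  # 0.8 for this toy data
```

Scanning from the highest probability downward and stopping at the first violation is a conservative design choice: every threshold at or above the returned one also meets the target on the calibration data.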
Ding, Kerr, Jiaqi Luo, and Yunan Luo. "Leveraging conformal prediction to annotate enzyme function space with limited false positives." PLOS Computational Biology 20, no. 5 (2024): e1012135.
@article{ding2024leveraging,
title={Leveraging conformal prediction to annotate enzyme function space with limited false positives},
author={Ding, Kerr and Luo, Jiaqi and Luo, Yunan},
journal={PLOS Computational Biology},
volume={20},
number={5},
pages={e1012135},
year={2024},
publisher={Public Library of Science San Francisco, CA USA}
}
Please submit GitHub issues or contact Kerr Ding (kerrding[at]gatech[dot]edu) and Yunan Luo (yunan[at]gatech[dot]edu) for any questions related to the source code.