luo-group/AnCor

AnCor



⚙️ Dependencies

This code is based on Python 3.10. Please set up the environment as follows:

conda create -n ancor python=3.10
conda activate ancor
pip install -r requirements.txt

🗂️ Data

Datasets used in the paper

  • Data splits: run bash scripts/download_data.sh to download the data splits used in this paper.
  • Wild-type fitness: the wild-type fitness values for all datasets used in this paper are provided in data/wt_summary.xlsx.

Running on customized data

To run our model on your own data, follow the steps below:

  1. Collect the DMS dataset and save it as data/$dataset/data.csv. The CSV file must contain the following columns:

    • mutated_sequence: the sequence for each variant.

    • DMS_score: the ground truth fitness value for the sequence. Please ensure that higher scores indicate better fitness.

    • mutated_position: the mutated position in the assay. Note that position numbering starts from 0.

    • PID: A unique number for each sequence, which can be generated by auto-increment.

  2. Find the fitness value of the wild-type sequence and add it to data/wt_summary.xlsx. The file must contain the following columns:

    • protein_dataset: Your dataset name.
    • wt_fitness: the ground truth fitness value for the wild-type. Please ensure that higher scores indicate better fitness.
    • seq: the wild-type sequence.
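The data.csv layout from step 1 can be sketched in Python as follows. This is a minimal illustration, not part of the repository: the helper names (build_rows, write_csv), the toy wild-type sequence, and the scores are hypothetical, and it assumes single-substitution variants, a column order chosen here for readability, and PIDs that start at 0.

```python
import csv
import io

# Column names taken from the README; the order is an assumption.
REQUIRED_COLUMNS = ["PID", "mutated_sequence", "mutated_position", "DMS_score"]


def build_rows(wt_seq, variants):
    """Build data.csv rows from (position, new_residue, score) tuples.

    Positions are 0-based, as the README requires. Only single
    substitutions are handled in this sketch.
    """
    rows = []
    for pid, (pos, residue, score) in enumerate(variants):
        if not 0 <= pos < len(wt_seq):
            raise ValueError(f"position {pos} is outside the sequence")
        mutated = wt_seq[:pos] + residue + wt_seq[pos + 1:]
        rows.append(
            {
                "PID": pid,  # auto-incremented unique number
                "mutated_sequence": mutated,
                "mutated_position": pos,
                "DMS_score": score,  # higher must mean fitter
            }
        )
    return rows


def write_csv(rows, handle):
    """Write the rows with a header line to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=REQUIRED_COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


# Toy example with made-up values; replace with your own assay data.
wt = "MKTAYIAK"
rows = build_rows(wt, [(0, "A", 0.42), (3, "G", -1.3)])
buf = io.StringIO()
write_csv(rows, buf)
csv_text = buf.getvalue()
print(csv_text)
```

To produce the actual file, open data/$dataset/data.csv with newline="" and pass the handle to write_csv instead of the in-memory buffer.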

🚀 Train AnCor

We provide an example of training AnCor in scripts/train.sh:

accelerate launch --config_file config/parallel_config.yaml ancor/train.py \
--config config/training_config.yaml \
--dataset $dataset \
--sample_seed 0 \
--model_seed 1 \
--shot 72 \
--prefix ancor_72shot 

--config: (required) specifies the file containing training hyperparameters.

--dataset: (required) specifies the dataset name.

--sample_seed: (optional) specifies the random seed used when sampling the training and test data.

--model_seed: (optional) specifies the training data seed for the 5-fold training split; choose a value from 1 to 5.
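If you want to train once per fold, the command above can be generated for each --model_seed value. The sketch below only builds and prints the command lines; it reuses the flags shown in scripts/train.sh, while the helper name train_command, the dataset name my_dataset, and the choice to loop over all five seeds are assumptions made for illustration.

```python
import shlex


def train_command(dataset, model_seed, sample_seed=0, shot=72,
                  prefix="ancor_72shot"):
    """Return the training command as an argument list for one fold."""
    if model_seed not in range(1, 6):
        raise ValueError("model_seed must be between 1 and 5")
    return [
        "accelerate", "launch",
        "--config_file", "config/parallel_config.yaml",
        "ancor/train.py",
        "--config", "config/training_config.yaml",
        "--dataset", dataset,
        "--sample_seed", str(sample_seed),
        "--model_seed", str(model_seed),
        "--shot", str(shot),
        "--prefix", prefix,
    ]


# "my_dataset" is a placeholder; use the folder name under data/.
commands = [train_command("my_dataset", seed) for seed in range(1, 6)]
for cmd in commands:
    print(shlex.join(cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) to launch the run from Python instead of copying the printed line into a shell.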

After training, run the following script to generate evaluation metrics on the test set:

python ancor/generate_metric_summary.py --dataset $dataset --shot $shot --prefix $prefix

--dataset: (required) specifies the dataset name.

--shot: (required) specifies the training size.

--prefix: (required) specifies the prefix you used in training.

The test metrics are written to results/summary_ancor_{args.shot}_{args.prefix}.xlsx.
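As a quick check of where to look for the output, the filename pattern above can be expanded in Python. This is a small sketch inferred only from the pattern stated in this README, not from the script's source; summary_path is a hypothetical helper.

```python
from pathlib import PurePosixPath


def summary_path(shot, prefix):
    """Expand the README's summary filename pattern for one run."""
    return PurePosixPath("results") / f"summary_ancor_{shot}_{prefix}.xlsx"


# Values matching the training example (--shot 72 --prefix ancor_72shot).
print(summary_path(72, "ancor_72shot"))
# results/summary_ancor_72_ancor_72shot.xlsx
```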

🧬 Customizing config files

  • Training size: For different training sizes, please modify shot in config/training_config.yaml and change the training hyperparameters in that file accordingly.

  • GPU: We trained AnCor using 4 A40 GPUs. To match the number of GPUs you use, modify num_processes in config/parallel_config.yaml and gpu_number in config/training_config.yaml.
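The two GPU settings have to stay in sync, so it can help to update them together. The sketch below rewrites a simple "key: value" line in YAML text. It assumes both keys appear as plain scalar lines in their files; the helper set_yaml_scalar and the sample file contents are hypothetical, so check the real config files before applying anything like this.

```python
import re


def set_yaml_scalar(text, key, value):
    """Replace the value of a single 'key: value' line in YAML text."""
    pattern = re.compile(
        rf"^([ \t]*{re.escape(key)}[ \t]*:[ \t]*).*$", flags=re.MULTILINE
    )
    new_text, count = pattern.subn(lambda m: f"{m.group(1)}{value}", text)
    if count != 1:
        raise KeyError(f"expected exactly one '{key}' line, found {count}")
    return new_text


# Hypothetical excerpts standing in for the two config files.
parallel_cfg = "compute_environment: LOCAL_MACHINE\nnum_processes: 4\n"
training_cfg = "gpu_number: 4\nshot: 72\n"

n_gpus = 2  # number of GPUs available on your machine
parallel_cfg = set_yaml_scalar(parallel_cfg, "num_processes", n_gpus)
training_cfg = set_yaml_scalar(training_cfg, "gpu_number", n_gpus)
print(parallel_cfg)
print(training_cfg)
```

A line-based replacement keeps comments and key order in the rest of the file untouched, which a full YAML load and dump would not guarantee.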
