AnCor

📚 Table of Contents

Dependencies
Data
- Datasets used in the paper
- Running on customized data
Train AnCor
- Customizing config files

⚙️ Dependencies

This code is based on Python 3.10. Please set up the environment as follows:

conda create -n ancor python=3.10
conda activate ancor
pip install -r requirements.txt

🗂️ Data

Datasets used in the paper

Data splits: run bash scripts/download_data.sh to download the data splits used in the paper.
Wild-type fitness: the wild-type fitness values for all datasets used in the paper are provided in data/wt_summary.xlsx.

Running on customized data

To run AnCor on your own data, follow the steps below:

Collect the DMS dataset and save it as data/$dataset/data.csv. The CSV file must contain the following columns:
- mutated_sequence: the sequence of each variant.
- DMS_score: the ground-truth fitness value of the sequence. Please ensure that higher scores indicate better fitness.
- mutated_position: the mutated position in the assay. Position numbering starts from 0.
- PID: a unique number for each sequence, which can be generated by auto-increment.
Add the fitness value of the wild-type sequence to data/wt_summary.xlsx. The file must contain the following columns:
- protein_dataset: your dataset name.
- wt_fitness: the ground-truth fitness value of the wild type. Please ensure that higher scores indicate better fitness.
- seq: the wild-type sequence.
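
The column requirements above can be checked before training with a small script. The following is a minimal sketch using only the standard library; the helper names `mutated_positions` and `check_dms_rows` are hypothetical and not part of the AnCor code base. It assumes substitution-only variants with the same length as the wild type, and it does not validate the exact format AnCor expects in mutated_position for variants with more than one mutation.

```python
import csv
import io

# Column names taken from the list above.
REQUIRED_COLUMNS = {"mutated_sequence", "DMS_score", "mutated_position", "PID"}


def mutated_positions(wild_type: str, mutant: str) -> list[int]:
    """Return the 0-based indices at which mutant differs from wild_type."""
    if len(wild_type) != len(mutant):
        raise ValueError("this sketch only handles equal-length (substitution) variants")
    return [i for i, (a, b) in enumerate(zip(wild_type, mutant)) if a != b]


def check_dms_rows(handle) -> list[str]:
    """Collect human-readable problems found in a data.csv-style file."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return [f"missing columns: {sorted(missing)}"]
    problems = []
    seen_pids = set()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        if row["PID"] in seen_pids:
            problems.append(f"line {line_no}: duplicate PID {row['PID']}")
        seen_pids.add(row["PID"])
        try:
            float(row["DMS_score"])
        except ValueError:
            problems.append(f"line {line_no}: DMS_score is not numeric")
    return problems


# In-memory example; for a real dataset, open data/$dataset/data.csv instead.
sample = io.StringIO(
    "mutated_sequence,DMS_score,mutated_position,PID\n"
    "MAV,0.5,1,0\n"
)
print(check_dms_rows(sample))  # an empty list means no problems were found
```

For example, `mutated_positions("MKV", "MAV")` returns `[1]`, because the second residue (index 1 when counting from 0) is the one that changed.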

🚀 Train AnCor

An example training command is provided in scripts/train.sh:

accelerate launch --config_file config/parallel_config.yaml ancor/train.py \
--config config/training_config.yaml \
--dataset $dataset \
--sample_seed 0 \
--model_seed 1 \
--shot 72 \
--prefix ancor_72shot

--config: (required) specifies the file containing training hyperparameters.

--dataset: (required) specifies the dataset name.

--sample_seed: (optional) specifies the random seed used when sampling the training and test data.

--model_seed: (optional) specifies the seed for the 5-fold training split; choose a value from 1 to 5.
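
If you want to train with every --model_seed value from 1 to 5, the command above can be generated programmatically. The sketch below only builds and prints the commands; the helper name `build_train_command` and the dataset name `my_dataset` are hypothetical, the default values mirror the example in scripts/train.sh, and reusing one prefix for all seeds is an assumption rather than something the repository documents.

```python
import shlex


def build_train_command(dataset, model_seed, sample_seed=0, shot=72, prefix="ancor_72shot"):
    """Build the training command from scripts/train.sh as an argument list."""
    if model_seed not in range(1, 6):
        raise ValueError("model_seed must be between 1 and 5")
    return [
        "accelerate", "launch",
        "--config_file", "config/parallel_config.yaml",
        "ancor/train.py",
        "--config", "config/training_config.yaml",
        "--dataset", dataset,
        "--sample_seed", str(sample_seed),
        "--model_seed", str(model_seed),
        "--shot", str(shot),
        "--prefix", prefix,
    ]


# Print one command per model seed; nothing is executed here.
for seed in range(1, 6):
    print(shlex.join(build_train_command("my_dataset", seed)))
```

To actually launch a run, pass the returned list to `subprocess.run(cmd, check=True)` from the repository root.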

After training, run the following script to generate evaluation metrics on the test set:

python ancor/generate_metric_summary.py --dataset $dataset --shot $shot --prefix $prefix

--dataset: (required) specifies the dataset name.

--shot: (required) specifies the training size.

--prefix: (required) specifies the prefix you used in training.

The test metrics will be written to results/summary_ancor_{args.shot}_{args.prefix}.xlsx.
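
To locate the summary file from a script, the file name pattern above can be filled in directly. This is a small sketch with a hypothetical helper name, `summary_path`; it only formats the path stated above and does not read the file.

```python
from pathlib import Path


def summary_path(shot, prefix, results_dir="results"):
    """Return the expected location of the metrics summary for a run."""
    return Path(results_dir) / f"summary_ancor_{shot}_{prefix}.xlsx"


path = summary_path(72, "ancor_72shot")
print(path, "exists" if path.exists() else "not found yet")
```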

🧬 Customizing config files

Training size: to use a different training size, modify shot in config/training_config.yaml and adjust the training hyperparameters in that file accordingly.
GPU: We trained AnCor on 4 A40 GPUs. To match the number of GPUs you use, modify num_processes in config/parallel_config.yaml and gpu_number in config/training_config.yaml.
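
The GPU settings can also be updated from a script rather than by hand. The following sketch rewrites a single `key: value` line in YAML text using only the standard library. The helper name `set_yaml_scalar` is hypothetical, and the sketch assumes each key appears once as a plain scalar on its own line; the actual layout of the two config files is not shown here, so check the result before training.

```python
import re


def set_yaml_scalar(text, key, value):
    """Replace the value of the first `key: value` line, keeping any trailing comment."""
    pattern = re.compile(
        rf"^([ \t]*{re.escape(key)}[ \t]*:[ \t]*)(.*?)([ \t]*(?:#.*)?)$",
        flags=re.MULTILINE,
    )
    new_text, count = pattern.subn(
        lambda m: f"{m.group(1)}{value}{m.group(3)}", text, count=1
    )
    if count == 0:
        raise KeyError(f"{key} not found")
    return new_text


# In-memory example with made-up file content; the real files may differ.
example = "mixed_precision: fp16\nnum_processes: 4  # one per GPU\n"
print(set_yaml_scalar(example, "num_processes", 2))
```

To apply it to the real files, read each file with `Path(...).read_text()`, call the helper with `num_processes` or `gpu_number`, and write the result back.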
