This code is based on Python 3.10. Please set up the environment as follows:
conda create -n ancor python=3.10
conda activate ancor
pip install -r requirements.txt
- Data splits: to download the data splits used in this paper, run
bash scripts/download_data.sh
- Wild-type fitness: we provide the wild-type fitness values for all datasets used in this paper in
data/wt_summary.xlsx.
To run our model on a custom dataset, follow the steps below:
- Collect the DMS dataset and put it in data/$dataset/data.csv. The CSV file must have the following columns:
  - mutated_sequence: the sequence of each variant.
  - DMS_score: the ground-truth fitness value for the sequence. Please ensure that higher scores indicate better fitness.
  - mutated_position: the mutated position in the assay. Note that position numbering starts from 0.
  - PID: a unique number for each sequence, which can be generated by auto-increment.
- Find the wild-type fitness value for the wild-type sequence and put it in data/wt_summary.xlsx. The file must have the following columns:
  - protein_dataset: your dataset name.
  - wt_fitness: the ground-truth fitness value for the wild type. Please ensure that higher scores indicate better fitness.
  - seq: the wild-type sequence.
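As a quick sanity check before training, the column requirements above can be verified with a short script. This is only a minimal sketch using the standard library. The function name check_dms_csv is hypothetical, and it assumes that mutated_position holds a single integer per row; the format for variants with several mutations is not specified here.

```python
import csv
import io

REQUIRED_COLUMNS = ("mutated_sequence", "DMS_score", "mutated_position", "PID")


def check_dms_csv(handle):
    """Return a list of problems found in a data.csv-style file (empty if none)."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]

    problems = []
    seen_pids = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        seq = row["mutated_sequence"] or ""
        try:
            float(row["DMS_score"])
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: DMS_score is not a number")
        # Assumption: one integer position per row, counted from 0.
        try:
            pos = int(row["mutated_position"])
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: mutated_position is not an integer")
        else:
            if not 0 <= pos < len(seq):
                problems.append(
                    f"line {line_no}: position {pos} outside 0..{len(seq) - 1}"
                )
        if row["PID"] in seen_pids:
            problems.append(f"line {line_no}: duplicate PID {row['PID']}")
        seen_pids.add(row["PID"])
    return problems


# Toy data for illustration only; the second row has an out-of-range position.
example = io.StringIO(
    "PID,mutated_sequence,DMS_score,mutated_position\n"
    "0,MKTA,0.5,1\n"
    "1,MKTV,1.2,4\n"
)
print(check_dms_csv(example))
```

For a real dataset, pass an open file instead of the in-memory example, e.g. open("data/my_dataset/data.csv", newline=""), where my_dataset is a placeholder name.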
We provide an example of training AnCor in scripts/train.sh:
accelerate launch --config_file config/parallel_config.yaml ancor/train.py \
--config config/training_config.yaml \
--dataset $dataset \
--sample_seed 0 \
--model_seed 1 \
--shot 72 \
--prefix ancor_72shot
--config: (required) specifies the file containing training hyperparameters.
--dataset: (required) specifies the dataset name.
--sample_seed: (optional) specifies the random seed used when sampling the training and test data.
--model_seed: (optional) specifies the seed for the 5-fold training split; choose a value from 1 to 5.
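To train on all five model seeds, the command above can be generated in a loop. The following is a rough sketch rather than part of the repository: build_train_command is a hypothetical helper, my_dataset is a placeholder dataset name, and the flags are copied from the example in scripts/train.sh. Whether runs that share a prefix overwrite each other's outputs is not stated here, so the sketch only prints the commands.

```python
import shlex


def build_train_command(dataset, shot, prefix, sample_seed=0, model_seed=1):
    """Assemble the argument list for the training command shown above."""
    if model_seed not in range(1, 6):
        raise ValueError("model_seed must be between 1 and 5")
    return [
        "accelerate", "launch",
        "--config_file", "config/parallel_config.yaml",
        "ancor/train.py",
        "--config", "config/training_config.yaml",
        "--dataset", dataset,
        "--sample_seed", str(sample_seed),
        "--model_seed", str(model_seed),
        "--shot", str(shot),
        "--prefix", prefix,
    ]


# "my_dataset" is a placeholder; replace it with your dataset name.
for seed in range(1, 6):
    cmd = build_train_command(
        "my_dataset", shot=72, prefix="ancor_72shot", model_seed=seed
    )
    print(shlex.join(cmd))
    # To launch instead of printing: subprocess.run(cmd, check=True)
```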
After training, use the following script to generate evaluation metrics on the test set:
python ancor/generate_metric_summary.py --dataset $dataset --shot $shot --prefix $prefix
--dataset: (required) specifies the dataset name.
--shot: (required) specifies the training size.
--prefix: (required) specifies the prefix you used in training.
The test metrics are written to results/summary_ancor_{args.shot}_{args.prefix}.xlsx.
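To locate the summary file from a script, the filename template above can be filled in directly. This small sketch takes the template literally; summary_path is a hypothetical helper, and the values match the training example (shot 72, prefix ancor_72shot).

```python
from pathlib import Path


def summary_path(shot, prefix, results_dir="results"):
    """Fill in the summary filename template for a given shot and prefix."""
    return Path(results_dir) / f"summary_ancor_{shot}_{prefix}.xlsx"


path = summary_path(72, "ancor_72shot")
print(path.as_posix())  # results/summary_ancor_72_ancor_72shot.xlsx
print("found" if path.exists() else "not generated yet")
```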
- Training size: for different training sizes, modify shot in config/training_config.yaml and change the training hyperparameters in that file accordingly.
- GPU: we trained AnCor using 4 A40 GPUs. Depending on the number of GPUs you use, modify num_processes and gpu_number in config/parallel_config.yaml and config/training_config.yaml, respectively.
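If you switch GPU counts often, the two values can be updated programmatically. The sketch below is only an illustration under assumptions: set_yaml_scalar is a hypothetical helper, it expects each key to appear as a plain "key: value" line, and the sample file contents are made up, since the actual layout of the config files is not shown here. It works on text with the standard library only, so comments in the files are preserved.

```python
import re


def set_yaml_scalar(text, key, value):
    """Replace the value on the first `key: value` line, keeping any comment."""
    pattern = re.compile(
        rf"^([ \t]*{re.escape(key)}[ \t]*:[ \t]*)([^#\n]*?)([ \t]*(?:#.*)?)$",
        re.MULTILINE,
    )
    new_text, count = pattern.subn(
        lambda m: f"{m.group(1)}{value}{m.group(3)}", text, count=1
    )
    if count == 0:
        raise KeyError(f"{key!r} not found")
    return new_text


# Made-up file contents for illustration; the real files may look different.
parallel_cfg = "num_processes: 4\nmixed_precision: fp16\n"
training_cfg = "shot: 72\ngpu_number: 4  # GPUs used for training\n"

n_gpus = 2
print(set_yaml_scalar(parallel_cfg, "num_processes", n_gpus))
print(set_yaml_scalar(training_cfg, "gpu_number", n_gpus))
```

To apply this to the real files, read each file's text, pass it through the helper, and write the result back. Check the result by hand if the keys turn out to be nested.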