ai4protein/VenusFactory


Recent News:

  • [2026-01-23] 🚀 Update: Added predictions for 30+ downstream tasks to VenusFactory.
  • [2025-08-10] 🎉 VenusFactory releases a free website at venusfactory.cn/playground/.
  • [2025-06-30] 🚀 Update: Added mutation zero-shot prediction functionality, supporting structure-based and sequence-based models for high-throughput mutation effect scoring.
  • [2025-04-19] 🎉 Congratulations! VenusREM achieves 1st place in ProteinGym and VenusMutHub leaderboard!
  • [2025-03-26] Added VenusPLM-300M, a protein language model trained on VenusPod and independently developed by Hong Liang's research group at Shanghai Jiao Tong University.
  • [2025-03-17] Added the Venus-PETA, Venus-ProPrime, and Venus-ProSST models. For more details, please refer to Supported Models.
  • [2025-03-05] 🎉 Congratulations! Our latest research achievement, VenusMutHub, has been officially accepted by Acta Pharmaceutica Sinica B and is now featured in a series of leaderboards!

📨 Welcome to join our WeChat Group.

📝 Your feedback is valuable! We invite you to complete our survey on Google Form or Wenjuanxing Survey.

📑 Features

🙌 VenusFactory is a unified open platform for protein engineering, supporting both graphical user interface (GUI) and command-line operations. It enables data retrieval, model training, evaluation, and deployment through a streamlined, no-code workflow.

🆒 With support for local private deployment and access to over 40 state-of-the-art deep learning models, VenusFactory lowers the barrier to scientific research and accelerates the application of AI in life sciences.

  • AI-Powered Assistance: Agent-0.1 acts as an intelligent AI assistant, providing expert answers and analysis for protein engineering tasks.
  • Efficient Workflows: Quick Tools enables rapid, no-code predictions for common tasks like protein function and mutation effect scoring.
  • Advanced Analysis: Advanced Tools offers powerful, in-depth analysis for both sequence-based and structure-based zero-shot mutation predictions.
  • Various protein language models: Venus series, ESM series, ProtTrans series, Ankh series, etc.
  • Comprehensive supervised datasets: Localization, Fitness, Solubility, Stability, etc.
  • Easy and quick data collector: AlphaFold2 Database, RCSB, InterPro, UniProt, etc.
  • Experiment monitors: Wandb, Local
  • Friendly interface: Gradio UI

🧬 Supported Models (Zero-shot prediction)

Sequence-structure models

ProSST (NeurIPS 2024), ProtSSN (eLife 2025), MIF-ST (PEDS 2022)

Structure-only models

MIF (PEDS 2022)

Sequence-only models

ESM2 (Science 2023), ESM-1v (NeurIPS 2021)

🤖 Supported Models (Training for supervised tasks)

Pre-training Protein Language Models

Venus Series Models (Published by Liang's Lab)
Model Size Parameters GPU Memory Features Template
ProSST-20 20 110M 4GB+ Mutation AI4Protein/ProSST-20
ProSST-128 128 110M 4GB+ Mutation AI4Protein/ProSST-128
ProSST-512 512 110M 4GB+ Mutation AI4Protein/ProSST-512
ProSST-1024 1024 110M 4GB+ Mutation AI4Protein/ProSST-1024
ProSST-2048 2048 110M 4GB+ Mutation AI4Protein/ProSST-2048
ProSST-4096 4096 110M 4GB+ Mutation AI4Protein/ProSST-4096
ProPrime-690M 690M 690M 16GB+ OGT-prediction AI4Protein/Prime_690M
ProPrime-650M-OGT 650M 650M 16GB+ OGT-prediction AI4Protein/ProPrime-650M-OGT
VenusPLM-300M 300M 300M 12GB+ Protein-language AI4Protein/VenusPLM-300M

💡 These models often excel in specific tasks or offer unique architectural benefits

Venus-PETA Models: Tokenization variants

BPE Tokenization Series

Model Vocab Size Parameters GPU Memory Template
PETA-base base 80M 4GB+ AI4Protein/deep_base
PETA-bpe-50 50 80M 4GB+ AI4Protein/deep_bpe_50
PETA-bpe-100 100 80M 4GB+ AI4Protein/deep_bpe_100
PETA-bpe-200 200 80M 4GB+ AI4Protein/deep_bpe_200
PETA-bpe-400 400 80M 4GB+ AI4Protein/deep_bpe_400
PETA-bpe-800 800 80M 4GB+ AI4Protein/deep_bpe_800
PETA-bpe-1600 1600 80M 4GB+ AI4Protein/deep_bpe_1600
PETA-bpe-3200 3200 80M 4GB+ AI4Protein/deep_bpe_3200

Unigram Tokenization Series

Model Vocab Size Parameters GPU Memory Template
PETA-unigram-50 50 80M 4GB+ AI4Protein/deep_unigram_50
PETA-unigram-100 100 80M 4GB+ AI4Protein/deep_unigram_100
PETA-unigram-200 200 80M 4GB+ AI4Protein/deep_unigram_200
PETA-unigram-400 400 80M 4GB+ AI4Protein/deep_unigram_400
PETA-unigram-800 800 80M 4GB+ AI4Protein/deep_unigram_800
PETA-unigram-1600 1600 80M 4GB+ AI4Protein/deep_unigram_1600
PETA-unigram-3200 3200 80M 4GB+ AI4Protein/deep_unigram_3200

💡 Different tokenization strategies may be better suited for specific tasks

ESM Series Models: Meta AI's protein language models
Model Size Parameters GPU Memory Training Data Template
ESM2-8M 8M 8M 2GB+ UR50/D facebook/esm2_t6_8M_UR50D
ESM2-35M 35M 35M 4GB+ UR50/D facebook/esm2_t12_35M_UR50D
ESM2-150M 150M 150M 8GB+ UR50/D facebook/esm2_t30_150M_UR50D
ESM2-650M 650M 650M 16GB+ UR50/D facebook/esm2_t33_650M_UR50D
ESM2-3B 3B 3B 24GB+ UR50/D facebook/esm2_t36_3B_UR50D
ESM2-15B 15B 15B 40GB+ UR50/D facebook/esm2_t48_15B_UR50D
ESM-1b 650M 650M 16GB+ UR50/S facebook/esm1b_t33_650M_UR50S
ESM-1v-1 650M 650M 16GB+ UR90/S facebook/esm1v_t33_650M_UR90S_1
ESM-1v-2 650M 650M 16GB+ UR90/S facebook/esm1v_t33_650M_UR90S_2
ESM-1v-3 650M 650M 16GB+ UR90/S facebook/esm1v_t33_650M_UR90S_3
ESM-1v-4 650M 650M 16GB+ UR90/S facebook/esm1v_t33_650M_UR90S_4
ESM-1v-5 650M 650M 16GB+ UR90/S facebook/esm1v_t33_650M_UR90S_5

💡 ESM2 models are a newer generation than ESM-1b/1v and generally offer better performance

BERT-based Models: Transformer encoder architecture
Model Size Parameters GPU Memory Training Data Template
ProtBert-Uniref100 420M 420M 12GB+ UniRef100 Rostlab/prot_bert
ProtBert-BFD 420M 420M 12GB+ BFD100 Rostlab/prot_bert_bfd
IgBert 420M 420M 12GB+ Antibody Exscientia/IgBert
IgBert-unpaired 420M 420M 12GB+ Antibody Exscientia/IgBert_unpaired

💡 BFD-trained models generally show better performance on structure-related tasks

T5-based Models: Encoder-decoder architecture
Model Size Parameters GPU Memory Training Data Template
ProtT5-XL-UniRef50 3B 3B 24GB+ UniRef50 Rostlab/prot_t5_xl_uniref50
ProtT5-XXL-UniRef50 11B 11B 40GB+ UniRef50 Rostlab/prot_t5_xxl_uniref50
ProtT5-XL-BFD 3B 3B 24GB+ BFD100 Rostlab/prot_t5_xl_bfd
ProtT5-XXL-BFD 11B 11B 40GB+ BFD100 Rostlab/prot_t5_xxl_bfd
IgT5 3B 3B 24GB+ Antibody Exscientia/IgT5
IgT5-unpaired 3B 3B 24GB+ Antibody Exscientia/IgT5_unpaired
Ankh-base 450M 450M 12GB+ Encoder-decoder ElnaggarLab/ankh-base
Ankh-large 1.2B 1.2B 20GB+ Encoder-decoder ElnaggarLab/ankh-large

💡 T5 models can be used for both encoding and generation tasks

Model Selection Guide

How to choose the right model?
  1. Based on Hardware Constraints:

    • Limited GPU (<8GB): ESM2-8M, ESM2-35M, ProSST
    • Medium GPU (8-16GB): ESM2-150M, ESM2-650M, ProtBert series
    • High-end GPU (24GB+): ESM2-3B, ProtT5-XL, Ankh-large
    • Multiple GPUs: ESM2-15B, ProtT5-XXL
  2. Based on Task Type:

    • Sequence classification: ESM2, ProtBert
    • Structure prediction: ESM2, Ankh
    • Generation tasks: ProtT5
    • Antibody design: IgBert, IgT5
    • Lightweight deployment: ProSST, PETA-base
  3. Based on Training Data:

    • General protein tasks: ESM2, ProtBert
    • Structure-aware tasks: Ankh
    • Antibody-specific: IgBert, IgT5
    • Custom tokenization needs: PETA series

🔍 All models are available through the Hugging Face Hub and can be easily loaded using their templates.
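The hardware-based part of the guide can be sketched as a simple lookup. This is a minimal illustration, not part of VenusFactory: the memory figures are copied from the "GPU Memory" column of the ESM2 table above, only the ESM2 rows are included, and `models_that_fit` is a hypothetical helper name.

```python
# Minimum GPU memory (GB) per ESM2 template, copied from the table above.
MIN_GPU_GB = {
    "facebook/esm2_t6_8M_UR50D": 2,
    "facebook/esm2_t12_35M_UR50D": 4,
    "facebook/esm2_t30_150M_UR50D": 8,
    "facebook/esm2_t33_650M_UR50D": 16,
    "facebook/esm2_t36_3B_UR50D": 24,
    "facebook/esm2_t48_15B_UR50D": 40,
}


def models_that_fit(available_gb, table=MIN_GPU_GB):
    """Return templates whose listed minimum fits in available_gb, largest first."""
    fitting = [(gb, name) for name, gb in table.items() if gb <= available_gb]
    return [name for gb, name in sorted(fitting, reverse=True)]


if __name__ == "__main__":
    # Hypothetical 12 GB card: list the candidates, largest model first.
    for template in models_that_fit(12):
        print(template)
```

The listed minimums are rough guides; actual memory use also depends on batch size, sequence length, and the training method.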

🔬 Supported Training Approaches

Approach Full-tuning Freeze-tuning SES-Adapter AdaLoRA QLoRA LoRA DoRA IA3
Supervised Fine-Tuning

📚 Supported Datasets

Pre-training datasets

| Dataset | Data level | Link |
| --- | --- | --- |
| CATH_V43_S40 | structures | CATH_V43_S40 |
| AGO_family | structures | AGO_family |

Zero-shot datasets

| Dataset | Task | Link |
| --- | --- | --- |
| VenusMutHub | mutation effects prediction | VenusMutHub |
| ProteinGym | mutation effects prediction | ProteinGym |

Supervised fine-tuning datasets (amino acid sequences/ foldseek sequences/ ss8 sequences)
dataset task data level problem type link
DeepLocBinary localization protein-wise single_label_classification DeepLocBinary_AlphaFold2, DeepLocBinary_ESMFold
DeepLocMulti localization protein-wise multi_label_classification DeepLocMulti_AlphaFold2, DeepLocMulti_ESMFold
DeepLoc2Multi localization protein-wise single_label_classification DeepLoc2Multi_AlphaFold2, DeepLoc2Multi_ESMFold
DeepSol solubility protein-wise single_label_classification DeepSol_ESMFold
DeepSoluE solubility protein-wise single_label_classification DeepSoluE_ESMFold
ProtSolM solubility protein-wise single_label_classification ProtSolM_ESMFold
eSOL solubility protein-wise regression eSOL_AlphaFold2, eSOL_ESMFold
DeepET_Topt optimum temperature protein-wise regression DeepET_Topt_AlphaFold2, DeepET_Topt_ESMFold
EC function protein-wise multi_label_classification EC_AlphaFold2, EC_ESMFold
GO_BP function protein-wise multi_label_classification GO_BP_AlphaFold2, GO_BP_ESMFold
GO_CC function protein-wise multi_label_classification GO_CC_AlphaFold2, GO_CC_ESMFold
GO_MF function protein-wise multi_label_classification GO_MF_AlphaFold2, GO_MF_ESMFold
MetalIonBinding binding protein-wise single_label_classification MetalIonBinding_AlphaFold2, MetalIonBinding_ESMFold
Thermostability stability protein-wise regression Thermostability_AlphaFold2, Thermostability_ESMFold

✨ For the same dataset, only the structural sequences differ. For example, DeepLocBinary_ESMFold and DeepLocBinary_AlphaFold2 share the same amino acid sequences, so if you only need the aa_seqs, either one works.

Supervised fine-tuning datasets (amino acid sequences)
dataset task data level problem type link
Demo_Solubility solubility protein-wise single_label_classification Demo_Solubility
DeepLocBinary localization protein-wise single_label_classification DeepLocBinary
DeepLocMulti localization protein-wise multi_label_classification DeepLocMulti
DeepLoc2Multi localization protein-wise single_label_classification DeepLoc2Multi
DeepSol solubility protein-wise single_label_classification DeepSol
DeepSoluE solubility protein-wise single_label_classification DeepSoluE
ProtSolM solubility protein-wise single_label_classification ProtSolM
eSOL solubility protein-wise regression eSOL
DeepET_Topt optimum temperature protein-wise regression DeepET_Topt
EC function protein-wise multi_label_classification EC
GO_BP function protein-wise multi_label_classification GO_BP
GO_CC function protein-wise multi_label_classification GO_CC
GO_MF function protein-wise multi_label_classification GO_MF
MetalIonBinding binding protein-wise single_label_classification MetalIonBinding
Thermostability stability protein-wise regression Thermostability
PaCRISPR CRISPR protein-wise single_label_classification PaCRISPR
PETA_CHS_Sol solubility protein-wise single_label_classification PETA_CHS_Sol
PETA_LGK_Sol solubility protein-wise single_label_classification PETA_LGK_Sol
PETA_TEM_Sol solubility protein-wise single_label_classification PETA_TEM_Sol
SortingSignal sorting signal protein-wise single_label_classification SortingSignal
FLIP_AAV mutation protein-site regression
FLIP_AAV_one-vs-rest mutation protein-site single_label_classification FLIP_AAV_one-vs-rest
FLIP_AAV_two-vs-rest mutation protein-site single_label_classification FLIP_AAV_two-vs-rest
FLIP_AAV_mut-des mutation protein-site single_label_classification FLIP_AAV_mut-des
FLIP_AAV_des-mut mutation protein-site single_label_classification FLIP_AAV_des-mut
FLIP_AAV_seven-vs-rest mutation protein-site single_label_classification FLIP_AAV_seven-vs-rest
FLIP_AAV_low-vs-high mutation protein-site single_label_classification FLIP_AAV_low-vs-high
FLIP_AAV_sampled mutation protein-site single_label_classification FLIP_AAV_sampled
FLIP_GB1 mutation protein-site regression
FLIP_GB1_one-vs-rest mutation protein-site single_label_classification FLIP_GB1_one-vs-rest
FLIP_GB1_two-vs-rest mutation protein-site single_label_classification FLIP_GB1_two-vs-rest
FLIP_GB1_three-vs-rest mutation protein-site single_label_classification FLIP_GB1_three-vs-rest
FLIP_GB1_low-vs-high mutation protein-site single_label_classification FLIP_GB1_low-vs-high
FLIP_GB1_sampled mutation protein-site single_label_classification FLIP_GB1_sampled
TAPE_Fluorescence fluorescence protein-site regression TAPE_Fluorescence
TAPE_Stability stability protein-site regression TAPE_Stability

📈 Supported Metrics

| Name | Torchmetrics | Problem Type |
| --- | --- | --- |
| accuracy | Accuracy | single_label_classification / multi_label_classification |
| recall | Recall | single_label_classification / multi_label_classification |
| precision | Precision | single_label_classification / multi_label_classification |
| f1 | F1Score | single_label_classification / multi_label_classification |
| mcc | MatthewsCorrCoef | single_label_classification / multi_label_classification |
| auc | AUROC | single_label_classification / multi_label_classification |
| f1_max | F1ScoreMax | multi_label_classification |
| spearman_corr | SpearmanCorrCoef | regression |
| mse | MeanSquaredError | regression |
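The table can be read as a mapping from problem type to valid metric names. The sketch below encodes exactly the rows above; `SUPPORTED_METRICS` and `metrics_for` are hypothetical names used for illustration and are not taken from the VenusFactory code.

```python
CLASSIFICATION = {"single_label_classification", "multi_label_classification"}

# Metric name -> problem types it applies to, transcribed from the table above.
SUPPORTED_METRICS = {
    "accuracy": CLASSIFICATION,
    "recall": CLASSIFICATION,
    "precision": CLASSIFICATION,
    "f1": CLASSIFICATION,
    "mcc": CLASSIFICATION,
    "auc": CLASSIFICATION,
    "f1_max": {"multi_label_classification"},
    "spearman_corr": {"regression"},
    "mse": {"regression"},
}


def metrics_for(problem_type):
    """List the metric names the table marks as valid for a problem type."""
    names = [n for n, types in SUPPORTED_METRICS.items() if problem_type in types]
    if not names:
        raise ValueError(f"unknown problem type: {problem_type!r}")
    return names


if __name__ == "__main__":
    print(metrics_for("regression"))
```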

🤖 Agent-0.1: Your AI Assistant

Agent-0.1 is an intelligent AI assistant integrated into the VenusFactory platform, designed to answer questions and provide in-depth analysis on protein engineering and bioinformatics. It acts as a specialized expert, helping both biologists and AI researchers streamline their research workflow.

Key Features:

  • Zero-shot Prediction: Directly utilize cutting-edge sequence-based (e.g., ESM-2, ESM-1v, ESM-1b) and structure-based models (e.g., SaProt, ProtSSN, ESM-IF1, MIF-ST, ProSST) to perform zero-shot mutation prediction.
  • Protein Function: Accurately predict various protein functions, including solubility, localization, metal ion binding, stability, sorting signal, and optimum temperature.
  • Clear Insights: Always provides clear, actionable insights in response to your queries.

💡 Note: This feature requires an API key to access and is currently in Beta.

⚡ Quick Tools: Your Go-to for Rapid Predictions

Quick Tools is designed for users who need fast, efficient, and straightforward analysis without extensive configuration. It provides a no-code entry point to two key prediction tasks.

  • Directed Evolution: AI-Powered Mutation Prediction. This tool allows for the rapid scoring and analysis of protein mutations. Simply upload a PDB file or paste the PDB content, and the platform will provide insights into the effects of single or multiple mutations on the protein.

  • Protein Function: Leveraging pre-trained models, this module predicts various protein functions from a given amino acid sequence. You can upload a FASTA file or paste the sequence directly to predict properties such as solubility, localization, and more.
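For readers preparing input outside the UI, the FASTA format mentioned above is simple to read by hand. A minimal sketch in plain Python, assuming standard FASTA conventions (a `>` header line followed by one or more sequence lines); `parse_fasta` is a hypothetical helper, not a VenusFactory function.

```python
def parse_fasta(text):
    """Parse FASTA text into {record_id: sequence}.

    The record id is the first whitespace-delimited token after '>'.
    Sequence lines that belong to one record are concatenated.
    """
    records = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            # Fall back to a positional id if the header is empty.
            current = tokens[0] if tokens else f"seq{len(records) + 1}"
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


if __name__ == "__main__":
    example = ">demo_protein example entry\nMKTAYIAK\nQRQISFVK\n"
    print(parse_fasta(example))
```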

🧪 Advanced Tools: For In-depth Protein Analysis

Advanced Tools is built for researchers who require more granular control and deeper analysis. It offers powerful zero-shot prediction capabilities by allowing you to choose between two distinct model types.

  • Sequence-based Model: This submodule focuses on high-throughput mutation effect scoring using powerful sequence-only models like ESM-2. You can upload a FASTA file or paste a protein sequence to perform large-scale predictions and score mutations.

  • Structure-based Model: For tasks that require a deep understanding of protein 3D geometry, this tool utilizes structure-aware models like ESM-IF1. By uploading a PDB file or pasting its content, you can perform sophisticated zero-shot predictions that take the protein's spatial context into account.
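Zero-shot mutation scoring is commonly implemented as a log-odds comparison between the mutant and wild-type residue at the mutated position. The sketch below illustrates that idea with hand-written probabilities. It is not VenusFactory's implementation: the function names, the `K2R`-style mutation notation, and the `site_probs` structure are assumptions made for illustration, and in practice the probabilities would come from a protein language model.

```python
import math
import re

# Hypothetical mutation notation: wild-type residue, 1-based position, mutant residue.
MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_mutation(mutation, sequence):
    """Split e.g. 'K2R' into ('K', 2, 'R') and check it against the sequence."""
    match = MUTATION_RE.match(mutation)
    if match is None:
        raise ValueError(f"cannot parse mutation {mutation!r}")
    wt, pos, mt = match.group(1), int(match.group(2)), match.group(3)
    if not 1 <= pos <= len(sequence):
        raise ValueError(f"position {pos} is outside the sequence")
    if sequence[pos - 1] != wt:
        raise ValueError(f"expected {sequence[pos - 1]} at position {pos}, got {wt}")
    return wt, pos, mt


def log_odds_score(mutation, sequence, site_probs):
    """Score = log p(mutant) - log p(wild type) at the mutated position.

    site_probs maps a 1-based position to {amino acid: probability}.
    Positive scores mean the model prefers the mutant residue.
    """
    wt, pos, mt = parse_mutation(mutation, sequence)
    probs = site_probs[pos]
    return math.log(probs[mt]) - math.log(probs[wt])


if __name__ == "__main__":
    toy_probs = {2: {"K": 0.5, "R": 0.25, "A": 0.05}}  # made-up numbers
    print(log_odds_score("K2R", "MKT", toy_probs))
```

Multiple mutations are often scored by summing the per-site scores, which assumes the sites act independently.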

✈️ Requirements

Hardware Requirements

  • Recommended: NVIDIA RTX 3090 (24GB) or better
  • Actual requirements depend on your chosen protein language model

Software Requirements

📦 Installation Guide

Get started on macOS

To achieve the best performance and experience, we recommend using Mac devices with M-series chips (such as M1, M2, M3, etc.).

1️⃣ Clone the repository

First, get the VenusFactory code:

git clone https://github.com/AI4Protein/VenusFactory.git
cd VenusFactory

2️⃣ Create a Conda environment

Ensure you have Anaconda or Miniconda installed. Then, create a new environment named venus with Python 3.12:

conda create -n venus python=3.12
conda activate venus

3️⃣ Install PyTorch and PyG dependencies

# Install PyTorch
pip install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu

# Install PyG dependencies
pip install torch_scatter torch-sparse torch-geometric -f https://data.pyg.org/whl/torch-2.8.0+cpu.html

4️⃣ Install remaining dependencies

Install the remaining dependencies using requirements_for_macOS.txt:

pip install -r requirements_for_macOS.txt
Get started on Windows or Linux with CUDA 12.8

We recommend using CUDA 12.8

1️⃣ Clone the repository

First, get the VenusFactory code:

git clone https://github.com/AI4Protein/VenusFactory.git
cd VenusFactory

2️⃣ Create a Conda environment

Ensure you have Anaconda or Miniconda installed. Then, create a new environment named venus with Python 3.12:

conda create -n venus python=3.12
conda activate venus

3️⃣ Install PyTorch and PyG dependencies

# Install PyTorch
pip install torch==2.8.0 torchvision --index-url https://download.pytorch.org/whl/cu128

# Install PyG dependencies
pip install torch_geometric
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.8.0+cu128.html

4️⃣ Install remaining dependencies

Install the remaining dependencies using requirements.txt:

pip install -r requirements.txt
Get started on Windows or Linux with CUDA 11.x

We recommend using CUDA 11.8 or later versions, as they support higher versions of PyTorch, providing a better experience.

1️⃣ Clone the repository

First, get the VenusFactory code:

git clone https://github.com/AI4Protein/VenusFactory.git
cd VenusFactory

2️⃣ Create a Conda environment

Ensure you have Anaconda or Miniconda installed. Then, create a new environment named venus with Python 3.12:

conda create -n venus python=3.12
conda activate venus

3️⃣ Install PyTorch and PyG dependencies

# Install PyTorch
pip install torch==2.7.0 --index-url https://download.pytorch.org/whl/cu118

# Install PyG dependencies
pip install torch_geometric
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.7.0+cu118.html

4️⃣ Install remaining dependencies

Install the remaining dependencies using requirements.txt:

pip install -r requirements.txt
Get started on Windows or Linux with CPU only

1️⃣ Clone the repository

First, get the VenusFactory code:

git clone https://github.com/AI4Protein/VenusFactory.git
cd VenusFactory

2️⃣ Create a Conda environment

Ensure you have Anaconda or Miniconda installed. Then, create a new environment named venus with Python 3.12:

conda create -n venus python=3.12
conda activate venus

3️⃣ Install PyTorch and PyG dependencies

# Install PyTorch
pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu

# Install PyG dependencies
pip install torch_geometric
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.8.0+cpu.html

4️⃣ Install remaining dependencies

Install the remaining dependencies using requirements.txt:

pip install -r requirements.txt
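After installation, a quick check that the core packages are discoverable can save a failed first run. This is a small sketch using only the Python standard library. The module list is an assumption based on the install commands above (the packages common to all four setups), and `missing_modules` is a hypothetical helper.

```python
import importlib.util

# Import names of the packages installed in step 3 on every platform (assumed list).
REQUIRED = ["torch", "torch_geometric", "torch_scatter", "torch_sparse"]


def missing_modules(names=REQUIRED):
    """Return the top-level modules that Python cannot locate.

    find_spec only looks the module up; it does not import it,
    so this check is fast and has no side effects.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All core packages found.")
```

Run it inside the activated `venus` environment so that it inspects the right interpreter.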

🚀 Quick Start with Venus Web UI

Start Venus Web UI

Get started quickly with our intuitive graphical interface powered by Gradio:

python ./src/webui.py

This will launch the Venus Web UI where you can:

  • Configure and run fine-tuning experiments
  • Monitor training progress
  • Evaluate models
  • Visualize results

Using Each Tab

We provide a detailed guide to help you navigate through each tab of the Venus Web UI.

1. Training Tab: Train your own protein language model

Select a protein language model from the dropdown menu. Upload your dataset or select from available datasets and choose metrics appropriate for your problem type.

Choose a training method (Freeze, SES-Adapter, LoRA, QLoRA, etc.) and configure training parameters (batch size, learning rate, etc.).

Click "Start Training" and monitor progress in real time.

Click "Download CSV" to download the test metric results.

2. Evaluation Tab: Evaluate your trained model within a benchmark

Load your trained model by specifying the model path. Select the same protein language model and model configs used during training. Select a test dataset and configure batch size. Choose evaluation metrics appropriate for your problem type. Finally, click "Start Evaluation" to view performance metrics.

3. Prediction Tab: Use your trained model to predict samples

Load your trained model by specifying the model path. Select the same protein language model and model configs used during training.

For single sequence: Enter a protein sequence in the text box.

For batch prediction: Upload a CSV file with sequences.

Click "Predict" to generate and view results.

4. Download Tab: Collect data from different sources with high efficiency
  • AlphaFold2 Structures: Enter UniProt IDs to download protein structures
  • UniProt: Search for protein information using keywords or IDs
  • InterPro: Retrieve protein family and domain information
  • RCSB PDB: Download experimental protein structures
5. Manual Tab: Detailed documentation and guides

Select a language (English/Chinese).

Navigate through the documentation using the table of contents and find step-by-step guides.

🧬 Command-line Usage

For users who prefer the command-line interface, we provide scripts for different scenarios.

Training Methods: Various fine-tuning approaches for different needs

Full Model Fine-tuning

# Freeze-tuning: Train only specific layers while freezing others
bash ./script/train/train_plm_vanilla.sh

Parameter-Efficient Fine-tuning (PEFT)

# SES-Adapter: Simple, Efficient, and Scalable structure-aware adapter fine-tuning
bash ./script/train/train_plm_ses-adapter.sh

# AdaLoRA: Adaptive Low-Rank Adaptation
bash ./script/train/train_plm_adalora.sh

# QLoRA: Quantized Low-Rank Adaptation
bash ./script/train/train_plm_qlora.sh

# LoRA: Low-Rank Adaptation
bash ./script/train/train_plm_lora.sh

# DoRA: Weight-Decomposed Low-Rank Adaptation
bash ./script/train/train_plm_dora.sh

# IA3: Infused Adapter by Inhibiting and Amplifying Inner Activations
bash ./script/train/train_plm_ia3.sh

Training Method Comparison

| Method | Memory Usage | Training Speed | Performance |
| --- | --- | --- | --- |
| Freeze | Low | Fast | Good |
| SES-Adapter | Medium | Medium | Better |
| AdaLoRA | Low | Medium | Better |
| QLoRA | Very Low | Slower | Good |
| LoRA | Low | Fast | Good |
| DoRA | Low | Medium | Better |
| IA3 | Very Low | Fast | Good |
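The low memory footprint of the LoRA-style methods follows from simple arithmetic: a rank-r update to a d_out × d_in weight matrix trains r × (d_in + d_out) parameters instead of d_out × d_in. The sketch below works through one example. The rank of 8 is an arbitrary illustrative choice, the hidden size of 1280 is that of ESM2-650M, and neither value is claimed to be a VenusFactory default.

```python
def full_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable parameters of a LoRA update B @ A,
    with B of shape (d_out, rank) and A of shape (rank, d_in)."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    hidden = 1280  # hidden size of ESM2-650M
    rank = 8       # illustrative rank, not a project default
    full = full_params(hidden, hidden)
    lora = lora_params(hidden, hidden, rank)
    print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4f}")
```

For one square projection matrix this is 20,480 trainable parameters against 1,638,400, or 1.25% of the full matrix.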
Model Evaluation: Comprehensive evaluation tools

Basic Evaluation

# Evaluate model performance on test sets
bash ./script/eval/eval.sh

Available Metrics

  • Classification: accuracy, precision, recall, F1, MCC, AUC
  • Regression: MSE, Spearman correlation
  • Multi-label: F1-max
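For intuition about the two regression metrics, here is a dependency-free sketch. The source lists the Torchmetrics classes for actual use; this code only illustrates the definitions (Spearman correlation as the Pearson correlation of average ranks, MSE as the mean squared difference), and all function names are illustrative.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mse(pred, target):
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


if __name__ == "__main__":
    print(spearman([0.1, 0.4, 0.35, 0.8], [1.0, 2.0, 3.0, 4.0]))
    print(mse([1.0, 2.0], [1.0, 4.0]))
```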

Visualization Tools

  • Training curves
  • Confusion matrices
  • ROC curves
  • Performance comparison plots
Structure Sequence Tools: Process protein structure information

ESM Structure Sequence

# Generate structure sequences using ESM-3
bash ./script/get_get_structure_seq/get_esm3_structure_seq.sh

Secondary Structure

# Predict protein secondary structure
bash ./script/get_get_structure_seq/get_secondary_structure_seq.sh

Features:

  • Support for multiple sequence formats
  • Batch processing capability
  • Integration with popular structure prediction tools
Data Collection Tools: Multi-source protein data acquisition

Format Conversion

# Convert CIF format to PDB
bash ./script/tools/file/convert/maxit.sh

Metadata Collection

# Download metadata from RCSB PDB
bash ./script/tools/search/database/rcsb/download_rcsb_meta.sh

Sequence Data

# Download protein sequences from UniProt
bash ./script/tools/search/database/uniprot/download_uniprot_seq.sh

Structure Data

# Download from AlphaFold2 Database
bash ./script/tools/search/database/alphafold/download_alphafold_structure.sh

# Download from RCSB PDB
bash ./script/tools/search/database/rcsb/download_rcsb_structure.sh

Features:

  • Automated batch downloading
  • Resume interrupted downloads
  • Data integrity verification
  • Multiple source support
  • Customizable search criteria

Supported Databases

| Database | Data Type | Access Method | Rate Limit |
| --- | --- | --- | --- |
| AlphaFold2 | Structures | REST API | Yes |
| RCSB PDB | Structures | FTP/HTTP | No |
| UniProt | Sequences | REST API | Yes |
| InterPro | Domains | REST API | Yes |
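To make the access methods concrete, the sketch below builds download URLs for three of these databases without performing any network request. The URL patterns are the publicly documented ones at the time of writing and are not taken from the VenusFactory scripts, which may use different endpoints. The AlphaFold model version suffix changes between database releases, so it is exposed as a parameter whose default is an assumption.

```python
def alphafold_url(uniprot_id, version=4):
    """PDB file of the AlphaFold DB prediction for a UniProt accession.

    The version number is an assumption; check the current database release.
    """
    return f"https://alphafold.ebi.ac.uk/files/AF-{uniprot_id}-F1-model_v{version}.pdb"


def rcsb_url(pdb_id):
    """Experimental structure from RCSB PDB in PDB format."""
    return f"https://files.rcsb.org/download/{pdb_id.upper()}.pdb"


def uniprot_fasta_url(accession):
    """Protein sequence from UniProtKB in FASTA format."""
    return f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"


if __name__ == "__main__":
    for url in (alphafold_url("P69905"), rcsb_url("1a3n"), uniprot_fasta_url("P69905")):
        print(url)
```

When downloading many entries from the rate-limited sources, add a pause between requests.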
Usage Examples: Common scenarios and solutions

Training Example

# Train a protein solubility predictor using ESM2
bash ./script/train/train_plm_lora.sh \
    --model "facebook/esm2_t33_650M_UR50D" \
    --dataset "DeepSol" \
    --batch_size 32 \
    --learning_rate 1e-4
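To repeat this training run over several settings, the command can be assembled programmatically. A minimal sketch, assuming the script accepts exactly the flags shown in the example above; `build_train_command` is a hypothetical helper and not part of VenusFactory. The sketch only prints the commands; pass a command list to `subprocess.run` to execute it.

```python
import shlex

TRAIN_SCRIPT = "./script/train/train_plm_lora.sh"  # path from the example above


def build_train_command(model, dataset, batch_size, learning_rate):
    """Return the training command as an argument list (safe for subprocess.run)."""
    return [
        "bash", TRAIN_SCRIPT,
        "--model", model,
        "--dataset", dataset,
        "--batch_size", str(batch_size),
        "--learning_rate", str(learning_rate),
    ]


if __name__ == "__main__":
    # Small learning-rate sweep; the values are illustrative.
    for lr in (1e-4, 5e-4, 1e-3):
        cmd = build_train_command("facebook/esm2_t33_650M_UR50D", "DeepSol", 32, lr)
        print(shlex.join(cmd))
```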

Evaluation Example

# Evaluate the trained model
bash ./script/eval/eval.sh \
    --model_path "path/to/your/model" \
    --test_dataset "DeepSol_test"

Data Collection Example

# Download structures for a list of UniProt IDs
bash ./script/tools/search/database/alphafold/download_alphafold_structure.sh

💡 All scripts support additional command-line arguments for customization. Use --help with any script to see available options.

🙌 Citation

Please cite our work if you have used our code or data.

@inproceedings{tan2025venusfactory,
  title={VenusFactory: An Integrated System for Protein Engineering with Data Retrieval and Language Model Fine-Tuning},
  author={Tan, Yang and Liu, Chen and Gao, Jingyuan and Banghao, Wu and Li, Mingchen and Wang, Ruilin and Zhang, Lingrong and Yu, Huiqun and Fan, Guisheng and Hong, Liang and others},
  booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)},
  pages={230--241},
  year={2025}
}

🎊 Acknowledgement

Thanks to Liang's Lab for their support.

About

🏭 AI agent platform for protein engineering. (ACL Demo 2025)
