Language may be all omics needs: Harmonizing multimodal data for omics understanding with CellHermes
Maintained by the Theis lab.
This repository hosts the official implementation of CellHermes, a framework that unifies heterogeneous single-cell omics data with existing LLMs. Building on the powerful capabilities of LLMs, such as text-based understanding and reasoning, CellHermes can be used as an encoder, a predictor, and an explainer. This design allows CellHermes to span the entire research loop from representation learning to prediction and interpretability.
Our experiments were conducted with python 3.10.15 and CUDA 12.6. We recommend using Anaconda / Miniconda to create a conda environment for CellHermes. You can create a python environment using the following command:

```bash
conda create -n CellHermes python=3.10.15
```

Then, activate the environment and install the required packages:

```bash
conda activate CellHermes
pip install -r requirement.txt
```

CellHermes is trained with LLaMA-Factory (version 0.9.1), so the LLaMA-Factory environment needs to be configured as well. Switch into the LLaMA-Factory directory and install it with:

```bash
cd LLama_factory_v0.9.1.dev0
pip install -e ".[torch,metrics]"
```
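Before training, it can save time to confirm the environment matches the tested setup. The snippet below is a convenience sketch, not part of the repository; it only checks the interpreter version and whether the `llamafactory-cli` entry point installed above is on the PATH.

```python
# Quick environment sanity check (a convenience sketch, not part of the repo).
import shutil
import sys

def python_matches(required=(3, 10)):
    """CellHermes was tested on python 3.10.15; compare major.minor only."""
    return sys.version_info[:2] == required

print("python 3.10:", python_matches())
# llamafactory-cli is provided by the `pip install -e ".[torch,metrics]"` step above
print("llamafactory-cli on PATH:", shutil.which("llamafactory-cli") is not None)
```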
We release the following variants of **CellHermes**. Please download them to the model_ckpt directory.
| Model Name | Stage | Description |
|---|---|---|
| LLaMA-3.1-8B-Instruct | Base | The base LLM used in this study |
| CellHermes | Pretraining | LLM pretrained on single-cell transcriptomic data and a PPI network simultaneously |
| CellHermes-Multi-Task | Instruction fine-tuning | Instruction-tuned model adapter trained on 7 databases across 10 tasks |
| CellHermes-T-Cell-Reactivity | Instruction fine-tuning | Instruction-tuned model adapter with T cell tumor-reactivity prediction task |
We provide the datasets on Zenodo. Please download them to the project directory and extract them with the following commands:

```bash
cd data
unzip pretrain_datasets.zip
unzip multitask_datasets.zip
unzip perturbation_scaling_law_dataset.zip
unzip t_cell_reactivity_dataset.zip
unzip gene_level_downstream_tasks.zip
unzip cell_level_downstream_tasks.zip
unzip benchmarked_gene_embeddings.zip
unzip benchmarked_cell_embeddings.zip
```

The overall directory structure of the project is as follows:
```
├── 📁 scripts/                               # source code
├── 📁 bash_config/                           # training & inference configs
├── 📁 data/                                  # datasets
│   ├── 📁 pretrain_datasets/                 # datasets for pretraining CellHermes
│   ├── 📁 multitask_datasets/                # datasets for fine-tuning CellHermes for multi-task prediction
│   ├── 📁 perturbation_scaling_law_dataset/  # datasets for testing the scaling law on genetic perturbation prediction
│   ├── 📁 t_cell_reactivity_dataset/         # datasets for fine-tuning CellHermes for T cell tumor reactivity
│   ├── 📁 gene_level_downstream_tasks/       # gene-level benchmarking datasets
│   ├── 📁 cell_level_downstream_tasks/       # cell-level benchmarking datasets
│   ├── 📁 benchmarked_gene_embeddings/       # gene embeddings from various benchmarked models
│   └── 📁 benchmarked_cell_embeddings/       # cell embeddings from various benchmarked models on various datasets
└── 📁 model_ckpt/                            # pretrained checkpoints
    ├── 📁 LLaMA-3.1-8B-Instruct/             # base open-source LLM
    ├── 📁 CellHermes/                        # CellHermes model
    ├── 📁 CellHermes-Multi-Task/             # multi-task CellHermes model
    └── 📁 CellHermes-T-Cell-Reactivity/      # T cell reactivity prediction model
```
Model training is conducted on 2 NVIDIA RTX A6000 GPUs.

```bash
conda activate CellHermes
cd LLama_factory_v0.9.1.dev0
bash ../bash_config/pretrain.sh
llamafactory-cli export ../bash_config/merge_lora_config.yaml
```

The following commands encode biological entities with CellHermes, such as genes, cells, and cell-specific genes.
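The encoder takes plain-text prompts as input. For cells, the ranked-gene prompt used in the examples below can be derived from an expression profile by sorting genes in descending order of expression; the helper and expression values here are illustrative, not part of the repository.

```python
# Build a ranked-gene cell prompt from an expression profile by sorting genes
# in descending expression (gene values here are made up for illustration).
def cell_prompt(expression, top_k=None):
    """expression: mapping of gene symbol -> expression value."""
    ranked = sorted(expression, key=expression.get, reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    return "A cell with genes ranked by expression: " + " ".join(ranked)

profile = {"MALAT1": 9.1, "B2M": 7.4, "TMSB4X": 8.2}
print(cell_prompt(profile))
# -> A cell with genes ranked by expression: MALAT1 TMSB4X B2M
```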
Obtaining the embedding of a given gene:

```bash
conda activate CellHermes
python ./scripts/CellHermes_as_encoder_for_embedding.py \
-m ./model_ckpt/CellHermes \
-i "Gene BRCA1" \
-o ./output/gene_tmp_emb.pkl
```

Obtaining the embedding of a given cell from its transcriptomic information (gene rank in this case):

```bash
conda activate CellHermes
python ./scripts/CellHermes_as_encoder_for_embedding.py \
-m ./model_ckpt/CellHermes \
-i "A cell with genes ranked by expression: MALAT1 TMSB4X B2M SRGN FTH1 BTG1 GNLY TPT1 EEF1A1 HLA-A ZFP36L2 PTMA HLA-B TMSB10 XCL1 PABPC1 ANXA1" \
-o ./output/cell_tmp_emb.pkl
```

Obtaining the embedding of a given gene in a specific cell, conditioned on that cell's transcriptomic information (gene rank in this case):

```bash
conda activate CellHermes
python ./scripts/CellHermes_as_encoder_for_embedding.py \
-m ./model_ckpt/CellHermes \
-i "A cell with genes ranked by expression: MALAT1 TMSB4X B2M RGS1 CCL3 CCL4 CD69 JUNB HSP90AA1 ZFP36 FTH1 DNAJB1 DUSP1 SAT1 CXCR4. In this cell, Gene BRCA1" \
-o ./output/cell_specific_gene_tmp_emb.pkl
```

The following commands fine-tune CellHermes on multiple task datasets, such as perturbation prediction, cell fitness prediction, and gene interaction prediction. Users can change the --dataset parameter in the multitask_ft.sh file to use any dataset they want.
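LLaMA-Factory typically consumes datasets as a JSON list of Alpaca-style records. The record below is a hypothetical sketch (its instruction/input/output contents are invented); match it against the released multitask_datasets files before adding your own task.

```python
# One Alpaca-style instruction record of the kind LLaMA-Factory consumes.
# The instruction/input/output contents here are invented for illustration.
import json

record = {
    "instruction": "Predict whether the following two genes interact.",
    "input": "Gene BRCA1 and Gene BARD1",
    "output": "Yes",
}

# A dataset file is a JSON list of such records, registered in
# LLaMA-Factory's data/dataset_info.json under the name passed to --dataset.
payload = json.dumps([record], indent=2)
print(payload)
```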
```bash
conda activate CellHermes
cd LLama_factory_v0.9.1.dev0
bash ../bash_config/multitask_ft.sh
```

The following commands run inference on one of the downstream test datasets.
```bash
conda activate CellHermes
python ./scripts/CellHermes_as_predictor_for_prediction.py \
-m ./model_ckpt/CellHermes \
-a ./model_ckpt/CellHermes-Multi-Task \
-i ./data/task_tmp.json \
-o ./output/task_tmp_predictions.jsonl
```

The following commands explain CellHermes's prediction results through text-based reasoning.
```bash
conda activate CellHermes
python ./scripts/CellHermes_as_explainer_for_reasoning.py \
-m ./model_ckpt/CellHermes \
-a ./model_ckpt/CellHermes-T-Cell-Reactivity \
-i "Given a T cell from metastatic melanoma patients with its top 100 highly expressed gene list, ranked by expression level: RGS1 CCL3 CCL4 CD69 JUNB HSP90AA1. You think that this T cell is Reactive. Please explain your reasoning." \
-o ./output/cell_tmp_reasoning.pkl
```

We gratefully acknowledge the use of code from the following projects: LLaMA-Factory, scGPT, and GenePT. Our work builds upon their foundational contributions.
Yicheng Gao et al. "Language may be all omics needs: Harmonizing multimodal data for omics understanding with CellHermes." bioRxiv, 2025.
