Geneformer
- Christina V. Theodoris (Geneformer-V1: Dana-Farber Cancer Institute; Broad Institute of MIT and Harvard; Geneformer-V2: Gladstone Institutes; University of California, San Francisco)
- Han Chen (Geneformer-V2: Gladstone Institutes; University of California, San Francisco)
Geneformer is a foundational transformer model pretrained on a large-scale corpus of single-cell transcriptomes (initially ~30 million, now >100 million) to gain a fundamental understanding of gene network dynamics. This knowledge can now be democratized to a vast array of downstream tasks to accelerate discovery of key network regulators and candidate therapeutic targets.
Model Details
Model Architecture
- Geneformer-V1-10M: layers: 6, embedding dimensions: 256, attention heads: 4, input size: 2048
- Geneformer-V2-104M: layers: 12, embedding dimensions: 768, attention heads: 12, input size: 4096
- Geneformer-V2-316M: layers: 18, embedding dimensions: 1152, attention heads: 18, input size: 4096
Parameters
- Geneformer-V1-10M: 10 million
- Geneformer-V2-104M: 104 million
- Geneformer-V2-316M: 316 million
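For orientation, the sketch below shows how the Geneformer-V1-10M architecture values above could be expressed as a Hugging Face BertConfig; Geneformer checkpoints are distributed as BERT-style models, but the feed-forward width and vocabulary size used here are assumptions for illustration, not values from this card.

```python
# Illustrative mapping of the Geneformer-V1-10M hyperparameters listed above
# onto a Hugging Face BertConfig (not the official config file).
# intermediate_size and vocab_size are assumptions for the sketch only.
from transformers import BertConfig, BertForMaskedLM

config = BertConfig(
    num_hidden_layers=6,           # layers
    hidden_size=256,               # embedding dimensions
    num_attention_heads=4,         # attention heads
    max_position_embeddings=2048,  # input size (genes per rank value encoding)
    intermediate_size=512,         # assumed feed-forward width (not stated above)
    vocab_size=25426,              # assumed gene-token vocabulary size
)
model = BertForMaskedLM(config)
print(f"~{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
```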
Citation
Geneformer-V1: Theodoris, C.V., et al. (2023). Transfer learning enables predictions in network biology. Nature 618: 616-624. DOI: 10.1038/s41586-023-06139-9
Geneformer-V2: Chen, H., et al. (2024). Quantized multi-task learning for context-specific representations of gene network dynamics. bioRxiv 2024.08.16.608180. DOI: 10.1101/2024.08.16.608180
Model Card Authors
Christina V. Theodoris
Primary Contact Email
Christina V. Theodoris christina.theodoris@gladstone.ucsf.edu
System Requirements
GPU
Model Variants
| Model Variant Name | Description | Access URL |
|---|---|---|
| Geneformer-V1-10M | Original foundational Geneformer model pretrained on ~30M human single-cell transcriptomes (10 million parameters) | https://huggingface.co/ctheodoris/Geneformer |
| Geneformer-V2-104M | V2 foundational Geneformer model pretrained on ~104M human single-cell transcriptomes (104 million parameters) | https://huggingface.co/ctheodoris/Geneformer |
| Geneformer-V2-316M | V2 foundational Geneformer model pretrained on ~104M human single-cell transcriptomes (316 million parameters) | https://huggingface.co/ctheodoris/Geneformer |
| CELLxGENE multitask fine-tuned Geneformer-V2-104M | Geneformer-V2-104M fine-tuned with a multitask strategy on CELLxGENE attributes of cell type, cell subtype, tissue, disease, and developmental stage | https://huggingface.co/ctheodoris/Geneformer |
| Cancer continual learning domain-tuned Geneformer-V2-104M | Geneformer-V2-104M tuned to the cancer domain by domain-specific continual learning on ~14M human single-cell transcriptomes from cancer studies | https://huggingface.co/ctheodoris/Geneformer |
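All variants above are hosted in the same Hugging Face repository. A minimal loading sketch under that assumption is shown below; the `subfolder` value is a placeholder for the variant directory name listed in the repository.

```python
# Minimal sketch of loading a Geneformer checkpoint from the shared repository.
# The subfolder value is a placeholder; see the repository for the actual
# directory name of each variant.
from transformers import BertForMaskedLM

model = BertForMaskedLM.from_pretrained(
    "ctheodoris/Geneformer",
    # subfolder="<variant-directory>",  # uncomment to select a specific variant
)
```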
Intended Use
Primary Use Cases
Geneformer can be used directly with zero-shot learning, for example for in silico perturbation analysis, or by fine-tuning towards a relevant downstream task of interest, such as gene or cell state classification. Example applications demonstrated in our manuscript include the following (a minimal fine-tuning sketch follows the list):
- Zero-shot tasks:
- batch integration
- gene context specificity
- in silico reprogramming
- in silico differentiation
- in silico perturbation to determine impact on cell state
- in silico perturbation to determine transcription factor targets
- in silico perturbation to determine transcription factor cooperativity
- Fine-tuning tasks:
- transcription factor dosage sensitivity
- chromatin dynamics (bivalently marked promoters)
- transcription factor regulatory range
- gene network centrality
- transcription factor targets
- cell type annotation
- batch integration
- cell state classification across differentiation
- disease classification
- in silico perturbation to determine disease-driving genes
- in silico treatment to determine candidate therapeutic targets
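The Geneformer repository provides worked examples for these tasks. Purely as orientation, the sketch below shows how a fine-tuning task such as cell type annotation could be wired up with the standard Hugging Face Trainer, assuming cells have already been tokenized into equal-length rank value encodings; the dataset path and label count are placeholders, and this is not the official fine-tuning script.

```python
# Hypothetical fine-tuning sketch for cell type annotation (not the official
# fine-tuning script; see the Geneformer repository for worked examples).
# Assumes a Hugging Face DatasetDict with "train"/"test" splits of tokenized
# cells holding equal-length "input_ids", "attention_mask", and integer "label" columns.
from datasets import load_from_disk
from transformers import BertForSequenceClassification, Trainer, TrainingArguments

dataset = load_from_disk("tokenized_cells.dataset")  # placeholder path
model = BertForSequenceClassification.from_pretrained(
    "ctheodoris/Geneformer",
    num_labels=8,  # placeholder: number of cell type classes in your data
)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="geneformer_celltype", num_train_epochs=1),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
```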
Out-of-Scope or Unauthorized Use Cases
- Use that violates applicable laws, regulations (including trade compliance laws), or third party rights such as privacy or intellectual property rights.
- Replacement of diagnostic assessments.
- Any use that is prohibited by the Apache 2.0 license.
- Any use that is prohibited by the Acceptable Use Policy.
Training Details
Training Data
- Geneformer-V1: Genecorpus-30M, ~30 million human single-cell transcriptomes
- Geneformer-V2: Genecorpus-104M, ~104 million human single-cell transcriptomes
Genecorpus-30M and Genecorpus-104M were assembled in 2021 and 2024, respectively, from publicly available data representing a broad range of human tissues.
We balanced the data such that no single tissue comprised more than 25% of the corpus and performed scalable quality control filtering. We also deduplicated studies by DOI to preclude training on duplicated cells, which can significantly inflate the apparent corpus size when studies are deposited in multiple databases. The pretraining corpus excluded cells with high mutational burdens, such as malignant cells and immortalized cell lines, because gain-of-function variants in these cells may alter gene functions relative to what the model would learn from cells with low mutational burdens.

Each cell transcriptome was then presented to the model as a rank value encoding: a non-parametric representation of the transcriptome in which genes are ranked by their expression in that cell scaled by their expression across the entire pretraining corpus. The scaling factor deprioritizes ubiquitously highly expressed housekeeping genes by moving them to lower ranks. Conversely, genes such as transcription factors that may be expressed at low levels but have high power to distinguish cell state move to higher ranks within the encoding. The rank-based approach may also be more robust to technical artifacts that systematically bias absolute transcript counts, since the relative ranking of genes within each cell remains more stable.
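As an illustration only (the official tokenizer is provided in the Geneformer repository), the sketch below captures the rank value encoding described above, assuming a precomputed per-gene normalization factor summarizing expression across the corpus.

```python
# Illustrative sketch of the rank value encoding described above (the official
# tokenizer is provided in the Geneformer repository). `counts` holds one
# cell's raw counts per gene and `corpus_factor` a per-gene normalization
# factor summarizing expression across the pretraining corpus; both keyed by gene ID.
def rank_value_encode(counts: dict[str, float],
                      corpus_factor: dict[str, float],
                      max_genes: int = 2048) -> list[str]:
    total = sum(counts.values())
    # Normalize within the cell, then divide by the corpus-wide factor so
    # ubiquitously highly expressed genes drop to lower ranks.
    scaled = {g: (c / total) / corpus_factor[g]
              for g, c in counts.items()
              if c > 0 and g in corpus_factor}
    # The model sees genes in descending order of scaled value, truncated to
    # its input size.
    ranked = sorted(scaled, key=scaled.get, reverse=True)
    return ranked[:max_genes]
```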
Training Procedure
Geneformer pretraining is achieved with a self-supervised masked learning objective where the model learns to predict the identity of masked genes based on the context of unmasked genes within the gene network. During pretraining, 15% of the genes within each transcriptome are masked. The model thereby gains a generalizable understanding of gene network dynamics by observing how genes interact within a vast number of gene network states. Notably, predicting masked genes within a transcriptome is not the end goal of the model and is only employed as a generalizable pretraining objective.
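A minimal sketch of the masking step described above is shown below; it is not the official pretraining code, and the special-token IDs are placeholders for the values defined by the Geneformer token dictionary.

```python
# Minimal sketch of the 15% masked gene objective described above (placeholder
# special-token IDs; see the Geneformer repository for the official pretraining code).
import torch

PAD_ID, MASK_ID = 0, 1  # assumption: actual IDs come from the Geneformer token dictionary

def mask_genes(input_ids: torch.Tensor, mask_prob: float = 0.15):
    """Randomly mask ~15% of non-padding gene tokens and build MLM labels."""
    labels = input_ids.clone()
    masked = (torch.rand(input_ids.shape, device=input_ids.device) < mask_prob) & (input_ids != PAD_ID)
    labels[~masked] = -100  # loss is computed only at masked positions
    return input_ids.masked_fill(masked, MASK_ID), labels
```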
Training Code
All pretraining and fine-tuning code is available in the Geneformer repository on the Hugging Face Model Hub.
Training Hyperparameters
Geneformer is pretrained in full precision (fp32). Please see the Geneformer repository for other hyperparameters.
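In practice, full fp32 precision simply means leaving mixed precision disabled when configuring training; below is a sketch under that assumption, with all other values as placeholders.

```python
# Sketch: full-precision (fp32) training arguments; mixed precision stays off.
# Other hyperparameters are placeholders; see the Geneformer repository.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="geneformer_pretraining",  # placeholder
    fp16=False,  # no half precision
    bf16=False,  # no bfloat16
)
```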
Data Sources
Geneformer was trained using publicly available human single-cell transcriptomes from various sources. Genecorpus-30M and Genecorpus-104M were assembled in 2021 and 2024, respectively.
- Genecorpus-30M: ~30 million human single-cell transcriptomes used to train Geneformer-V1
- Genecorpus-104M: ~104 million human single-cell transcriptomes used to train Geneformer-V2
Performance Metrics
Metrics
With both zero-shot learning and fine-tuning with limited task-specific data, Geneformer consistently boosted predictive accuracy across a diverse panel of downstream tasks relevant to chromatin and network dynamics. Examples include distinguishing transcription factor dosage sensitivity, bivalent chromatin dynamics, transcription factor regulatory range, gene network centrality, and transcription factor targets, as well as in silico reprogramming/differentiation, in silico perturbation to identify disease-driving genes, and in silico treatment to identify candidate therapeutic targets (please see our manuscript and the Primary Use Cases section above for more information).

Importantly, Geneformer in silico perturbation led to the discovery of a novel transcription factor in cardiomyocytes that we experimentally validated to be critical to their ability to generate contractile force. Furthermore, Geneformer in silico treatment with limited patient data revealed candidate therapeutic targets for cardiomyopathy that we experimentally validated to significantly improve the ability of cardiomyocytes to generate contractile force in an induced pluripotent stem cell (iPSC) model of the disease.
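The in silico perturbation workflow is implemented in dedicated modules in the Geneformer repository; purely for intuition, the sketch below illustrates the underlying idea of deleting a gene from a cell's rank value encoding and measuring the shift of the resulting cell embedding (function and variable names are illustrative, not the repository API).

```python
# Conceptual sketch of in silico deletion (illustrative only; the official
# implementation lives in the Geneformer repository's perturbation modules).
import torch

def embedding_shift(model, input_ids: torch.Tensor, gene_token: int) -> float:
    """Cosine shift of the mean cell embedding after deleting one gene token."""
    def cell_embedding(ids: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            out = model(ids.unsqueeze(0), output_hidden_states=True)
        return out.hidden_states[-1].mean(dim=1).squeeze(0)  # mean over gene positions

    original = cell_embedding(input_ids)
    perturbed = cell_embedding(input_ids[input_ids != gene_token])  # gene deleted
    return 1 - torch.nn.functional.cosine_similarity(original, perturbed, dim=0).item()
```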
Evaluation Datasets
Please see our manuscript for the evaluation datasets used.
Evaluation Results
Please see our manuscript for evaluation results.
Biases, Risks and Limitations
Limitations
- The pretraining corpus excluded cells with high mutational burdens such as malignant cells and immortalized cell lines. We excluded these cells as the high mutational burden may involve gain of function variants that alter gene functions from what the model would interpret in other cells with low mutational burdens. As such, cancer-relevant predictions are best performed with the Geneformer version that has undergone domain-specific continual learning with ~14 million single-cell transcriptomes from cancer studies to tune the model to the cancer domain.
Caveats and Recommendations
- Review and validate outputs generated by the model.
- Fine-tuning with relevant data may improve predictions in settings that are relatively underrepresented in the pretraining data (e.g. cancer domain).
- We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when using the model.
Acknowledgements
We are grateful to be supported for this work by grants from the Helen Hay Whitney Foundation, National Institutes of Health (DP5OD036170), Burroughs Wellcome Fund Career Award for Medical Scientists (1022136.01), Biswas Foundation, Milken Institute, and National Science Foundation (GRF2034836).
If you have recommendations for this model card please contact virtualcellmodels@chanzuckerberg.com.