Geneformer

License
Apache 2.0
Developed By
  • Christina V. Theodoris (Geneformer-V1: Dana-Farber Cancer Institute; Broad Institute of MIT and Harvard; V2: Gladstone Institutes; University of California, San Francisco)
  • Han Chen (Geneformer-V2: Gladstone Institutes; University of California, San Francisco)

Geneformer is a foundational transformer model pretrained on a large-scale corpus of single-cell transcriptomes (initially ~30 million, now >100 million) to gain a fundamental understanding of gene network dynamics. This pretrained knowledge can now be transferred to a vast array of downstream tasks to accelerate discovery of key network regulators and candidate therapeutic targets.

Model Details

Model Architecture

  • Geneformer-V1-10M: layers: 6, embedding dimensions: 256, attention heads: 4, input size: 2048
  • Geneformer-V2-104M: layers: 12, embedding dimensions: 768, attention heads: 12, input size: 4096
  • Geneformer-V2-316M: layers: 18, embedding dimensions: 1152, attention heads: 18, input size: 4096

Parameters

  • Geneformer-V1-10M: 10 million
  • Geneformer-V2-104M: 104 million
  • Geneformer-V2-316M: 316 million
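
As a quick check of the specifications listed above, the sketch below loads a checkpoint with Hugging Face transformers and prints its configuration and parameter count. It assumes the checkpoint is reachable at the repository path listed under Model Variants; the exact per-variant layout on the Hub may differ.

```python
from transformers import AutoConfig, AutoModelForMaskedLM

# Repository path taken from the Model Variants list; adjust if a specific
# variant lives in a subfolder of the repository.
repo = "ctheodoris/Geneformer"

config = AutoConfig.from_pretrained(repo)
print("layers:", config.num_hidden_layers)
print("embedding dimensions:", config.hidden_size)
print("attention heads:", config.num_attention_heads)
print("input size:", config.max_position_embeddings)

model = AutoModelForMaskedLM.from_pretrained(repo)
n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params / 1e6:.0f}M")
```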

Citation

Geneformer-V1: Theodoris, C.V., et al. (2023). Transfer learning enables predictions in network biology. Nature 618: 616-624. DOI: 10.1038/s41586-023-06139-9

Geneformer-V2: Chen, H., et al. (2024). Quantized multi-task learning for context-specific representations of gene network dynamics. bioRxiv 2024.08.16.608180 DOI: 10.1101/2024.08.16.608180

Model Card Authors

Christina V. Theodoris

Primary Contact Email

Christina V. Theodoris christina.theodoris@gladstone.ucsf.edu

System Requirements

GPU

Model Variants

  • Geneformer-V1-10M: Original foundational Geneformer model pretrained on ~30M human single-cell transcriptomes (10 million parameters). Access URL: https://huggingface.co/ctheodoris/Geneformer
  • Geneformer-V2-104M: V2 foundational Geneformer model pretrained on ~104M human single-cell transcriptomes (104 million parameters). Access URL: https://huggingface.co/ctheodoris/Geneformer
  • Geneformer-V2-316M: V2 foundational Geneformer model pretrained on ~104M human single-cell transcriptomes (316 million parameters). Access URL: https://huggingface.co/ctheodoris/Geneformer
  • CELLxGENE multitask fine-tuned Geneformer-V2-104M: Geneformer-V2-104M fine-tuned with a multitask strategy on the CELLxGENE attributes of cell type, cell subtype, tissue, disease, and developmental stage. Access URL: https://huggingface.co/ctheodoris/Geneformer
  • Cancer continual learning domain-tuned Geneformer-V2-104M: Geneformer-V2-104M tuned to the cancer domain through domain-specific continual learning on ~14M human single-cell transcriptomes from cancer studies. Access URL: https://huggingface.co/ctheodoris/Geneformer

Intended Use

Primary Use Cases

Geneformer can be used directly with zero-shot learning, for example for in silico perturbation analysis, or by fine-tuning toward a relevant downstream task of interest, such as gene or cell state classification (a minimal fine-tuning sketch follows the list below). Example applications demonstrated in our manuscript include:

  • Zero-shot tasks:
    • batch integration
    • gene context specificity
    • in silico reprogramming
    • in silico differentiation
    • in silico perturbation to determine impact on cell state
    • in silico perturbation to determine transcription factor targets
    • in silico perturbation to determine transcription factor cooperativity
  • Fine-tuning tasks:
    • transcription factor dosage sensitivity
    • chromatin dynamics (bivalently marked promoters)
    • transcription factor regulatory range
    • gene network centrality
    • transcription factor targets
    • cell type annotation
    • batch integration
    • cell state classification across differentiation
    • disease classification
    • in silico perturbation to determine disease-driving genes
    • in silico treatment to determine candidate therapeutic targets
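
As referenced above, here is a minimal fine-tuning sketch for a cell state classification task using the standard transformers Trainer API. The synthetic dataset, label count, and hyperparameters are hypothetical placeholders; in practice, inputs would be prepared with the tokenization utilities provided in the Geneformer repository, which also includes its own fine-tuning examples.

```python
import numpy as np
from datasets import Dataset
from transformers import BertForSequenceClassification, Trainer, TrainingArguments

num_cell_states = 8  # hypothetical number of cell state classes for the task

# Tiny synthetic stand-in for a tokenized dataset of rank-value-encoded cells
# (hypothetical gene token IDs and labels, for illustration only).
def toy_dataset(n_cells=16, seq_len=64):
    rng = np.random.default_rng(0)
    return Dataset.from_dict({
        "input_ids": rng.integers(10, 20000, size=(n_cells, seq_len)).tolist(),
        "attention_mask": np.ones((n_cells, seq_len), dtype=int).tolist(),
        "label": rng.integers(0, num_cell_states, size=n_cells).tolist(),
    })

model = BertForSequenceClassification.from_pretrained(
    "ctheodoris/Geneformer",  # base checkpoint; choose the variant suited to the task
    num_labels=num_cell_states,
)

args = TrainingArguments(
    output_dir="geneformer_cell_state_classifier",
    per_device_train_batch_size=12,
    learning_rate=5e-5,
    num_train_epochs=1,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=toy_dataset(),
    eval_dataset=toy_dataset(),
)
trainer.train()
```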

Out-of-Scope or Unauthorized Use Cases

  • Use that violates applicable laws, regulations (including trade compliance laws), or third party rights such as privacy or intellectual property rights.
  • Replacement of diagnostic assessments.
  • Any use that is prohibited by the Apache 2.0 license.
  • Any use that is prohibited by the Acceptable Use Policy.

Training Details

Training Data

  • Geneformer-V1: Genecorpus-30M, ~30 million human single-cell transcriptomes
  • Geneformer-V2: Genecorpus-104M, ~104 million human single-cell transcriptomes

Genecorpus-30M and Genecorpus-104M were assembled in 2021 and 2024, respectively, from publicly available data representing a broad range of human tissues.

We balanced the data such that no tissue composed more than 25% of the corpus and performed scalable quality control filtering. We also deduplicated studies by DOI to preclude training on duplicated cells, since studies deposited in multiple databases can otherwise significantly inflate the apparent corpus size. The pretraining corpus also excluded cells with high mutational burdens, such as malignant cells and immortalized cell lines, because the high mutational burden may involve gain-of-function variants that alter gene functions from what the model would interpret in other cells with low mutational burdens.

Each cell transcriptome was then presented to the model as a rank value encoding: a non-parametric representation of the transcriptome in which genes are ranked by their expression in that cell scaled by their expression across the entire pretraining corpus. The scaling factor deprioritizes ubiquitously highly expressed housekeeping genes by moving them to a lower rank. Conversely, genes such as transcription factors that may be expressed at low levels when present but that have high power to distinguish cell state move to a higher rank within the encoding. Furthermore, the rank-based approach may be more robust to technical artifacts that systematically bias absolute transcript counts, whereas the overall relative ranking of genes within each cell remains more stable.
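
A minimal sketch of the rank value encoding idea, assuming precomputed per-gene corpus-wide normalization factors (the exact normalization used in the released tokenizer may differ):

```python
import numpy as np

def rank_value_encode(cell_counts, corpus_gene_factors, gene_ids, max_len=2048):
    """Return gene_ids ordered by corpus-scaled expression for one cell.

    cell_counts: 1D array of raw counts for one cell (one entry per gene).
    corpus_gene_factors: 1D array of per-gene normalization factors computed
        across the pretraining corpus (hypothetical precomputed values).
    gene_ids: 1D array of gene identifiers aligned with cell_counts.
    """
    expressed = cell_counts > 0
    # Scale each expressed gene by its corpus-wide factor; ubiquitously highly
    # expressed housekeeping genes get large factors and therefore lower ranks.
    scaled = cell_counts[expressed] / corpus_gene_factors[expressed]
    order = np.argsort(scaled)[::-1]          # highest scaled value first
    return gene_ids[expressed][order][:max_len]

# Toy example with three genes (values are illustrative only).
counts = np.array([500.0, 3.0, 0.0])          # gene A is a housekeeping gene
factors = np.array([400.0, 1.0, 2.0])         # corpus-wide normalization factors
genes = np.array(["GENE_A", "GENE_B", "GENE_C"])
print(rank_value_encode(counts, factors, genes))  # GENE_B ranks above GENE_A
```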

Training Procedure

Geneformer pretraining is achieved with a self-supervised masked learning objective where the model learns to predict the identity of masked genes based on the context of unmasked genes within the gene network. During pretraining, 15% of the genes within each transcriptome are masked. The model thereby gains a generalizable understanding of gene network dynamics by observing how genes interact within a vast number of gene network states. Notably, predicting masked genes within a transcriptome is not the end goal of the model and is only employed as a generalizable pretraining objective.
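
A minimal sketch of the masking step, assuming hypothetical gene token IDs; the actual pretraining collator in the Geneformer repository may differ in detail (for example, in how masked positions are replaced):

```python
import torch

MASK_TOKEN_ID = 1        # hypothetical special-token ID
MLM_PROBABILITY = 0.15   # fraction of genes masked per transcriptome

def mask_gene_tokens(input_ids: torch.Tensor):
    labels = input_ids.clone()
    mask = torch.rand(input_ids.shape) < MLM_PROBABILITY
    labels[~mask] = -100                 # ignored by the cross-entropy loss
    masked_inputs = input_ids.clone()
    masked_inputs[mask] = MASK_TOKEN_ID  # replace selected genes with the mask token
    return masked_inputs, labels

# Toy batch of two cells, each encoded as a sequence of gene token IDs.
batch = torch.randint(low=10, high=20000, size=(2, 8))
masked, labels = mask_gene_tokens(batch)
print(masked)
print(labels)
```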

Training Code

All pretraining and fine-tuning code is available in the Geneformer repository on the Hugging Face Model Hub.

Training Hyperparameters

Geneformer is pretrained in full precision (fp32). Please see the Geneformer repository for other hyperparameters.

Data Sources

Geneformer was trained using publicly available human single-cell transcriptomes from various sources. Genecorpus-30M and Genecorpus-104M were assembled in 2021 and 2024, respectively.

  • Genecorpus-30M: ~30 million human single-cell transcriptomes used to train Geneformer-V1
  • Genecorpus-104M: ~104 million human single-cell transcriptomes used to train Geneformer-V2

Performance Metrics

Metrics

With both zero-shot learning and fine-tuning with limited task-specific data, Geneformer consistently boosted predictive accuracy in a diverse panel of downstream tasks relevant to chromatin and network dynamics. Examples include distinguishing transcription factor dosage sensitivity, bivalent chromatin dynamics, transcription factor regulatory range, gene network centrality, transcription factor targets, in silico reprogramming/differentiation, in silico perturbation to identify disease-driving genes, and in silico treatment to identify candidate therapeutic targets (please see our manuscript and the Primary Use Cases section above for more information).

Importantly, Geneformer in silico perturbation led to the discovery of a novel transcription factor in cardiomyocytes that we experimentally validated to be critical to their ability to generate contractile force. Furthermore, Geneformer in silico treatment with limited patient data revealed candidate therapeutic targets for cardiomyopathy that we experimentally validated to significantly improve the ability of cardiomyocytes to generate contractile force in an induced pluripotent stem cell (iPSC) model of the disease.
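
To illustrate the in silico perturbation idea in code, the sketch below deletes a gene's token from a cell's rank value encoding, re-embeds the cell, and measures the embedding shift. The Geneformer repository provides a dedicated in silico perturbation module that should be used in practice; this is only a conceptual sketch with hypothetical token IDs and a simple mean-pooled cell embedding.

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("ctheodoris/Geneformer")
model.eval()

def cell_embedding(input_ids: torch.Tensor) -> torch.Tensor:
    # Mean-pool the final hidden states to obtain a single cell embedding.
    with torch.no_grad():
        hidden = model(input_ids=input_ids).last_hidden_state
    return hidden.mean(dim=1).squeeze(0)

cell = torch.randint(low=10, high=20000, size=(1, 128))  # hypothetical encoded cell
target_gene_token = int(cell[0, 5])                      # hypothetical gene to delete

perturbed = cell[:, cell[0] != target_gene_token]        # drop the gene's token
shift = 1 - torch.nn.functional.cosine_similarity(
    cell_embedding(cell), cell_embedding(perturbed), dim=0
)
print(f"embedding shift after deleting the gene: {shift.item():.4f}")
```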

Evaluation Datasets

Please see our manuscript for the evaluation datasets used.

Evaluation Results

Please see our manuscript for evaluation results.

Evaluation Metrics URL

Please see the manuscripts cited above for evaluation metrics across the tasks described in the Primary Use Cases section.

Biases, Risks and Limitations

Limitations

  • The pretraining corpus excluded cells with high mutational burdens, such as malignant cells and immortalized cell lines, because the high mutational burden may involve gain-of-function variants that alter gene functions from what the model would interpret in other cells with low mutational burdens. As such, cancer-relevant predictions are best performed with the Geneformer version that has undergone domain-specific continual learning on ~14 million single-cell transcriptomes from cancer studies to tune the model to the cancer domain.

Caveats and Recommendations

  • Review and validate outputs generated by the model.
  • Fine-tuning with relevant data may improve predictions in settings that are relatively underrepresented in the pretraining data (e.g. cancer domain).
  • We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when using the model.

Acknowledgements

We are grateful to be supported for this work by grants from the Helen Hay Whitney Foundation, National Institutes of Health (DP5OD036170), Burroughs Wellcome Fund Career Award for Medical Scientists (1022136.01), Biswas Foundation, Milken Institute, and National Science Foundation (GRF2034836).

If you have recommendations for this model card please contact virtualcellmodels@chanzuckerberg.com.