TranscriptFormer

Version v0.1.0 released 30 Apr 2025

Developed By

James D Pearce, Sara E Simmonds, Gita Mahmoudabadi, Lakshmi Krishnan, Giovanni Palla, Ana-Maria Istrate, Alexander Tarashansky, Benjamin Nelson, Omar Valenzuela, Donghui Li, Stephen R Quake, Theofanis Karaletsos (Chan Zuckerberg Initiative)

TranscriptFormer is a family of generative models representing a cross-species generative cell atlas trained on up to 112 million cells spanning 1.53 billion years of evolution across 12 species. TranscriptFormer is designed to learn rich, context-aware representations of single-cell transcriptomes while jointly modeling genes and transcripts using a novel generative architecture. TranscriptFormer demonstrates robust zero-shot performance for cell type classification across species, disease state identification in human cells, and prediction of cell type-specific transcription factors and gene-gene regulatory relationships in humans.

Read our deep dive on TranscriptFormer in the preprint “A Cross-Species Generative Cell Atlas Across 1.5 Billion Years of Evolution: The TranscriptFormer Single-cell Model.”

Model Details

Model Architecture

  • Transformer encoder with 12 layers, 16 attention heads, hidden dimension of 2048
  • Approx. 300 million parameters in the transformer layers
  • Input embeddings derived from ESM-2 for each gene, plus an assay token to capture sequencing platform metadata
  • Expression-aware attention: expression counts are introduced as a log-count bias term in the attention matrix, avoiding explicit token duplication (see the sketch after this list)
  • Autoregressive generative modeling for both gene identities and their counts
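
To make the expression-aware attention concrete, below is a minimal sketch of one plausible reading of the bullet above: raw counts enter the attention logits as an additive log1p(count) bias on the key axis, so genes never need to be duplicated per transcript. All names and shapes are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def expression_aware_attention(q, k, v, counts, mask=None):
    """Scaled dot-product attention with a log-count bias (sketch).

    q, k, v: (batch, heads, genes, head_dim); counts: (batch, genes);
    mask: (batch, genes), True for real gene slots.
    """
    d = q.size(-1)
    logits = q @ k.transpose(-2, -1) / d**0.5   # (B, H, G, G) attention logits
    bias = torch.log1p(counts.float())          # log(1 + count), (B, G)
    logits = logits + bias[:, None, None, :]    # bias every key by its count
    if mask is not None:
        logits = logits.masked_fill(~mask[:, None, None, :], float("-inf"))
    return F.softmax(logits, dim=-1) @ v
```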

Parameters

368-542 million trainable parameters, depending on the model variant (see Model Variants below)

Citation

Pearce, J. D., et al. (2025). A Cross-Species Generative Cell Atlas Across 1.5 Billion Years of Evolution: The TranscriptFormer Single-cell Model. bioRxiv. DOI: 10.1101/2025.04.25.650731

Primary Contact Email

virtualcellmodels@chanzuckerberg.com

System Requirements

  • GPU (A100 40GB recommended) for efficient inference and embedding extraction.
  • A GPU with less VRAM (e.g., 16GB) can also be used by reducing the inference batch size to 2, as in the sketch below.
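
A generic sketch of low-VRAM embedding extraction. This is not the TranscriptFormer API; the model and inputs below are stand-ins, and the only point is that batch size is the main lever for fitting inference into ~16GB of VRAM.

```python
import torch

# Hypothetical stand-ins for the real checkpoint and tokenized inputs:
model = torch.nn.Sequential(torch.nn.Linear(2048, 2048))
cells = torch.randn(10, 2048)

batch_size = 2  # small enough for a ~16GB GPU; raise it on an A100 40GB
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

embeddings = []
with torch.inference_mode():  # no autograd state, lower peak memory
    for start in range(0, len(cells), batch_size):
        batch = cells[start:start + batch_size].to(device)
        embeddings.append(model(batch).cpu())  # offload results to free VRAM
embeddings = torch.cat(embeddings)
```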

Model Variants

| Model Variant Name | Task | Access URL |
| --- | --- | --- |
| TF-Metazoa | Trained on 112 million cells spanning all twelve species: six vertebrates (human, mouse, rabbit, chicken, African clawed frog, zebrafish), four invertebrates (sea urchin, C. elegans, fruit fly, freshwater sponge), a fungus (yeast), and a protist (malaria parasite). 444 million trainable parameters and 633 million non-trainable parameters (from frozen pretrained embeddings); vocabulary size 247,388. Best generalization across diverse species. | |
| TF-Exemplar | Trained on 110 million cells from human and four model organisms: mouse (M. musculus), zebrafish (D. rerio), fruit fly (D. melanogaster), and C. elegans. 542 million trainable parameters and 282 million non-trainable parameters; vocabulary size 110,290. Strong multi-species performance, but smaller species coverage. | |
| TF-Sapiens | Trained on 57 million human-only cells. 368 million trainable parameters and 61 million non-trainable parameters; vocabulary size 23,829. Trained for human-only tasks. | |

Intended Use

Primary Use Cases

  • Multi-species cell embeddings
  • Multi-species cell type classification
  • Disease state identification from single-cell human transcriptomes
  • Inferring gene-gene regulatory relationships
  • Inferring cell type-specific transcription factor interactions

Out-of-Scope or Unauthorized Use Cases

Do not use the model for the following purposes:

  • Use that violates applicable laws, regulations (including trade compliance laws), or third party rights such as privacy or intellectual property rights.
  • Any use that is prohibited by the MIT License.
  • Any use that is prohibited by the Acceptable Use Policy.

Training Details

Training Data

Species:

  • Vertebrates: Homo sapiens, Mus musculus, Oryctolagus cuniculus, Gallus gallus, Xenopus laevis, Danio rerio
  • Invertebrates: Lytechinus variegatus, Caenorhabditis elegans, Drosophila melanogaster, Spongilla lacustris
  • Fungus: Saccharomyces cerevisiae
  • Protist: Plasmodium falciparum

Training Procedure

  • Shuffling: Randomly permute expressed genes in each batch to remove positional bias.
  • Context length: Up to 2,047 genes per cell plus one assay token; unused slots are masked (see the sketch after this list).
  • Sampling: Up-weight low-resource species to balance against human/mouse.
  • Optimizer: AdamW, linear warm-up → cosine decay; global batch ≈ 4-5 M tokens.
  • Precision: Mixed fp16/bf16 on GPUs.
  • Infrastructure: H100 cluster with DDP.
  • Tokens processed: ~3.5 T (~20 epochs).
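
The shuffling and context-construction bullets above can be summarized in a short sketch. Token IDs, names, and conventions here are illustrative assumptions, not the released training code.

```python
import torch

MAX_GENES = 2047              # per-cell gene budget
ASSAY_TOKEN, PAD_TOKEN = 1, 0  # illustrative special-token IDs

def build_cell_context(gene_ids: torch.Tensor, counts: torch.Tensor):
    """Permute expressed genes, truncate, prepend an assay token, pad + mask."""
    perm = torch.randperm(gene_ids.numel())            # remove positional bias
    gene_ids, counts = gene_ids[perm][:MAX_GENES], counts[perm][:MAX_GENES]
    n = gene_ids.numel() + 1                           # +1 for the assay token
    tokens = torch.full((MAX_GENES + 1,), PAD_TOKEN, dtype=torch.long)
    tokens[0], tokens[1:n] = ASSAY_TOKEN, gene_ids
    token_counts = torch.zeros(MAX_GENES + 1)
    token_counts[1:n] = counts.float()
    mask = torch.zeros(MAX_GENES + 1, dtype=torch.bool)
    mask[:n] = True                                    # unused slots stay masked
    return tokens, token_counts, mask
```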

Speeds, Sizes, Times

  • Total training tokens: ~3.5 trillion
  • Trained for up to 15 epochs
  • Final checkpoint size: ~3GB (model weights)

Training Hyperparameters

  • AdamW, maximum learning rate ~5.5e-5, warm-up over ~10% of steps, then cosine decay (sketched below)
  • Mixed-precision floating point (fp16/bf16)
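
For reference, a minimal sketch of the named schedule (linear warm-up to the ~5.5e-5 peak, then cosine decay); the step counts here are illustrative, not the actual training configuration.

```python
import math
import torch

total_steps, warmup_steps, peak_lr = 10_000, 1_000, 5.5e-5  # illustrative

params = [torch.nn.Parameter(torch.zeros(1))]  # stand-in for model parameters
optimizer = torch.optim.AdamW(params, lr=peak_lr)

def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        return step / max(1, warmup_steps)                  # linear warm-up
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))       # cosine decay to 0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```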

Data Sources

Pretraining datasets are publicly available for all twelve training species; a complete summary can be found in this supplementary table.

Performance Metrics

Metrics

  • Macro F1 for multi-class cell type classification (robust to class imbalance); see the example after this list.
  • Cross-species generalization measured by macro F1 across species at varying evolutionary distances from human.
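
As a concrete illustration of why macro averaging is robust to class imbalance, here is a small scikit-learn example with invented labels:

```python
from sklearn.metrics import f1_score

# Macro F1 averages per-class F1 with equal weight, so a rare cell type
# influences the score as much as an abundant one. Labels are invented.
y_true = ["T cell", "T cell", "B cell", "macrophage", "B cell"]
y_pred = ["T cell", "B cell", "B cell", "macrophage", "B cell"]
print(f1_score(y_true, y_pred, average="macro"))  # ≈ 0.82
```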

Evaluation Datasets

  • Tabula Sapiens v2 (TSv2): A human scRNA-seq reference cell atlas (Quake et al. 2024). We downloaded single-cell RNA sequencing data from the Tabula Sapiens Consortium via CZ CELLxGENE. To ensure independence from our training data, we filtered the “All Cells” dataset to include nine donor IDs absent from Tabula Sapiens v1 (TSP17 to TSP30). We excluded pancreas data due to insufficient cell counts (<100 cells).
    • Homo sapiens Tabula Sapiens 2.0 (subset of Tabula Sapiens)
  • Cell atlas datasets: Zebrafish Danio rerio (GSE130487), Mouse lemur Microcebus murinus (Tabula Microcebus), Sea lamprey brain Petromyzon marinus (E-MTAB-11087), Stony coral Stylophora pistillata (GSE166901), Tropical clawed frog Xenopus tropicalis (GSE113074)
  • Spermatogenesis dataset: A multi-species snRNA-seq dataset of testes from the major lineages of mammals and birds (Murat et al. 2023). Data from nine species (H. sapiens, Gorilla gorilla, G. gallus, Callithrix jacchus, Macaca mulatta, Monodelphis domestica, M. musculus, Ornithorhynchus anatinus, Pan troglodytes) were downloaded from two sources (ArrayExpress and Bgee). Data were aligned and merged to create a dataset with raw counts and harmonized metadata.
  • COVID-19 human lung dataset for infected vs. uninfected classification (Wu et al. 2024).

Evaluation Results

  • Tabula Sapiens v2: TF-Metazoa and TF-Exemplar achieve macro F1 scores of up to ~0.91, outperforming prior baselines (e.g., UCE, scGPT).
  • Cross-species: Maintains F1 > 0.7 on species separated from human by roughly 600+ million years (e.g., stony coral).
  • COVID-19: F1 of ~0.85-0.86 in distinguishing infected vs. healthy lung cells, compared with a baseline of ~0.80.

Biases, Risks, and Limitations

Potential Biases

  • Biological sampling bias: Overrepresentation of well-studied tissues and species (human, mouse) might skew performance on underrepresented organisms or rare cell types.
  • Demographic underrepresentation: Human data may reflect limited genetic ancestries or specific disease contexts.

Risks

Areas of risk may include but are not limited to:

  • Inaccurate outputs (“hallucinations”): The model may produce biologically implausible gene combinations if used for generative tasks with insufficient context.
  • Misinterpretation: Predictions for disease states or perturbations should not be used as a substitute for experimental validation.
  • Cross-species pitfalls: While the model generalizes across species, extremely distant species or unusual tissues may yield lower accuracy.

Limitations

  • Does not handle spatial transcriptomics or multi-modal data (ATAC-seq, proteomics) directly.
  • Large GPU memory required for large-batch or high-throughput inference.

Caveats and Recommendations

  • Review and validate outputs generated by the model.
  • We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when using the model.

Should you have any security or privacy issues or questions related to this model, please reach out to our team at security@chanzuckerberg.com or privacy@chanzuckerberg.com, respectively.

Acknowledgements

We thank the many public scRNA-seq consortia and data repositories (e.g., CZ CELLxGENE, GEO) for making single-cell data freely available.

Special thanks to the authors of ESM-2 for enabling robust protein-based gene embeddings.

Gratitude to the collaborating labs and HPC teams that facilitated large-scale model training.

For further suggestions, inquiries, or recommendations for this model card, please contact virtualcellmodels@chanzuckerberg.com.