TranscriptFormer

Version v0.1.0 released 30 Apr 2025

Developed By

James D Pearce, Sara E Simmonds, Gita Mahmoudabadi, Lakshmi Krishnan, Giovanni Palla, Ana-Maria Istrate, Alexander Tarashansky, Benjamin Nelson, Omar Valenzuela, Donghui Li, Stephen R Quake, Theofanis Karaletsos (Chan Zuckerberg Initiative)

TranscriptFormer is a family of generative models representing a cross-species generative cell atlas trained on up to 112 million cells spanning 1.53 billion years of evolution across 12 species. TranscriptFormer is designed to learn rich, context-aware representations of single-cell transcriptomes while jointly modeling genes and transcripts using a novel generative architecture. TranscriptFormer demonstrates robust zero-shot performance for cell type classification across species, disease state identification in human cells, and prediction of cell type-specific transcription factors and gene-gene regulatory relationships in humans.

Read our deep dive on TranscriptFormer in the preprint “A Cross-Species Generative Cell Atlas Across 1.5 Billion Years of Evolution: The TranscriptFormer Single-cell Model.”

Model Details

Model Architecture

  • Transformer encoder with 12 layers, 16 attention heads, hidden dimension of 2048
  • Approx. 300 million parameters in the transformer layers
  • Input embeddings derived from ESM-2 for each gene, plus an assay token to capture sequencing platform metadata
  • Expression-aware attention: expression counts are introduced as a log-count bias term in the attention matrix, avoiding explicit token duplication (see the sketch after this list)
  • Autoregressive generative modeling for both gene identities and their counts
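
To make the expression-aware attention concrete, below is a minimal sketch of one plausible reading of the bullet above: raw counts enter the attention logits as an additive log1p(count) bias on the key axis, so genes never need to be duplicated per transcript. All names and shapes are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def expression_aware_attention(q, k, v, counts, mask=None):
    """Scaled dot-product attention with a log-count bias (sketch).

    q, k, v: (batch, heads, genes, head_dim); counts: (batch, genes);
    mask: (batch, genes), True for real gene slots.
    """
    d = q.size(-1)
    logits = q @ k.transpose(-2, -1) / d**0.5   # (B, H, G, G) attention logits
    bias = torch.log1p(counts.float())          # log(1 + count), (B, G)
    logits = logits + bias[:, None, None, :]    # bias every key by its count
    if mask is not None:
        logits = logits.masked_fill(~mask[:, None, None, :], float("-inf"))
    return F.softmax(logits, dim=-1) @ v
```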

Parameters

368-542 million trainable parameters, depending on the model variant (see Model Variants below)

Citation

Pearce, J. D., et al. (2025). A Cross-Species Generative Cell Atlas Across 1.5 Billion Years of Evolution: The TranscriptFormer Single-cell Model. bioRxiv. DOI: 10.1101/2025.04.25.650731

Primary Contact Email

virtualcellmodels@chanzuckerberg.com

System Requirements

  • GPU (A100 40GB recommended) for efficient inference and embedding extraction.
  • A GPU with less VRAM (e.g., 16GB) can also be used by reducing the inference batch size to 2, as in the sketch below.
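
A generic sketch of low-VRAM embedding extraction. This is not the TranscriptFormer API; the model and inputs below are stand-ins, and the only point is that batch size is the main lever for fitting inference into ~16GB of VRAM.

```python
import torch

# Hypothetical stand-ins for the real checkpoint and tokenized inputs:
model = torch.nn.Sequential(torch.nn.Linear(2048, 2048))
cells = torch.randn(10, 2048)

batch_size = 2  # small enough for a ~16GB GPU; raise it on an A100 40GB
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

embeddings = []
with torch.inference_mode():  # no autograd state, lower peak memory
    for start in range(0, len(cells), batch_size):
        batch = cells[start:start + batch_size].to(device)
        embeddings.append(model(batch).cpu())  # offload results to free VRAM
embeddings = torch.cat(embeddings)
```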

Model Variants

| Model Variant Name | Task | Access URL |
| --- | --- | --- |
| TF-Metazoa | Trained on 112 million cells spanning all twelve species: six vertebrates (human, mouse, rabbit, chicken, African clawed frog, zebrafish), four invertebrates (sea urchin, C. elegans, fruit fly, freshwater sponge), a fungus (yeast), and a protist (malaria parasite). 444 million trainable parameters and 633 million non-trainable parameters (from frozen pretrained embeddings); vocabulary size 247,388. Best generalization across diverse species. | |
| TF-Exemplar | Trained on 110 million cells from human and four model organisms: mouse (M. musculus), zebrafish (D. rerio), fruit fly (D. melanogaster), and C. elegans. 542 million trainable parameters and 282 million non-trainable parameters; vocabulary size 110,290. Strong multi-species performance, but smaller species coverage. | |
| TF-Sapiens | Trained on 57 million human-only cells. 368 million trainable parameters and 61 million non-trainable parameters; vocabulary size 23,829. Trained for human-only tasks. | |

Intended Use

Primary Use Cases

  • Multi-species cell embeddings
  • Multi-species cell type classification
  • Disease state identification from single-cell human transcriptomes
  • Inferring gene-gene regulatory relationships
  • Inferring cell type-specific transcription factor interactions

Out-of-Scope or Unauthorized Use Cases

Do not use the model for the following purposes:

  • Use that violates applicable laws, regulations (including trade compliance laws), or third party rights such as privacy or intellectual property rights.
  • Any use that is prohibited by the MIT License.
  • Any use that is prohibited by the Acceptable Use Policy.

Training Details

Training Data

Species:

  • Vertebrates: Homo sapiens, Mus musculus, Oryctolagus cuniculus, Gallus gallus, Xenopus laevis, Danio rerio
  • Invertebrates: Lytechinus variegatus, Caenorhabditis elegans, Drosophila melanogaster, Spongilla lacustris
  • Fungus: Saccharomyces cerevisiae
  • Protist: Plasmodium falciparum

Training Procedure

  • Shuffling: Randomly permute expressed genes in each batch to remove positional bias.
  • Context length: Up to 2,047 genes per cell plus one assay token; unused slots are masked (see the sketch after this list).
  • Sampling: Up-weight low-resource species to balance against human/mouse.
  • Optimizer: AdamW, linear warm-up → cosine decay; global batch ≈ 4-5 M tokens.
  • Precision: Mixed fp16/bf16 on GPUs.
  • Infrastructure: H100 cluster with DDP.
  • Tokens processed: ~3.5 T (~20 epochs).
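
The shuffling and context-construction bullets above can be summarized in a short sketch. Token IDs, names, and conventions here are illustrative assumptions, not the released training code.

```python
import torch

MAX_GENES = 2047              # per-cell gene budget
ASSAY_TOKEN, PAD_TOKEN = 1, 0  # illustrative special-token IDs

def build_cell_context(gene_ids: torch.Tensor, counts: torch.Tensor):
    """Permute expressed genes, truncate, prepend an assay token, pad + mask."""
    perm = torch.randperm(gene_ids.numel())            # remove positional bias
    gene_ids, counts = gene_ids[perm][:MAX_GENES], counts[perm][:MAX_GENES]
    n = gene_ids.numel() + 1                           # +1 for the assay token
    tokens = torch.full((MAX_GENES + 1,), PAD_TOKEN, dtype=torch.long)
    tokens[0], tokens[1:n] = ASSAY_TOKEN, gene_ids
    token_counts = torch.zeros(MAX_GENES + 1)
    token_counts[1:n] = counts.float()
    mask = torch.zeros(MAX_GENES + 1, dtype=torch.bool)
    mask[:n] = True                                    # unused slots stay masked
    return tokens, token_counts, mask
```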

Speeds, Sizes, Times

  • Total training tokens: ~3.5 trillion
  • Trained for up to 15 epochs
  • Final checkpoint size: ~3GB (model weights)

Training Hyperparameters

  • AdamW, maximum learning rate ~5.5e-5, warm-up over ~10% of steps, then cosine decay (sketched below)
  • Mixed-precision floating point (fp16/bf16)
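
For reference, a minimal sketch of the named schedule (linear warm-up to the ~5.5e-5 peak, then cosine decay); the step counts here are illustrative, not the actual training configuration.

```python
import math
import torch

total_steps, warmup_steps, peak_lr = 10_000, 1_000, 5.5e-5  # illustrative

params = [torch.nn.Parameter(torch.zeros(1))]  # stand-in for model parameters
optimizer = torch.optim.AdamW(params, lr=peak_lr)

def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        return step / max(1, warmup_steps)                  # linear warm-up
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))       # cosine decay to 0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```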

Data Sources

Pretraining datasets are publicly available for all twelve training species; a complete summary can be found in this supplementary table.

Performance Metrics

Metrics

  • Macro F1 for multi-class cell type classification (robust to class imbalance); see the example after this list.
  • Cross-species generalization measured by macro F1 across species at varying evolutionary distances from human.
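
As a concrete illustration of why macro averaging is robust to class imbalance, here is a small scikit-learn example with invented labels:

```python
from sklearn.metrics import f1_score

# Macro F1 averages per-class F1 with equal weight, so a rare cell type
# influences the score as much as an abundant one. Labels are invented.
y_true = ["T cell", "T cell", "B cell", "macrophage", "B cell"]
y_pred = ["T cell", "B cell", "B cell", "macrophage", "B cell"]
print(f1_score(y_true, y_pred, average="macro"))  # ≈ 0.82
```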

Evaluation Datasets

  • Tabula Sapiens v2 (TSv2): A human scRNA-seq reference cell atlas (Quake et al. 2024). We downloaded single-cell RNA sequencing data from the Tabula Sapiens Consortium via CZ CELLxGENE. To ensure independence from our training data, we filtered the “All Cells” dataset to include nine donor IDs absent from Tabula Sapiens v1 (TSP17 to TSP30). We excluded pancreas data due to insufficient cell counts (<100 cells).
    • Homo sapiens Tabula Sapiens 2.0 (subset of Tabula Sapiens)
  • Cell atlas datasets: Zebrafish Danio rerio (GSE130487), Mouse lemur Microcebus murinus (Tabula Microcebus), Sea lamprey brain Petromyzon marinus (E-MTAB-11087), Stony coral Stylophora pistillata (GSE166901), Tropical clawed frog Xenopus tropicalis (GSE113074)
  • Spermatogenesis dataset: A multi-species snRNA-seq dataset of testes from the major lineages of mammals and birds (Murat et al. 2023). Data from nine species (H. sapiens, Gorilla gorilla, G. gallus, Callithrix jacchus, Macaca mulatta, Monodelphis domestica, M. musculus, Ornithorhynchus anatinus, Pan troglodytes) were downloaded from two sources (ArrayExpress and Bgee). Data were aligned and merged to create a dataset with raw counts and harmonized metadata.
  • COVID-19 human lung dataset for infected vs. uninfected classification (Wu et al. 2024).

Evaluation Results

  • Tabula Sapiens v2: TF-Metazoa and TF-Exemplar achieve macro F1 scores of up to ~0.91, outperforming prior baselines (e.g., UCE, scGPT).
  • Cross-species: Maintains F1 > 0.7 on species separated from human by roughly 600+ million years (e.g., stony coral).
  • COVID-19: F1 of ~0.85-0.86 in distinguishing infected vs. healthy lung cells, compared with a baseline of ~0.80.

Biases, Risks, and Limitations

Potential Biases

  • Biological sampling bias: Overrepresentation of well-studied tissues and species (human, mouse) might skew performance on underrepresented organisms or rare cell types.
  • Demographic underrepresentation: Human data may reflect limited genetic ancestries or specific disease contexts.

Risks

Areas of risk may include but are not limited to:

  • Inaccurate outputs (“hallucinations”): The model may produce biologically implausible gene combinations if used for generative tasks with insufficient context.
  • Misinterpretation: Predictions for disease states or perturbations should not be used as a substitute for experimental validation.
  • Cross-species pitfalls: While the model generalizes across species, extremely distant species or unusual tissues may yield lower accuracy.

Limitations

  • Does not handle spatial transcriptomics or multi-modal data (ATAC-seq, proteomics) directly.
  • Large GPU memory required for large-batch or high-throughput inference.

Caveats and Recommendations

  • Review and validate outputs generated by the model.
  • We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when using the model.

Should you have any security or privacy issues or questions related to this model, please reach out to our team at security@chanzuckerberg.com or privacy@chanzuckerberg.com, respectively.

Acknowledgements

We thank the many public scRNA-seq consortia and data repositories (e.g., CZ CELLxGENE, GEO) for making single-cell data freely available.

Special thanks to the authors of ESM-2 for enabling robust protein-based gene embeddings.

Gratitude to the collaborating labs and HPC teams that facilitated large-scale model training.

For further suggestions, inquiries, or recommendations for this model card, please contact virtualcellmodels@chanzuckerberg.com.