Universal Cell Embeddings (UCE)

Version v1.0 released 29 Nov 2023

Developed By

  • Yanay Rosen
  • Yusuf Roohani
  • Stephen Quake
  • Jure Leskovec

Stanford University

UCE is a large transformer model for generating cell embeddings. The model was trained on a corpus of cell atlas data from human and seven other species in a completely self-supervised manner, without any data annotations. To represent a cell, UCE samples the genes that the cell expresses, with replacement and weighted by their expression. Genes are tokenized using protein embeddings from the ESM2 protein language model, which enables cross-species embedding, including for species not seen during training.
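As an illustration, the expression-weighted sampling step can be sketched as follows. This is a minimal sketch, not the actual implementation; the function name, sample size, and toy counts are our own.

```python
import numpy as np

def sample_cell_genes(expression, n_samples=1024, rng=None):
    """Sample gene indices for one cell, with replacement,
    with probability proportional to each gene's expression."""
    rng = np.random.default_rng(rng)
    expression = np.asarray(expression, dtype=float)
    probs = expression / expression.sum()
    return rng.choice(len(expression), size=n_samples, replace=True, p=probs)

counts = np.array([0.0, 5.0, 1.0, 0.0, 4.0])    # toy expression vector
sampled = sample_cell_genes(counts, n_samples=100, rng=0)
# Zero-expression genes are never drawn; highly expressed genes
# appear more often in the sampled token sequence.
```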

Model Details

Model Architecture

33 layer transformer

Parameters

650 million

Citation

Rosen, Y. et al. (2023) Universal Cell Embeddings: A Foundation Model for Cell Biology bioRxiv 2023.11.28.568918; DOI: 10.1101/2023.11.28.568918.

Model Card Authors

Yanay Rosen

Primary Contact Email

Yanay Rosen yanay@stanford.edu

To submit feature requests or report issues with the model, please open an issue on the GitHub repository.

System Requirements

GPU

Intended Use

Primary Use Cases

  • scRNA-seq cell embedding

Out-of-Scope or Unauthorized Use Cases

Do not use the model for the following purposes:

  • Use that violates applicable laws, regulations (including trade compliance laws), or third party rights such as privacy or intellectual property rights.
  • Any use that is prohibited by the MIT License.
  • Any use that is prohibited by the Acceptable Use Policy.

Training Details

Training Data

UCE was pretrained on a large corpus of 36 million single-cell transcriptomes from 8 species (human, mouse, zebrafish, mouse lemur, crab-eating macaque, rhesus macaque, tropical clawed frog, pig) and dozens of tissues. The majority of the human and mouse data (>33 million cells) comes from CZ CELLxGENE. UCE was trained using masked binary prediction, in which the model distinguishes genes that were truly expressed in a cell from genes with expression equal to 0.
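The masked binary prediction objective can be sketched as a binary cross-entropy over candidate genes. This is a simplified numpy stand-in, not the training code; the logits and labels below are illustrative.

```python
import numpy as np

def binary_prediction_loss(logits, expressed):
    """Mean binary cross-entropy: the model's scores should separate
    truly expressed genes (label 1) from zero-expression genes (label 0)."""
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    labels = np.asarray(expressed, dtype=float)
    eps = 1e-12  # guard against log(0)
    return -np.mean(labels * np.log(probs + eps)
                    + (1 - labels) * np.log(1 - probs + eps))

logits = np.array([2.0, -1.5, 0.3, -2.0])   # toy model scores for 4 genes
expressed = np.array([1, 0, 1, 0])          # 1 = expressed, 0 = zero count
loss = binary_prediction_loss(logits, expressed)
```

A well-trained model assigns high scores to expressed genes and low scores to zero-count genes, driving this loss toward zero.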

Training Procedure

Datasets were subset to protein-coding genes with available ESM2 embeddings. For CELLxGENE datasets, no additional gene subsetting was done. For other datasets, the top 8,000 highly variable genes were selected using the Seurat v3 formula implemented in Scanpy.
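A highly simplified stand-in for this HVG step is sketched below. The actual pipeline uses Scanpy's `sc.pp.highly_variable_genes(adata, n_top_genes=8000, flavor="seurat_v3")`, which fits a mean-variance trend; this toy version simply ranks genes by raw variance.

```python
import numpy as np

def top_variable_genes(X, n_top=8000):
    """Return indices of the n_top most variable genes (columns of X).
    Simplified: ranks by raw variance instead of the Seurat v3 trend fit."""
    variances = np.var(np.asarray(X, dtype=float), axis=0)
    order = np.argsort(variances)[::-1]          # most variable first
    return order[: min(n_top, X.shape[1])]

X = np.array([[1, 10, 1],
              [1,  0, 1],
              [1, 10, 1],
              [1,  0, 1]])                       # toy cells x genes matrix
top_idx = top_variable_genes(X, n_top=2)         # gene 1 varies most
```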

Speeds, Sizes, Times

The model processes about 3,000 cells per minute at batch size 128 on an A100 GPU in bf16 mode using Hugging Face Accelerate.

Training Hyperparameters

bf16 mixed precision with Hugging Face Accelerate. A detailed list of the hyperparameters used can be found in Supplementary Table 2 of the preprint (download the Supplemental Materials).

Data Sources

The majority of the training data was downloaded from the CZ CELLxGENE Census API; the remaining datasets were downloaded from other sources. See the preprint's Extended Data Table 2 for a detailed list of data sources. Datasets from CZ CELLxGENE were filtered to remove cells expressing fewer than 200 genes and genes detected in fewer than 10 cells. Data downloaded from other sources was filtered to 8,000 genes using the Seurat v3 highly variable genes method implemented in Scanpy.
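The cell and gene filters can be sketched in pure numpy. The real pipeline applies the equivalent Scanpy calls (`sc.pp.filter_cells(adata, min_genes=200)` then `sc.pp.filter_genes(adata, min_cells=10)`); the toy matrix and thresholds below are scaled down for illustration.

```python
import numpy as np

def filter_counts(X, min_genes_per_cell, min_cells_per_gene):
    """Drop low-coverage cells (rows), then rarely detected genes (columns)."""
    X = np.asarray(X)
    cells = (X > 0).sum(axis=1) >= min_genes_per_cell   # genes detected per cell
    X = X[cells]
    genes = (X > 0).sum(axis=0) >= min_cells_per_gene   # cells detecting each gene
    return X[:, genes]

X = np.array([[3, 0, 1],
              [0, 0, 2],
              [5, 2, 0]])                  # toy cells x genes count matrix
filtered = filter_counts(X, min_genes_per_cell=2, min_cells_per_gene=2)
# Cell 1 is dropped (only 1 gene detected); genes 1 and 2 are then
# dropped (each detected in only 1 remaining cell).
```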

Performance Metrics

Metrics

UCE was benchmarked on a number of held-out datasets, including the then-unreleased version 2 of Tabula Sapiens. UCE outperformed models such as scGPT and Geneformer in a zero-shot setting using the standard cell embedding benchmark available in the Single-Cell Integration Benchmark. Across the methods compared, UCE substantially outperforms the next best method, Geneformer, by 13.9% on overall score, 16.2% on biological conservation score, and 10.1% on batch correction score.

To comprehensively assess the value of zero-shot embeddings, we also compared UCE to fine-tuned methods that are conventionally used for this task. Notably, UCE even performed slightly better than methods that require dataset-specific training: scVI and scArches. Tabula Sapiens v2 includes cells measured with both droplet- and plate-based sequencing methods, and correcting technology-based batch effects can be difficult even for fine-tuned models. On the Tabula Sapiens v2 Ovary tissue, which contains 45,757 cells profiled using 10x-primev3 and 3,610 cells profiled using Smart-seq3, UCE zero-shot embeddings correct batch effects as well as fine-tuned methods do, while representing cell types more accurately. When scored using the single-cell integration benchmark (SCIB), UCE's batch correction scores are close to those of scVI and scArches, while its biological conservation scores are higher.

For each cell type in Tabula Sapiens v2, we calculated the silhouette width score of every zero-shot embedding method. UCE has the highest silhouette score of any method for 67% of cell types, and it outperformed Geneformer on 80% of cell types, tGPT on 73%, and scGPT on 83%.
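The per-cell-type silhouette comparison can be sketched with scikit-learn's `silhouette_samples`. The embeddings and labels below are random toy data, not the benchmark itself.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

def per_type_silhouette(embedding, labels):
    """Average silhouette width per cell type; higher means the type
    forms a tighter, better-separated cluster in the embedding."""
    widths = silhouette_samples(embedding, labels)
    return {t: widths[labels == t].mean() for t in np.unique(labels)}

rng = np.random.default_rng(0)
labels = np.repeat(np.array(["T cell", "B cell"]), 50)
emb = rng.normal(size=(100, 8))
emb[labels == "B cell"] += 3.0   # make the two toy types separable
scores = per_type_silhouette(emb, labels)
```

Computing such per-type scores for each method's embedding, then counting how often each method has the highest score, yields the comparison described above.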

Evaluation Datasets and Results

See the preprint for details.

Biases, Risks, and Limitations

Potential Biases

  • The model may reflect biases present in the training data.

Risks

Areas of risk may include but are not limited to:

  • Inaccurate outputs or hallucinations.
  • Potential misuse for incorrect biological interpretations.

Limitations

  • UCE may not be able to accurately represent cells from species evolutionarily distant from the training data, such as fly.

Caveats and Recommendations

  • Review and validate outputs generated by the model.
  • UCE is not intended to be fine-tuned. When comparing the performance of different methods, please evaluate UCE in the zero-shot setting.
  • We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when using the model.

Acknowledgements

We thank Rok Sosič, Kexin Huang, Charlotte Bunne, Hanchen Wang, Michihiro Yasunaga, Michael Moor, Minkai Xu, Mika Jain, George Crowley, Maria Brbić, Jonah Cool, Nicholas Sofroniew, Andrew Tolopko, Ivana Jelic, Ana-Maria Istrate and Pablo Garcia-Nieto for discussions and for providing feedback on our manuscript. We acknowledge support from Robert C. Jones for help with accessing and analyzing the Tabula Sapiens v2 dataset. We acknowledge support from the Chan Zuckerberg Initiative, including help with accessing and processing CELLxGENE datasets. We gratefully acknowledge the support of NSF under Nos. OAC-1835598 (CINES), CCF-1918940 (Expeditions), Stanford Data Science Initiative, Wu Tsai Neurosciences Institute, Amazon, Genentech, GSK, Hitachi, Juniper Networks, and KDDI. Y. Roohani acknowledges funding support from GlaxoSmithKline.

If you have recommendations for this model card please contact virtualcellmodels@chanzuckerberg.com.