AIDO.Cell
Version 1.0, released 04 Dec 2024
- Nicholas Ho (GenBio AI)
- Elijah Cole (GenBio AI)
AIDO.Cell-100M is GenBio AI’s state-of-the-art (SOTA) cellular foundation model, trained on 50 million cells spanning a diverse set of human tissues and organs. It is part of the AIDO.Cell family of scalable transformer-based models. The AIDO.Cell models are capable of handling the entire human transcriptome as input, thus learning accurate and general representations of the human cell's entire transcriptional context. AIDO.Cell-100M demonstrates SOTA performance in tasks such as zero-shot clustering, cell type classification, and perturbation modeling.
Model Details
Model Architecture
AIDO.Cell encodes continuous gene expression values with an auto-discretization strategy and uses a bidirectional transformer encoder as its backbone. To learn semantically meaningful representations, it employs a BERT-style encoder-only dense transformer architecture, with minor updates to align with current best practices, including SwiGLU and layer normalization. Model architecture parameters are listed below.
- Layers: 18
- Hidden Size: 650
- Heads: 20
- Intermediate Hidden Size: 1664
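As a rough sanity check, the listed hyperparameters are consistent with the stated 100M parameter count. The sketch below is an estimate only: the vocabulary size (~19,264 genes, the gene set used by scFoundation) is an assumption not stated in this card, and biases, layer norms, and output heads are ignored.

```python
# Rough parameter-count estimate from the architecture hyperparameters above.
# Assumptions (not stated in this card): a vocabulary of ~19,264 gene tokens;
# SwiGLU feed-forward layers with three weight matrices; biases and
# normalization parameters omitted for simplicity.

def estimate_params(layers=18, hidden=650, intermediate=1664, vocab=19264):
    attn = 4 * hidden * hidden         # Q, K, V, and output projections
    ffn = 3 * hidden * intermediate    # SwiGLU uses three weight matrices
    per_layer = attn + ffn
    embeddings = vocab * hidden        # gene/token embedding table
    return layers * per_layer + embeddings

total = estimate_params()
print(f"~{total / 1e6:.0f}M parameters")  # lands near the stated 100M
```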
Parameters
100 million
Citation
Ho, N., et al. (2024). Scaling Dense Representations for Single Cell with Transcriptome-Scale Context. bioRxiv 2024.11.28.625303; DOI: 10.1101/2024.11.28.625303
Model Card Authors
Caleb Ellington (GenBio AI)
Primary Contact Email
Nicholas Ho nicholas.ho@genbio.ai, Elijah Cole elijah.cole@genbio.ai
System Requirements
AIDO.Cell-100M can be fully fine-tuned on a GPU with 80 GB of VRAM at batch size 4. The GPU must be Ampere-generation or later to support FlashAttention (e.g., A100, H100).
Model Variants
To use the Hugging Face weights listed below, the developers provide the AIDO.ModelGenerator Python package as a plug-and-play framework for using AIDO.Cell models. ModelGenerator automatically interfaces with Hugging Face and allows easy one-command embedding and adaptation of the models for a wide variety of fine-tuning tasks.
| Model Variant Name | Description | Access URL |
|---|---|---|
| 3M | The smallest pretrained variant in the AIDO.Cell series. Best for rapid prototyping. | |
| 10M | The second pretrained variant in the AIDO.Cell series. More powerful than the 3M variant, with lower hardware requirements than the 100M variant. | |
| 100M | The largest available pretrained variant in the AIDO.Cell series. Best for clustering, cell-type classification, and perturbation modeling. | |
Intended Use
Primary Use Cases
- Embedding: AIDO.Cell produces rich embeddings of cells and genes which can be used for downstream predictive tasks.
- Batch Integration: The AIDO.Cell embeddings can be used to integrate multiple scRNA-seq datasets.
- Contextualized Embeddings for Genetic Perturbation Modeling: The AIDO.Cell embeddings can contextualize cellular expression profiles for more accurate perturbation response prediction.
- In-silico Perturbations: Each gene's expression can be individually manipulated and passed through AIDO.Cell to understand conditional dependencies between genes and predict counterfactual expression profiles.
- Cell Type Classification: The AIDO.Cell embeddings can be used to classify cell types.
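The in-silico perturbation workflow above can be sketched as a simple before/after comparison. This is an illustrative toy only: `model_fn` is a hypothetical stand-in for an AIDO.Cell forward pass that reconstructs an expression profile, not the real API.

```python
# Illustrative sketch of an in-silico knockout: silence one gene's expression
# and compare model outputs before and after. `model_fn` is a hypothetical
# placeholder for an AIDO.Cell reconstruction pass; it is NOT the real API.

def model_fn(expr):
    # Placeholder "reconstruction": identity, for demonstration only.
    return list(expr)

def in_silico_knockout(expr, gene_idx):
    perturbed = list(expr)
    perturbed[gene_idx] = 0.0            # knock out the target gene
    baseline_out = model_fn(expr)
    perturbed_out = model_fn(perturbed)
    # Per-gene shift in predicted expression attributable to the knockout.
    return [p - b for p, b in zip(perturbed_out, baseline_out)]

delta = in_silico_knockout([2.0, 0.5, 1.3, 0.0], gene_idx=0)
```

With a real model, nonzero entries of `delta` at other genes would indicate conditional dependencies on the knocked-out gene.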
Out-of-Scope or Unauthorized Use Cases
- Any use that is prohibited by the GenBio AI Community License Agreement. The license permits non-commercial and academic use with proper attribution.
- Any use that is prohibited by the Acceptable Use Policy.
Training Details
Training Data
AIDO.Cell was pretrained on a diverse dataset of 50 million cells representing over 100 tissue types. The training data was previously curated by the scFoundation team and included datasets from the Gene Expression Omnibus (GEO), Deeply Integrated human Single-Cell Omics data (DISCO), the human Ensemble Cell Atlas (hECA), the Single Cell Portal, and more. After preprocessing and quality control, the training dataset contained 50 million cells, or 963 billion gene tokens in total. The dataset was partitioned to set aside 100,000 cells as the validation set.
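A quick consistency check on these figures: dividing the total token count by the number of cells recovers roughly the size of the full human transcriptome (~19k genes), consistent with the transcriptome-scale context described above.

```python
# Back-of-envelope check: 963 billion gene tokens over 50 million cells
# implies roughly one token per gene of the full human transcriptome.
total_tokens = 963e9
n_cells = 50e6
tokens_per_cell = total_tokens / n_cells
print(f"{tokens_per_cell:,.0f} gene tokens per cell")
```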
Training Procedure
AIDO.Cell models were trained with bfloat-16 precision to optimize memory and speed. Pre-training of AIDO.Cell-100M took place on 256 H100 GPUs over three days.
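For reference, the stated hardware and duration imply the following back-of-envelope compute budget (wall-clock GPU-hours only; this is simple arithmetic, not a figure reported in the card).

```python
# GPU-hours implied by the stated pretraining setup: 256 H100s for 3 days.
gpus = 256
hours = 3 * 24
gpu_hours = gpus * hours
print(gpu_hours)  # 18432
```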
Training Hyperparameters
- Pretraining Hyperparameters:
- Cosine learning rate schedule with a linear warm-up over the first 5% of 150,000 total iterations
- Max Learning rate of 3e-4 for the 100M model
- AdamW optimizer with Beta1=0.9 and Beta2=1e-2
- Weight decay of 1e-2
- Models were trained with bfloat-16 precision
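The stated schedule can be sketched as follows. The decay target of zero at the final step is an assumption; the card does not state a minimum learning rate.

```python
import math

# Sketch of the stated schedule: linear warm-up over the first 5% of
# 150,000 steps to a peak LR of 3e-4, then cosine decay.
# Assumption: the LR decays to zero at the final step (not stated above).

def lr_at(step, total_steps=150_000, peak_lr=3e-4, warmup_frac=0.05):
    warmup_steps = int(total_steps * warmup_frac)  # 7,500 steps
    if step < warmup_steps:
        return peak_lr * step / warmup_steps       # linear ramp up
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

For example, `lr_at(0)` is 0, `lr_at(7_500)` hits the 3e-4 peak, and the rate then decays along a half-cosine for the remaining 142,500 steps.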
Data Sources
Training data was collected from the curated dataset reported by the scFoundation team. For dataset details, see Supplementary Data 1 and Supplementary Data 2 of the scFoundation paper.
Performance Metrics
Metrics
The performance of AIDO.Cell-100M for cell type classification was evaluated using the F1 macro score.
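The F1 macro score averages per-class F1 with equal weight per class, so rare cell types count as much as abundant ones. A minimal from-scratch sketch (in practice one would use scikit-learn's `f1_score` with `average="macro"`):

```python
# Minimal macro-F1: compute F1 per class, then average with equal weight.

def f1_macro(y_true, y_pred):
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

# Toy example with hypothetical cell-type labels:
y_true = ["B", "B", "T", "T", "NK"]
y_pred = ["B", "T", "T", "T", "NK"]
score = f1_macro(y_true, y_pred)
```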
Evaluation Datasets
Model performance was evaluated using two distinct datasets: Zheng68K, a widely used peripheral blood mononuclear cell (PBMC) dataset for benchmarking cell annotation tasks, and Segerstolpe, a smaller pancreas dataset known to be difficult to annotate. The same splits reported in the scFoundation paper were used for cell type classification. The processed datasets for cell type classification can be obtained from the Hugging Face repository.
Evaluation Results
| Model | Zheng68K (F1 Macro) | Segerstolpe (F1 Macro) |
|---|---|---|
| AIDO.Cell (100M) | 0.761 | 0.910 |
| scFoundation (100M) | 0.736 | 0.914 |
| scBERT | 0.67 | 0.67 |
| CellTypist | 0.725 | 0.812 |
| scANVI | 0.395 | 0.521 |
| ACTINN | 0.649 | 0.722 |
| Scanpy | 0.547 | 0.54 |
| SingleCellNet | 0.598 | 0.806 |
Evaluation Metrics URL
For more evaluations on different tasks that use the AIDO.Cell embeddings (e.g., perturbation prediction), please refer to the AIDO.Cell preprint.
Biases, Risks and Limitations
Potential Biases
The model may reflect biases present in the pretraining data, particularly biases arising from undersampled or underrepresented populations, cell types, or ethnicities.
Risks
Areas of risk may include but are not limited to:
- AIDO.Cell has been tested on key benchmarks, but these only cover a small portion of potential use cases. In untested cases, such as rare cell types, the model may make less accurate predictions, or provide less informative embeddings.
Limitations
- The model may make less informative embeddings when applied to cell types or tissues that are underrepresented in the pretraining data.
- The model is trained on human data, and may provide less informative embeddings on different species.
- The model may struggle with less common sequencing technologies.
- The model may struggle to make predictions on poor quality scRNA-seq data.
Caveats and Recommendations
- Review and validate model outputs using cross-fold validation for uncertainty quantification and evaluation on held-out populations.
- Users are encouraged to validate model predictions against independent datasets to help reduce bias and improve accuracy.
- It is recommended to complement model use with expert biological knowledge, especially when dealing with novel or underrepresented data.
- We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when using the model.
Acknowledgements
This work was possible thanks to all the developers: Nicholas Ho, Elijah Cole, Caleb Ellington, Jinyu Hou, Sohan Addagudi, Shentong Mo, Tianhua Tao, Dian Li, Yonghao Zhuang, Hongyi Wang, Xingyi Cheng, Le Song, and Eric P. Xing. The developers gratefully acknowledge the innumerable authors who contributed data to the NIH GEO, DISCO, hECA, CZ CELLxGENE, and Single Cell Portal repositories, and these organizations for hosting the data. Thanks to the Chan Zuckerberg Initiative for hosting model resources provided through this platform.
If you have recommendations for this model card please contact virtualcellmodels@chanzuckerberg.com.