AIDO.Cell
Version 1.0, released 04 Dec 2024
- Nicholas Ho (GenBio AI)
- Elijah Cole (GenBio AI)
AIDO.Cell-100M is GenBio AI’s state-of-the-art (SOTA) cellular foundation model, trained on 50 million cells spanning a diverse set of human tissues and organs. It is part of the AIDO.Cell family of scalable transformer-based models. The AIDO.Cell models are capable of handling the entire human transcriptome as input, thus learning accurate and general representations of the human cell's entire transcriptional context. AIDO.Cell-100M demonstrates SOTA performance in tasks such as zero-shot clustering, cell type classification, and perturbation modeling.
Model Details
Model Architecture
AIDO.Cell encodes continuous gene expression values with an auto-discretization strategy and uses a bidirectional transformer encoder as its backbone. To learn semantically meaningful representations, it employs a BERT-style encoder-only dense transformer architecture, with minor updates to align with current best practices, including SwiGLU and layer normalization. Model architecture parameters are listed below.
- Layers: 18
- Hidden Size: 650
- Heads: 20
- Intermediate Hidden Size: 1664
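As a rough sanity check, the listed hyperparameters are consistent with the stated 100M parameter count. The sketch below is an estimate only: the vocabulary size (~19,264 genes, the gene set used by scFoundation) is an assumption not stated in this card, and biases, layer norms, and output heads are ignored.

```python
# Rough parameter-count estimate from the architecture hyperparameters above.
# Assumptions (not stated in this card): a vocabulary of ~19,264 gene tokens;
# SwiGLU feed-forward layers with three weight matrices; biases and
# normalization parameters omitted for simplicity.

def estimate_params(layers=18, hidden=650, intermediate=1664, vocab=19264):
    attn = 4 * hidden * hidden         # Q, K, V, and output projections
    ffn = 3 * hidden * intermediate    # SwiGLU uses three weight matrices
    per_layer = attn + ffn
    embeddings = vocab * hidden        # gene/token embedding table
    return layers * per_layer + embeddings

total = estimate_params()
print(f"~{total / 1e6:.0f}M parameters")  # lands near the stated 100M
```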
Parameters
100 million
Citation
Ho, N., et al. (2024). Scaling Dense Representations for Single Cell with Transcriptome-Scale Context. bioRxiv 2024.11.28.625303; DOI: 10.1101/2024.11.28.625303
Model Card Authors
Caleb Ellington (GenBio AI)
Primary Contact Email
Nicholas Ho nicholas.ho@genbio.ai, Elijah Cole elijah.cole@genbio.ai
System Requirements
AIDO.Cell-100M can be fully fine-tuned on a GPU with 80 GB of VRAM at batch size 4. The GPU must be Ampere-generation or later to support FlashAttention (e.g., A100, H100).
Model Variants
To use the Hugging Face weights listed below, the developers provide the AIDO.ModelGenerator Python package as a plug-and-play framework for using AIDO.Cell models. ModelGenerator automatically interfaces with Hugging Face and allows easy one-command embedding and adaptation of the models for a wide variety of fine-tuning tasks.
| Model Variant Name | Description | Access URL |
|---|---|---|
| 3M | The smallest pretrained variant in the AIDO.Cell series. Best for rapid prototyping. | |
| 10M | The second pretrained variant in the AIDO.Cell series. More powerful than the 3M variant, with lower hardware requirements than the 100M variant. | |
| 100M | The largest available pretrained variant in the AIDO.Cell series. Best for clustering, cell-type classification, and perturbation modeling. | |
Intended Use
Primary Use Cases
- Embedding: AIDO.Cell produces rich embeddings of cells and genes which can be used for downstream predictive tasks.
- Batch Integration: The AIDO.Cell embeddings can be used to integrate multiple scRNA-seq datasets.
- Contextualized Embeddings for Genetic Perturbation Modeling: The AIDO.Cell embeddings can contextualize cellular expression profiles for more accurate perturbation response prediction.
- In-silico Perturbations: Each gene's expression can be individually manipulated and passed through AIDO.Cell to understand conditional dependencies between genes and predict counterfactual expression profiles.
- Cell Type Classification: The AIDO.Cell embeddings can be used to classify cell types.
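The in-silico perturbation workflow above can be sketched as a simple before/after comparison. This is an illustrative toy only: `model_fn` is a hypothetical stand-in for an AIDO.Cell forward pass that reconstructs an expression profile, not the real API.

```python
# Illustrative sketch of an in-silico knockout: silence one gene's expression
# and compare model outputs before and after. `model_fn` is a hypothetical
# placeholder for an AIDO.Cell reconstruction pass; it is NOT the real API.

def model_fn(expr):
    # Placeholder "reconstruction": identity, for demonstration only.
    return list(expr)

def in_silico_knockout(expr, gene_idx):
    perturbed = list(expr)
    perturbed[gene_idx] = 0.0            # knock out the target gene
    baseline_out = model_fn(expr)
    perturbed_out = model_fn(perturbed)
    # Per-gene shift in predicted expression attributable to the knockout.
    return [p - b for p, b in zip(perturbed_out, baseline_out)]

delta = in_silico_knockout([2.0, 0.5, 1.3, 0.0], gene_idx=0)
```

With a real model, nonzero entries of `delta` at other genes would indicate conditional dependencies on the knocked-out gene.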
Out-of-Scope or Unauthorized Use Cases
- Any use that is prohibited by the GenBio AI Community License Agreement. The license permits non-commercial and academic use with proper attribution.
- Any use that is prohibited by the Acceptable Use Policy.
Training Details
Training Data
AIDO.Cell was pretrained on a diverse dataset of 50 million cells representing over 100 tissue types. The training data was previously curated by the scFoundation team and included datasets from the Gene Expression Omnibus (GEO), Deeply Integrated human Single-Cell Omics data (DISCO), the human Ensemble Cell Atlas (hECA), the Single Cell Portal, and more. After preprocessing and quality control, the training dataset contained 50 million cells, or 963 billion gene tokens in total. The dataset was partitioned to set aside 100,000 cells as the validation set.
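A quick consistency check on these figures: dividing the total token count by the number of cells recovers roughly the size of the full human transcriptome (~19k genes), consistent with the transcriptome-scale context described above.

```python
# Back-of-envelope check: 963 billion gene tokens over 50 million cells
# implies roughly one token per gene of the full human transcriptome.
total_tokens = 963e9
n_cells = 50e6
tokens_per_cell = total_tokens / n_cells
print(f"{tokens_per_cell:,.0f} gene tokens per cell")
```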
Training Procedure
AIDO.Cell models were trained with bfloat-16 precision to optimize memory and speed. Pre-training of AIDO.Cell-100M took place on 256 H100 GPUs over three days.
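For reference, the stated hardware and duration imply the following back-of-envelope compute budget (wall-clock GPU-hours only; this is simple arithmetic, not a figure reported in the card).

```python
# GPU-hours implied by the stated pretraining setup: 256 H100s for 3 days.
gpus = 256
hours = 3 * 24
gpu_hours = gpus * hours
print(gpu_hours)  # 18432
```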
Training Hyperparameters
- Pretraining Hyperparameters:
- Cosine learning rate schedule with a linear warm-up over the first 5% of 150,000 total iterations
- Max Learning rate of 3e-4 for the 100M model
- AdamW optimizer with Beta1=0.9 and Beta2=1e-2
- Weight decay of 1e-2
- Models were trained with bfloat-16 precision
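The stated schedule can be sketched as follows. The decay target of zero at the final step is an assumption; the card does not state a minimum learning rate.

```python
import math

# Sketch of the stated schedule: linear warm-up over the first 5% of
# 150,000 steps to a peak LR of 3e-4, then cosine decay.
# Assumption: the LR decays to zero at the final step (not stated above).

def lr_at(step, total_steps=150_000, peak_lr=3e-4, warmup_frac=0.05):
    warmup_steps = int(total_steps * warmup_frac)  # 7,500 steps
    if step < warmup_steps:
        return peak_lr * step / warmup_steps       # linear ramp up
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

For example, `lr_at(0)` is 0, `lr_at(7_500)` hits the 3e-4 peak, and the rate then decays along a half-cosine for the remaining 142,500 steps.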
Data Sources
Training data was collected from the curated dataset reported by the scFoundation team. For dataset details, see Supplementary Data 1 and Supplementary Data 2 of the scFoundation paper.
Performance Metrics
Metrics
The performance of AIDO.Cell-100M for cell type classification was evaluated using the F1 macro score.
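The F1 macro score averages per-class F1 with equal weight per class, so rare cell types count as much as abundant ones. A minimal from-scratch sketch (in practice one would use scikit-learn's `f1_score` with `average="macro"`):

```python
# Minimal macro-F1: compute F1 per class, then average with equal weight.

def f1_macro(y_true, y_pred):
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

# Toy example with hypothetical cell-type labels:
y_true = ["B", "B", "T", "T", "NK"]
y_pred = ["B", "T", "T", "T", "NK"]
score = f1_macro(y_true, y_pred)
```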
Evaluation Datasets
Model performance was evaluated using two distinct datasets: Zheng68K, a widely used peripheral blood mononuclear cell (PBMC) dataset for benchmarking cell annotation tasks, and Segerstolpe, a smaller pancreas dataset known to be difficult to annotate. The same splits reported in the scFoundation paper were used for cell type classification. The processed datasets for cell type classification can be obtained from the Hugging Face repository.
Evaluation Results
| Model | Zheng68K (F1 Macro) | Segerstolpe (F1 Macro) |
|---|---|---|
| AIDO.Cell (100M) | 0.761 | 0.910 |
| scFoundation (100M) | 0.736 | 0.914 |
| scBERT | 0.67 | 0.67 |
| CellTypist | 0.725 | 0.812 |
| scANVI | 0.395 | 0.521 |
| ACTINN | 0.649 | 0.722 |
| Scanpy | 0.547 | 0.54 |
| SingleCellNet | 0.598 | 0.806 |
Evaluation Metrics URL
For more evaluations on different tasks that use the AIDO.Cell embeddings (e.g., perturbation prediction), please refer to the AIDO.Cell preprint.
Biases, Risks and Limitations
Potential Biases
The model may reflect biases present in the pretraining data, particularly biases arising from undersampled or underrepresented populations, cell types, or ethnicities.
Risks
Areas of risk may include but are not limited to:
- AIDO.Cell has been tested on key benchmarks, but these only cover a small portion of potential use cases. In untested cases, such as rare cell types, the model may make less accurate predictions, or provide less informative embeddings.
Limitations
- The model may make less informative embeddings when applied to cell types or tissues that are underrepresented in the pretraining data.
- The model is trained on human data, and may provide less informative embeddings on different species.
- The model may struggle with less common sequencing technologies.
- The model may struggle to make predictions on poor quality scRNA-seq data.
Caveats and Recommendations
- Review and validate model outputs using cross-fold validation for uncertainty quantification and evaluation on held-out populations.
- Users are encouraged to validate model predictions against independent datasets to help reduce bias and improve accuracy.
- It is recommended to complement model use with expert biological knowledge, especially when dealing with novel or underrepresented data.
- We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when using the model.
Acknowledgements
This work was possible thanks to all the developers: Nicholas Ho, Elijah Cole, Caleb Ellington, Jinyu Hou, Sohan Addagudi, Shentong Mo, Tianhua Tao, Dian Li, Yonghao Zhuang, Hongyi Wang, Xingyi Cheng, Le Song, and Eric P. Xing. The developers gratefully acknowledge the innumerable authors who contributed data to the NIH GEO, DISCO, hECA, CZ CELLxGENE, and Single Cell Portal repositories, and these organizations for hosting the data. Thanks to the Chan Zuckerberg Initiative for hosting model resources provided through this platform.
If you have recommendations for this model card please contact virtualcellmodels@chanzuckerberg.com.