scGPT
Version v1.0 released 26 Feb 2024
scGPT is a foundation model designed to integrate and analyze large-scale single-cell multi-omics data using a generative pre-trained transformer (GPT) architecture. It learns cell and gene representations from millions of single-cell transcriptomes and can be fine-tuned for downstream tasks such as batch correction, multi-omics integration, cell type annotation, gene network inference, and genetic perturbation prediction.
Model Details
Model Architecture
- Embedding size: 512
- Number of transformer blocks: 12
- Number of attention heads per block: 8
Parameters
53 million
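As a quick sanity check on the architecture figures above, the sketch below derives the per-head dimension and the number of attention projection weights. It assumes standard multi-head attention with four square projection matrices (query, key, value, output) per block, which this card does not state explicitly. Biases, feed-forward layers, and embedding tables are left out because their sizes are not given here.

```python
# Figures taken from the model card.
EMBEDDING_SIZE = 512
NUM_BLOCKS = 12
NUM_HEADS = 8

# Assumption: standard multi-head attention, where each head
# operates on an equal slice of the embedding.
head_dim = EMBEDDING_SIZE // NUM_HEADS

# Assumption: four d x d projection matrices (Q, K, V, output)
# per block, with bias terms ignored.
attention_weights_per_block = 4 * EMBEDDING_SIZE * EMBEDDING_SIZE
attention_weights_total = NUM_BLOCKS * attention_weights_per_block

print(f"head dimension: {head_dim}")
print(f"attention weights, all blocks: {attention_weights_total:,}")
```

Under these assumptions the attention projections would account for roughly 12.6 million of the 53 million parameters; the remainder would sit in components whose dimensions this card does not list.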
Model URI
Citation
Cui, H., Wang, C., Maan, H. et al. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nat Methods 21, 1470–1480 (2024). doi.org/10.1038/s41592-024-02201-0
Model Card Authors
Chan Zuckerberg Initiative
Model Card Contact
cellxgene@chanzuckerberg.com
Intended Use
Primary Use Cases
- Multi-batch integration: scGPT can integrate multiple scRNA-seq datasets, correcting for batch effects while preserving biological variance.
- Multi-omic integration: the scGPT framework can be extended to integrate data from multiple sequencing modalities, including scRNA-seq, scATAC-seq, and protein abundance data.
- Cell-type annotation: annotate single cells based on their gene expression profiles.
- Genetic perturbation prediction: predict the effects of genetic perturbations on gene expression.
- Gene network inference: construct gene similarity networks that reveal gene-gene interactions.
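To make the gene network inference use case more concrete, here is a minimal sketch of building a similarity network from gene embeddings. The card does not say which similarity measure or threshold is used, so cosine similarity and the `threshold` value are assumptions, and the embeddings and gene names are toy values invented for illustration; `build_similarity_network` is a hypothetical helper, not part of scGPT.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def build_similarity_network(gene_embeddings, threshold=0.8):
    """Return (gene_a, gene_b, score) edges for pairs scoring >= threshold.

    Hypothetical helper: the threshold is an illustrative assumption.
    """
    genes = sorted(gene_embeddings)
    edges = []
    for i, gene_a in enumerate(genes):
        for gene_b in genes[i + 1:]:
            score = cosine_similarity(
                gene_embeddings[gene_a], gene_embeddings[gene_b]
            )
            if score >= threshold:
                edges.append((gene_a, gene_b, score))
    return edges


# Toy 3-dimensional embeddings, invented for illustration only.
# Real gene embeddings would come from the model (512 dimensions).
toy_embeddings = {
    "GENE_A": [1.0, 0.0, 0.0],
    "GENE_B": [0.9, 0.1, 0.0],
    "GENE_C": [0.0, 1.0, 0.0],
}
edges = build_similarity_network(toy_embeddings, threshold=0.8)
for gene_a, gene_b, score in edges:
    print(f"{gene_a} - {gene_b}: {score:.3f}")
```

In this toy example only the GENE_A and GENE_B pair clears the threshold. Edges found this way indicate embedding similarity and would still need biological validation before being read as interactions.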
Out-of-Scope or Unauthorized Use Cases
Do not use the model for the following purposes:
- Predicting Individual Health Outcomes: Using the model to make medical diagnoses or personalized health predictions beyond research purposes.
- De-anonymizing Single-Cell Data: Attempting to identify individuals based on their single-cell gene expression data.
- Generating Synthetic Data for Misuse: Creating false or misleading single-cell data to support fraudulent scientific claims.
- Discriminatory Use: Using the model to reinforce biases in biological data analysis, leading to discriminatory outcomes in medical research.
- Bioweapon Research: Applying the model to enhance pathogenic studies or genetic engineering of harmful organisms.
- Any use that is prohibited by the Acceptable Use Policy or the MIT License.
Training Details
Training Data
The scGPT model is trained on non-spatial RNA sequencing data from the CZ CELLxGENE Discover Census.
Preprocessing
- Input: the initial input to scGPT is a raw count matrix (cell × gene).
- Tokenization: each gene is treated as a distinct token and is assigned a unique identifier.
- Value binning: expression counts are converted into relative values by binning.
- Condition tokens: these encode diverse meta information associated with individual genes, such as functional pathways (represented by pathway tokens) or perturbation experiment alterations (indicated by perturbation tokens).
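The tokenization and value binning steps above can be sketched as follows. This is an illustration under assumptions, not scGPT's own preprocessing code: the card does not give the number of bins or the exact binning rule, so the rank-based rule, the `n_bins` value, and the helper names `build_gene_vocab` and `bin_expression` are all hypothetical. Zero counts are kept as bin 0 here, which is also an assumption.

```python
from bisect import bisect_left


def build_gene_vocab(gene_names):
    """Assign each distinct gene name a unique integer identifier."""
    return {gene: idx for idx, gene in enumerate(sorted(set(gene_names)))}


def bin_expression(counts, n_bins):
    """Convert one cell's raw counts into relative bins.

    Zero counts stay in bin 0. Each nonzero count is placed in a bin
    from 1 to n_bins according to its rank among that cell's nonzero
    counts, so the result depends on relative, not absolute, expression.
    """
    nonzero = sorted(c for c in counts if c > 0)
    if not nonzero:
        return [0] * len(counts)
    binned = []
    for c in counts:
        if c <= 0:
            binned.append(0)
        else:
            # Number of nonzero values strictly below c.
            rank = bisect_left(nonzero, c)
            binned.append(1 + rank * n_bins // len(nonzero))
    return binned


# Toy cell with invented gene names and counts.
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"]
cell_counts = [0, 3, 12, 1, 7]

vocab = build_gene_vocab(genes)
gene_tokens = [vocab[g] for g in genes]
value_bins = bin_expression(cell_counts, n_bins=2)

print(gene_tokens)
print(value_bins)
```

Because the bins depend only on rank within a cell, multiplying every count in the cell by a constant (for example, a different sequencing depth) leaves the binned values unchanged.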
Training Hyperparameters
- Optimization:
  - Optimizer: The model is optimized using the Adam optimizer.
  - Batch Size: The mini-batch size used during training is 512.
  - Learning Rate: The initial learning rate is set to 0.0001.
  - Weight Decay: A weight decay of 0.9 is applied after each epoch.
  - Number of Epochs: The model is trained for a total of 6 epochs.
- Fine-tuning:
  - Learning Rate: The initial learning rate for fine-tuning is 0.0001, decaying by 10% after each epoch.
  - Mask Ratio for GEP and GEPC: This ratio is set to 0.4.
  - β Parameter in ECS: This parameter is set to 0.6.
  - ECS Weighting: When combined with other losses, the ECS objective is given a weighting of 10.
  - Train/Evaluation Split: Datasets are divided into 90% for training and 10% for evaluation.
  - Number of Epochs: The model is trained for 30 epochs for most fine-tuning tasks.
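A few of the fine-tuning settings listed above can be expressed as a short sketch. It reads "decaying by 10% after each epoch" as multiplying the learning rate by 0.9 once per epoch, and it assumes a random shuffle for the 90/10 split and uniform random selection of masked positions; the card does not specify any of these details, and the function names are hypothetical.

```python
import random

# Values taken from the fine-tuning hyperparameters in this card.
INITIAL_LR = 1e-4
LR_DECAY_PER_EPOCH = 0.9  # "decaying by 10% after each epoch"
MASK_RATIO = 0.4
TRAIN_FRACTION = 0.9


def learning_rate(epoch):
    """Learning rate at a given zero-based epoch (assumed exponential decay)."""
    return INITIAL_LR * LR_DECAY_PER_EPOCH ** epoch


def split_indices(n_cells, seed=0):
    """Shuffle cell indices and split them 90/10 (random split is an assumption)."""
    rng = random.Random(seed)
    indices = list(range(n_cells))
    rng.shuffle(indices)
    n_train = round(n_cells * TRAIN_FRACTION)
    return indices[:n_train], indices[n_train:]


def choose_masked_positions(n_genes, seed=0):
    """Pick the gene positions to mask (uniform sampling is an assumption)."""
    rng = random.Random(seed)
    n_masked = round(n_genes * MASK_RATIO)
    return sorted(rng.sample(range(n_genes), n_masked))


for epoch in range(3):
    print(f"epoch {epoch}: lr = {learning_rate(epoch):.2e}")

train_idx, eval_idx = split_indices(100)
print(len(train_idx), len(eval_idx))
print(choose_masked_positions(10))
```

Fixing the seed makes the split and the mask reproducible across runs, which helps when comparing fine-tuning results.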
Data Sources
The training data were drawn from the following source:
- CZI Single-Cell Biology Program, Abdulla, S., Aevermann, B., Assis, P., Badajoz, S., Bell, S. M., Bezzi, E., et al. (2023). CZ CELL×GENE Discover: A single-cell data platform for scalable exploration, analysis, and modeling of aggregated data. bioRxiv. https://doi.org/10.1101/2023.10.30.563174
Bias, Risks, and Limitations
Potential Biases
- The model may reproduce biases present in the training data, particularly for underrepresented tissues, cell types, or ethnicities, which can lead to skewed predictions.
- Specific groups or conditions (e.g., rare diseases or minority populations) may be underrepresented in the dataset, impacting generalizability.
Risks
Areas of risk include but are not limited to:
- Limited training data on rare cell types or conditions may result in incomplete predictions.
- Cell types may be mislabeled or go unrecognized.
- Outputs may be misused to support incorrect biological interpretations or medical advice.
Limitations
- The model’s performance may degrade when analyzing cell types, tissues, or species not well represented in the training data.
- The model may not perform well for datasets with unusual sequencing technologies or low-quality data.
Caveats and Recommendations
- Users should validate model outputs against independent datasets to mitigate biases and inaccuracies.
- It is advised to use the model in conjunction with expert biological knowledge, especially when working with novel or underrepresented data.
- Further development of the model should include expanding the diversity of the training data to reduce bias and improve generalizability across different cell types, tissues, and conditions.
- We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when using our services.
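One simple way to act on the recommendation to validate outputs against independent datasets is to compare predicted cell type labels with reference labels, one cell type at a time. The sketch below is a generic check, not part of scGPT; `per_class_recall` is a hypothetical helper and the labels are invented for illustration.

```python
from collections import Counter


def per_class_recall(true_labels, predicted_labels):
    """Fraction of cells of each reference type that were labeled correctly."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    totals = Counter(true_labels)
    correct = Counter(
        t for t, p in zip(true_labels, predicted_labels) if t == p
    )
    return {label: correct[label] / totals[label] for label in totals}


# Invented reference annotations and model predictions.
reference = ["T cell", "T cell", "T cell", "B cell", "B cell", "rare type"]
predicted = ["T cell", "T cell", "T cell", "B cell", "T cell", "T cell"]

recall = per_class_recall(reference, predicted)
for label, value in recall.items():
    print(f"{label}: {value:.2f}")
```

Reporting recall per cell type, rather than a single overall accuracy, makes it easier to spot rare or underrepresented types that the model handles poorly.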
Acknowledgements
Chan Zuckerberg Initiative, Bo Wang Group, Haotian Cui, Emanuele Bezzi