scVI trained on CELLxGENE Census (Homo sapiens)
Version 2024-07-01 released 01 Jul 2024
- Nir Yosef Lab
- CZ CELLxGENE
Single-cell Variational Inference (scVI) is a probabilistic deep generative model designed to analyze single-cell RNA sequencing (scRNA-seq) data. It is based on a variational autoencoder (VAE) architecture and uses deep neural networks to model gene expression levels while correcting for batch effects and technical variability. scVI is trained on large-scale single-cell datasets, allowing it to perform robustly across tasks like batch correction, dimensionality reduction, clustering, and differential expression analysis.
Model Details
Model Architecture
Layers: 2, Hidden Units: 512, Latent Dimensions: 50
Parameters
About 7.1 million parameters.
Citation
Lopez, R., Regier, J., Cole, M. B., Jordan, M. I., & Yosef, N. (2018). Single-cell Variational Inference (scVI): A deep generative model for single-cell RNA sequencing data. Nature Methods, 15(12), 1053–1058. doi:10.1038/s41592-018-0229-2
Model Card Authors
Chan Zuckerberg Initiative
Model Card Contact
cellxgene@chanzuckerberg.com
Getting Started
Intended Use
Primary Use Cases
- Batch Correction: Adjusts for batch effects in scRNA-seq datasets, ensuring more accurate downstream analysis.
- Dimensionality Reduction: Embeds high-dimensional single-cell data into a low-dimensional latent space for visualization.
- Clustering: Groups similar cells based on their gene expression profiles, identifying cell subpopulations.
- Differential Expression Analysis: Identifies genes with varying expression levels between different cell populations.
- Imputation: Fills in missing gene expression values in single-cell datasets to reduce noise and enhance analysis accuracy.
- Data Integration: Merges multiple datasets while accounting for technical variability.
Out-of-Scope or Unauthorized Use Cases
Do not use the model for the following purposes:
- Predicting Individual Health Outcomes: Using the model to make medical diagnoses or personalized health predictions beyond research purposes.
- De-anonymizing Single-Cell Data: Attempting to identify individuals based on their single-cell gene expression data.
- Generating Synthetic Data for Misuse: Creating false or misleading single-cell data to support fraudulent scientific claims.
- Discriminatory Use: Using the model to reinforce biases in biological data analysis, leading to discriminatory outcomes in medical research.
- Bioweapon Research: Applying the model to enhance pathogenic studies or genetic engineering of harmful organisms.
- Any use that is prohibited by the Acceptable Use Policy for AI or the BSD 3-Clause License.
Intended Users
- Computational Biologists: Researchers analyzing single-cell RNA sequencing data for biological insights, such as cell types and gene expression patterns.
- Bioinformaticians: Professionals who integrate and preprocess multi-omics data for large-scale biological analysis and data integration.
- Genomics Researchers: Scientists focusing on understanding cellular diversity, differentiation, and disease mechanisms using single-cell datasets.
- Data Scientists in Life Sciences: Experts applying machine learning techniques to biological datasets for exploratory analysis and hypothesis generation.
- Biotech Companies: Organizations developing products or therapeutics based on single-cell technologies for drug discovery or diagnostics.
Training Details
Complete training procedure described here: https://github.com/chanzuckerberg/cellxgene-census/tree/main/tools/models/scvi
Training Data
The scVI model is trained on non-spatial, human RNA sequencing data from the CZ CELLxGENE Discover Census, version 2024-07-01 (74.3M cells).
Preprocessing
The model uses raw count data from human cells sourced through the CELLxGENE Census API. Cells are filtered on metadata (e.g., is_primary_data == True), and a minimum non-zero count threshold is applied. The top 8000 highly variable genes (HVGs) are then selected, stratified by batch variables such as suspension type and assay type.
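The filtering and batch-stratified gene selection described above can be sketched as follows. This is a minimal illustration on synthetic counts, not the training pipeline: the variance-to-mean ranking is a simplified stand-in for whatever HVG method the actual procedure uses, and the function name `select_hvgs`, the threshold of 10 non-zero genes, and the toy data are all assumptions made for this example.

```python
import numpy as np

def select_hvgs(counts, batches, n_top):
    """Rank genes by variance-to-mean ratio within each batch, then keep the
    genes with the best median rank across batches (simplified stand-in)."""
    ranks = []
    for b in np.unique(batches):
        x = counts[batches == b]
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        dispersion = var / np.maximum(mean, 1e-12)
        # Rank 0 is the most variable gene within this batch.
        order = np.argsort(-dispersion, kind="stable")
        rank = np.empty_like(order)
        rank[order] = np.arange(order.size)
        ranks.append(rank)
    median_rank = np.median(np.stack(ranks), axis=0)
    return np.sort(np.argsort(median_rank, kind="stable")[:n_top])

# Synthetic data (hypothetical): 200 cells x 50 genes, first 5 genes overdispersed.
rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(200, 50))
counts[:, :5] = rng.negative_binomial(1, 0.1, size=(200, 5))
is_primary_data = rng.random(200) < 0.9
batches = rng.integers(0, 2, size=200)

# Keep primary cells with at least 10 genes detected (threshold is an assumption).
keep = is_primary_data & ((counts > 0).sum(axis=1) >= 10)
hvg = select_hvgs(counts[keep], batches[keep], n_top=5)
```

Ranking within each batch before aggregating keeps a single assay or suspension type from dominating the selected gene set, which is the point of stratifying the selection.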
Training Hyperparameters
- Learning Rate: 1.0e-4
- Precision: fp32 (standard single-precision floating point)
- Batch Size: 1024
- Max Epochs: 100
- KL Warm-up Epochs: 20
- Max KL Weight: 1
- Early Stopping: Not enabled by default
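To make the KL warm-up settings concrete, the sketch below computes a per-epoch KL weight. It assumes a linear ramp starting from a weight of 0, which this card does not state; only the 20 warm-up epochs, the maximum weight of 1, and the 100 maximum epochs come from the list above. The function name `kl_weight` is hypothetical.

```python
def kl_weight(epoch, warmup_epochs=20, max_weight=1.0, min_weight=0.0):
    """Linearly ramp the KL weight from min_weight to max_weight over the warm-up."""
    if epoch < warmup_epochs:
        return min_weight + (max_weight - min_weight) * epoch / warmup_epochs
    return max_weight

# KL weight applied at each of the 100 training epochs (0-indexed).
schedule = [kl_weight(epoch) for epoch in range(100)]
```

Under these assumptions the KL term is ignored at the first epoch, contributes at half weight at epoch 10, and is fully weighted from epoch 20 onward.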
Data Sources
The model was trained on the following types of datasets:
- Chan Zuckerberg Initiative Single-Cell Biology Program, Abdulla, S., Aevermann, B., Assis, P., Badajoz, S., Bell, S. M., Bezzi, E., et al. (2023). CZ CELL×GENE Discover: A single-cell data platform for scalable exploration, analysis, and modeling of aggregated data. bioRxiv. https://doi.org/10.1101/2023.10.30.563174
- Data Card for CZ CELLxGENE Discover
Performance Metrics
Evaluation Protocols
Benchmarks of single-cell Census models (cellxgene-census documentation)
Metrics
The model was evaluated using a range of benchmarks to measure its performance. Key metrics include:
Type | Mode | Metric | Description |
---|---|---|---|
Bio-conservation | Embedding Space | leiden_nmi | Normalized Mutual Information of biological labels and Leiden clusters. Described in Luecken et al. and implemented in scib-metrics. |
Bio-conservation | Embedding Space | leiden_ari | Adjusted Rand Index of biological labels and Leiden clusters. Described in Luecken et al. and implemented in scib-metrics. |
Bio-conservation | Embedding Space | silhouette_label | Silhouette score with respect to biological labels. Described in Luecken et al. and implemented in scib-metrics. |
Bio-conservation | Label Classifier | classifier_svm | Accuracy of biological label prediction using an SVM (60/40 train/test split). Implemented here. |
Bio-conservation | Label Classifier | classifier_forest | Accuracy of biological label prediction using a random forest classifier (60/40 train/test split). Implemented here. |
Bio-conservation | Label Classifier | classifier_lr | Accuracy of biological label prediction using a logistic regression classifier (60/40 train/test split). Implemented here. |
Batch-correction | Embedding Space | silhouette_batch | 1 - silhouette score with respect to batch labels. Described in Luecken et al. and implemented in scib-metrics. |
Batch-correction | Embedding Space | entropy | Average of neighborhood entropy of batch labels per cell. Implemented here. |
Batch-correction | Label Classifier | classifier_svm | 1 - accuracy of batch label prediction using an SVM (60/40 train/test split). Implemented here. |
Batch-correction | Label Classifier | classifier_forest | 1 - accuracy of batch label prediction using a random forest classifier (60/40 train/test split). Implemented here. |
Batch-correction | Label Classifier | classifier_lr | 1 - accuracy of batch label prediction using a logistic regression classifier (60/40 train/test split). Implemented here. |
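As an illustration of the entropy metric in the table, the sketch below averages the Shannon entropy of batch labels over each cell's k nearest neighbors in the embedding. It is not the Census benchmark implementation: the function name `neighborhood_batch_entropy`, the choice of k = 15, the Euclidean distance, and the use of natural logarithms are assumptions for this example, and the brute-force distance matrix is only suitable for small inputs.

```python
import numpy as np

def neighborhood_batch_entropy(embedding, batch_labels, k=15):
    """Mean Shannon entropy (in nats) of batch labels among each cell's k nearest neighbors."""
    embedding = np.asarray(embedding, dtype=float)
    _, codes = np.unique(batch_labels, return_inverse=True)
    n_batches = codes.max() + 1
    # Brute-force squared Euclidean distances between all pairs of cells.
    sq = (embedding ** 2).sum(axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * embedding @ embedding.T
    np.fill_diagonal(dist, np.inf)  # exclude each cell from its own neighborhood
    neighbors = np.argsort(dist, axis=1)[:, :k]
    entropies = np.empty(len(embedding))
    for i, idx in enumerate(neighbors):
        p = np.bincount(codes[idx], minlength=n_batches) / k
        p = p[p > 0]
        entropies[i] = -(p * np.log(p)).sum()
    return entropies.mean()

# Toy embeddings (hypothetical): two batches of 50 cells each in 2 dimensions.
rng = np.random.default_rng(0)
batch = np.repeat(["a", "b"], 50)
mixed = rng.normal(size=(100, 2))              # both batches share one distribution
separated = mixed + np.where(batch == "a", 0.0, 100.0)[:, None]  # batches far apart

entropy_mixed = neighborhood_batch_entropy(mixed, batch)
entropy_separated = neighborhood_batch_entropy(separated, batch)
```

With two batches, the score ranges from 0 (every neighborhood contains a single batch) up to log 2 (neighborhoods are evenly mixed), so higher values indicate better batch mixing.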
Evaluation Datasets
The model was evaluated with the following types of datasets:
- The bio-conservation metrics were computed on the following biological labels in Census cells from adipose tissue and spinal cord:
  - Cell subclass: a higher-level (coarser) grouping of cell types with a maximum of 73 unique labels, as defined on the CELLxGENE collection page.
  - Cell class: an even higher-level grouping of cell types with a maximum of 22 unique labels, also defined on the CELLxGENE collection page.
- The batch-correction metrics were computed on the following batch labels in Census cells from adipose tissue and spinal cord:
  - Assay: the sequencing technology.
  - Dataset: the dataset from which the cell originated.
  - Suspension type: cell vs. nucleus.
  - Batch: the concatenation of the values of all of the above.
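The composite batch label from the list above can be sketched as a simple string join. The field names, the underscore separator, and the example values below are assumptions for illustration; the card does not specify how the values are concatenated.

```python
def make_batch_label(assay, dataset_id, suspension_type, sep="_"):
    """Join the three batch covariates into one composite label (separator is an assumption)."""
    return sep.join([assay, dataset_id, suspension_type])

# Hypothetical cell metadata records.
cells = [
    {"assay": "10x 3' v3", "dataset_id": "dataset_A", "suspension_type": "cell"},
    {"assay": "10x 3' v3", "dataset_id": "dataset_A", "suspension_type": "nucleus"},
    {"assay": "Smart-seq2", "dataset_id": "dataset_B", "suspension_type": "cell"},
]
batch_labels = [
    make_batch_label(c["assay"], c["dataset_id"], c["suspension_type"]) for c in cells
]
```

Two cells share a composite label only when they match on assay, dataset, and suspension type, so this label is at least as fine-grained as any of the three individual batch labels.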
Evaluation Results
Evaluation Metrics URL
Benchmarks of single-cell Census models (cellxgene-census documentation)
Bio-conservation
Batch-correction
Bias, Risks, and Limitations
Potential Biases
- The model may exhibit biases present in the training data, particularly from underrepresented tissues, cell types, or ethnicities, leading to skewed predictions.
- Specific groups or conditions (e.g., rare diseases or minority populations) may be underrepresented in the dataset, impacting generalizability.
Risks
Areas of risk include but are not limited to:
- Limited training data on rare cell types or conditions may result in incomplete predictions.
- Mislabeling or failing to recognize cell types accurately.
- Potential misuse for incorrect biological interpretations or medical advice.
Limitations
- The model's performance may degrade when analyzing cell types, tissues, or species not well represented in the training data.
- The model may not perform well for datasets with unusual sequencing technologies or low-quality data.
Caveats and Recommendations
- Users should validate model outputs against independent datasets to mitigate biases and inaccuracies.
- It is advised to use the model in conjunction with expert biological knowledge, especially when working with novel or underrepresented data.
- Further development of the model should include expanding the diversity of the training data to reduce bias and improve generalizability across different cell types, tissues, and conditions.
- We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when using our services.
How to Use the Model
Installation
Example Code
Check out the scVI tutorial for the full set of example code.
Acknowledgements
Chan Zuckerberg Initiative, Nir Yosef Group, Can Ergen, scvi-tools, Emanuele Bezzi