scVI trained on CELLxGENE Census (Homo sapiens)

Version 2024-07-01 released 01 Jul 2024

License

Repository

Developed By

Nir Yosef Lab
CZ CELLxGENE

Single-cell Variational Inference (scVI) is a probabilistic deep generative model designed to analyze single-cell RNA sequencing (scRNA-seq) data. It is based on a variational autoencoder (VAE) architecture and uses deep neural networks to model gene expression levels while correcting for batch effects and technical variability. scVI is trained on large-scale single-cell datasets, allowing it to perform robustly across tasks like batch correction, dimensionality reduction, clustering, and differential expression analysis.

Model Details

Model Architecture

SCVI config

Layers: 2, Hidden Units: 512, Latent Dimensions: 50
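As a rough illustration, the configuration above can be written with the keyword names used by the scvi-tools `scvi.model.SCVI` constructor. The helper `encoder_layer_shapes` is hypothetical and only lists the dense-layer sizes implied by those numbers. It assumes the input is the 8000 highly variable genes described under Preprocessing and ignores any batch covariates fed into the network, so it is not a parameter count.

```python
# Keyword names follow scvi-tools' scvi.model.SCVI constructor;
# the values are the ones listed in this model card.
SCVI_MODEL_KWARGS = {"n_layers": 2, "n_hidden": 512, "n_latent": 50}


def encoder_layer_shapes(n_input, n_layers, n_hidden, n_latent):
    """Hypothetical helper: (in, out) sizes of the encoder's dense layers.

    Lists the hidden layers followed by the latent head. Batch covariates
    that may be concatenated to the input are ignored (an assumption).
    """
    dims = [n_input] + [n_hidden] * n_layers
    hidden = list(zip(dims[:-1], dims[1:]))
    return hidden + [(n_hidden, n_latent)]


# Assumption: the model input is the 8000 selected highly variable genes.
shapes = encoder_layer_shapes(8000, **SCVI_MODEL_KWARGS)
print(shapes)
```

With scvi-tools installed and an AnnData object registered for scVI, the same dictionary could be passed as `scvi.model.SCVI(adata, **SCVI_MODEL_KWARGS)`.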

Parameters

About 7.1 million parameters.

Citation

Lopez, R., Regier, J., Cole, M. B., Jordan, M. I., & Yosef, N. (2018). Single-cell Variational Inference (scVI): A deep generative model for single-cell RNA sequencing data. Nature Methods, 15(12), 1053–1058. doi:10.1038/s41592-018-0229-2

Model Card Authors

Maximilian Lombardo (Chan Zuckerberg Initiative)

Primary Contact Email

cellxgene@chanzuckerberg.com

To submit feature requests or report issues with the model, please open an issue on the GitHub repository.

Getting Started

scVI for cell type prediction and data projection — cellxgene-census documentation

Intended Use

Primary Use Cases

Batch Correction: Adjusts for batch effects in scRNA-seq datasets, ensuring more accurate downstream analysis.
Dimensionality Reduction: Embeds high-dimensional single-cell data into a low-dimensional latent space for visualization.
Clustering: Groups similar cells based on their gene expression profiles, identifying cell subpopulations.
Differential Expression Analysis: Identifies genes with varying expression levels between different cell populations.
Imputation: Fills in missing gene expression values in single-cell datasets to reduce noise and enhance analysis accuracy.
Data Integration: Merges multiple datasets while accounting for technical variability.

Out-of-Scope or Unauthorized Use Cases

Do not use the model for the following purposes:

Predicting Individual Health Outcomes: Using the model to make medical diagnoses or personalized health predictions beyond research purposes.
De-anonymizing Single-Cell Data: Attempting to identify individuals based on their single-cell gene expression data.
Generating Synthetic Data for Misuse: Creating false or misleading single-cell data to support fraudulent scientific claims.
Discriminatory Use: Using the model to reinforce biases in biological data analysis, leading to discriminatory outcomes in medical research.
Bioweapon Research: Applying the model to enhance pathogenic studies or genetic engineering of harmful organisms.
Any use that is prohibited by the Acceptable Use Policy for AI or the BSD 3-Clause License.

Intended Users

Computational Biologists: Researchers analyzing single-cell RNA sequencing data for biological insights, such as cell types and gene expression patterns.
Bioinformaticians: Professionals who integrate and preprocess multi-omics data for large-scale biological analysis and data integration.
Genomics Researchers: Scientists focusing on understanding cellular diversity, differentiation, and disease mechanisms using single-cell datasets.
Data Scientists in Life Sciences: Experts applying machine learning techniques to biological datasets for exploratory analysis and hypothesis generation.
Biotech Companies: Organizations developing products or therapeutics based on single-cell technologies for drug discovery or diagnostics.

Training Details

Complete training procedure described here: https://github.com/chanzuckerberg/cellxgene-census/tree/main/tools/models/scvi

Training Data

The scVI model is trained on non-spatial, human RNA sequencing data from the CZ CELLxGENE Discover Census, version 2024-07-01 (74.3M cells).

Preprocessing

The model uses raw count data from human cells sourced through the CZ CELLxGENE Census API. Cells and genes are filtered based on metadata (e.g., is_primary_data == True, with a minimum non-zero count threshold). Highly variable genes (HVGs) are selected (top 8000), stratified by batch variables like suspension type and assay type.
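The filtering and gene-selection steps can be sketched on a toy count matrix. This is a simplified stand-in, not the pipeline used to train the model: the function names are hypothetical, the variance-to-mean ratio is an assumed ranking score, and the per-batch stratification mentioned above is left out.

```python
from statistics import mean, pvariance


def filter_cells(counts, is_primary, min_nonzero):
    """Keep cells flagged as primary data that express at least min_nonzero genes."""
    return [
        row
        for row, primary in zip(counts, is_primary)
        if primary and sum(1 for c in row if c > 0) >= min_nonzero
    ]


def top_variable_genes(counts, n_top):
    """Return column indices of the n_top genes with the highest variance-to-mean ratio."""
    n_genes = len(counts[0])
    scores = []
    for g in range(n_genes):
        column = [row[g] for row in counts]
        mu = mean(column)
        scores.append(pvariance(column) / mu if mu > 0 else 0.0)
    ranked = sorted(range(n_genes), key=lambda g: scores[g], reverse=True)
    return sorted(ranked[:n_top])


# Toy raw counts: rows are cells, columns are genes.
counts = [
    [0, 5, 1, 0],
    [3, 5, 0, 0],
    [0, 4, 9, 0],
    [1, 0, 0, 0],  # only one expressed gene, dropped by the threshold
    [2, 5, 7, 1],  # not primary data, dropped by the metadata filter
]
is_primary = [True, True, True, True, False]

kept = filter_cells(counts, is_primary, min_nonzero=2)
hvg = top_variable_genes(kept, n_top=2)
print(len(kept), hvg)
```

In the real setting `n_top` would be 8000 and the ranking would be computed within each batch group before combining.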

Training Hyperparameters

Learning Rate: 1.0e-4
Precision: fp32 (standard single-precision floating point)
Batch Size: 1024
Max Epochs: 100
KL Warm-up Epochs: 20
Max KL Weight: 1
Early Stopping: Not enabled by default
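To make the KL warm-up setting concrete, here is a minimal sketch of a linear warm-up schedule. It assumes the KL weight starts at 0 and rises linearly to the maximum weight over the warm-up epochs. The starting value and the dictionary key names are assumptions for illustration, not values taken from the training code.

```python
# Values from the list above; the key names are illustrative only.
TRAINING_HYPERPARAMETERS = {
    "learning_rate": 1.0e-4,
    "batch_size": 1024,
    "max_epochs": 100,
    "kl_warmup_epochs": 20,
    "max_kl_weight": 1.0,
}


def kl_weight(epoch, n_warmup_epochs=20, max_weight=1.0, min_weight=0.0):
    """Linear KL warm-up: ramp from min_weight to max_weight, then stay constant."""
    if n_warmup_epochs <= 0:
        return max_weight
    fraction = min(epoch / n_warmup_epochs, 1.0)
    return min_weight + fraction * (max_weight - min_weight)


# The training loss at a given epoch would then be
# reconstruction_loss + kl_weight(epoch) * kl_divergence.
schedule = [kl_weight(e) for e in (0, 5, 10, 20, 99)]
print(schedule)
```

A gradual ramp like this lets the model fit the reconstruction term first, before the KL term pulls the latent space toward the prior.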

Data Sources

The model was trained on the following types of datasets:

Chan Zuckerberg Initiative Single-Cell Biology Program, Abdulla, S., Aevermann, B., Assis, P., Badajoz, S., Bell, S. M., Bezzi, E., et al. (2023). CZ CELL×GENE Discover: A single-cell data platform for scalable exploration, analysis, and modeling of aggregated data. bioRxiv. https://doi.org/10.1101/2023.10.30.563174
Data Card for CZ CELLxGENE Discover

Performance Metrics

Evaluation Protocols

Benchmarks of single-cell Census models — cellxgene-census documentation

Metrics

The model was evaluated using a range of benchmarks to measure its performance. Key metrics include:

Type	Mode	Metric	Description
Bio-conservation	Embedding Space	leiden_nmi	Normalized Mutual Information of biological labels and Leiden clusters. Described in Luecken et al. and implemented in scib-metrics.
Bio-conservation	Embedding Space	leiden_ari	Adjusted Rand Index of biological labels and Leiden clusters. Described in Luecken et al. and implemented in scib-metrics.
Bio-conservation	Embedding Space	silhouette_label	Silhouette score with respect to biological labels. Described in Luecken et al. and implemented in scib-metrics.
Bio-conservation	Label Classifier	classifier_svm	Accuracy of biological label prediction using an SVM (60/40 train/test split). Implemented here.
Bio-conservation	Label Classifier	classifier_forest	Accuracy of biological label prediction using a random forest classifier (60/40 train/test split). Implemented here.
Bio-conservation	Label Classifier	classifier_lr	Accuracy of biological label prediction using a logistic regression classifier (60/40 train/test split). Implemented here.
Batch-correction	Embedding Space	silhouette_batch	1 - silhouette score with respect to batch labels. Described in Luecken et al. and implemented in scib-metrics.
Batch-correction	Embedding Space	entropy	Average of neighborhood entropy of batch labels per cell. Implemented here.
Batch-correction	Label Classifier	classifier_svm	1 - accuracy of batch label prediction using an SVM (60/40 train/test split). Implemented here.
Batch-correction	Label Classifier	classifier_forest	1 - accuracy of batch label prediction using a random forest classifier (60/40 train/test split). Implemented here.
Batch-correction	Label Classifier	classifier_lr	1 - accuracy of batch label prediction using a logistic regression classifier (60/40 train/test split). Implemented here.
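Two of the metrics in the table can be illustrated in plain Python. The first function is the standard Adjusted Rand Index. The second is one plausible reading of "neighborhood entropy of batch labels": the Shannon entropy of batch labels among each cell's k nearest neighbors, averaged over cells. The neighbor count, distance metric, and logarithm base are assumptions, and neither function is the benchmark's own implementation.

```python
from collections import Counter
from math import comb, dist, log


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same cells."""
    n = len(labels_true)
    if n < 2:
        return 1.0
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def mean_neighborhood_entropy(embedding, batch_labels, k):
    """Average Shannon entropy of batch labels among each cell's k nearest neighbors."""
    total = 0.0
    for i, point in enumerate(embedding):
        others = sorted(
            (j for j in range(len(embedding)) if j != i),
            key=lambda j: dist(point, embedding[j]),
        )
        neighbors = others[:k]
        batch_counts = Counter(batch_labels[j] for j in neighbors)
        m = len(neighbors)
        total += -sum((c / m) * log(c / m) for c in batch_counts.values())
    return total / len(embedding)


# Clusters that match the labels up to renaming give an ARI of 1.
ari_perfect = adjusted_rand_index(["T", "T", "B", "B"], [1, 1, 0, 0])

# Two well-separated groups of cells in a toy 2-D embedding.
embedding = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
entropy_separated = mean_neighborhood_entropy(embedding, list("aaabbb"), k=2)
entropy_mixed = mean_neighborhood_entropy(embedding, list("ababab"), k=2)
print(ari_perfect, entropy_separated, entropy_mixed)
```

Higher neighborhood entropy means batches are better mixed in the embedding, which is why it appears under batch correction, while ARI rewards clusters that agree with biological labels.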

Evaluation Datasets

The model was evaluated with the following types of datasets:

The bio-conservation metrics were run on the following biological labels in Census cells from adipose tissue and spinal cord:
- Cell subclass: a higher-level definition of a cell type, with a maximum of 73 unique labels, as defined on the CZ CELLxGENE collection page.
- Cell class: an even higher-level definition of a cell type, with a maximum of 22 unique labels, also defined on the CZ CELLxGENE collection page.
The batch-correction metrics were run on the following batch labels in Census cells from adipose tissue and spinal cord:
- Assay: the sequencing technology.
- Dataset: the dataset from which the cell originated.
- Suspension type: cell vs nucleus.
- Batch: the concatenation of values for all of the above.
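The combined batch label can be sketched as a simple string concatenation of the three values. The function name and the separator are assumptions for illustration. The benchmark code may join the values differently.

```python
def make_batch_label(assay, dataset_id, suspension_type, sep="_"):
    """Concatenate assay, dataset, and suspension type into one batch label."""
    return sep.join([assay, dataset_id, suspension_type])


# Hypothetical metadata values for a single cell.
label = make_batch_label("10x 3' v3", "dataset-A", "nucleus")
print(label)
```

Two cells share a batch label only if they agree on all three fields, so this label is the finest-grained of the four batch definitions.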

Evaluation Results

Evaluation Metrics URL

Benchmarks of single-cell Census models — cellxgene-census documentation

Bio-conservation

Batch-correction

Biases, Risks, and Limitations

Potential Biases

The model may exhibit biases present in the training data, particularly for underrepresented tissues, cell types, or ethnicities, leading to skewed predictions.
Specific groups or conditions (e.g., rare diseases or minority populations) may be underrepresented in the dataset, impacting generalizability.

Risks

Areas of risk include but are not limited to:

Limited training data on rare cell types or conditions may result in incomplete predictions.
Mislabeling or failing to recognize cell types accurately.
Potential misuse for incorrect biological interpretations or medical advice.

Limitations

The model's performance may degrade when analyzing cell types, tissues, or species not well represented in the training data.
The model may not perform well for datasets with unusual sequencing technologies or low-quality data.

Caveats and Recommendations

Users should validate model outputs against independent datasets to mitigate biases and inaccuracies.
It is advised to use the model in conjunction with expert biological knowledge, especially when working with novel or underrepresented data.
Further development of the model should include expanding the diversity of the training data to reduce bias and improve generalizability across different cell types, tissues, and conditions.
We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when using our services.

Should you have any security or privacy issues or questions related to this model, please reach out to our team at security@chanzuckerberg.com or privacy@chanzuckerberg.com, respectively.

How to Use the Model

Installation

scVI for cell type prediction and data projection — cellxgene-census documentation

Example Code

Check out the scVI tutorial for the full set of example code.
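As a rough sketch of the projection workflow, the function below loads a pretrained scVI model around a query dataset and stores the latent embedding. It assumes scvi-tools is installed, that `model_dir` is a local folder containing the downloaded model files, and that `adata` already carries the gene identifiers and `obs` columns the model expects (the tutorial covers that preparation). The function name and the `"scvi"` key are hypothetical.

```python
def embed_with_pretrained_scvi(adata, model_dir):
    """Project `adata` into the latent space of a pretrained scVI model.

    Assumes `adata` is an AnnData object with raw counts and the metadata
    the model was registered with, and `model_dir` holds the model files.
    """
    import scvi  # imported lazily so this module loads without scvi-tools

    # Align the query genes to the gene set the model was trained on.
    scvi.model.SCVI.prepare_query_anndata(adata, model_dir)
    # Load the pretrained weights around the query data.
    model = scvi.model.SCVI.load_query_data(adata, model_dir)
    # Use the weights as they are, without fine-tuning on the query data.
    model.is_trained = True
    # Store the latent embedding under a key chosen for this example.
    adata.obsm["scvi"] = model.get_latent_representation()
    return adata
```

The resulting `adata.obsm["scvi"]` matrix can then be passed to neighbor-graph, clustering, or visualization tools that accept a precomputed representation.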

Acknowledgements

Chan Zuckerberg Initiative, Nir Yosef Group, Can Ergen, scvi-tools, Emanuele Bezzi