Quickstart: scVI

Learning Goals

This quickstart shows you how to use the scVI model to generate embeddings for single-cell analysis. Specifically, you will learn how to:

Access the model: we will use the scVI model, trained on 44 million cells from the CZ CELLxGENE Census.
Process your data through the model, generating embeddings that allow for comparisons between your dataset and the CZ CELLxGENE Census.

If you would like a full version of this tutorial, you can refer to this notebook.

Pre-requisites and requirements

This tutorial assumes a basic understanding of:

Python
AnnData
single-cell data analysis (see this tutorial for a primer on the subject).

You can run this tutorial locally, depending on your hardware. It was originally run on an M3 MacBook with 32 GiB of memory.

If running locally, environment setup will be addressed in a later section.

Overview

Below is an overview of what we will cover:

Environment Setup
Quickstart

Introduction

In this tutorial, we will explore how to use probabilistic modeling (in this example, scVI trained on the 74 million cells from CZ CELLxGENE) to generate embeddings from a raw gene expression count matrix.

Setup

Step 1: Colab (Recommended)

We recommend running this tutorial in Google Colab, since most of the environment setup is handled for you when you run this notebook.

To start, connect to the T4 GPU runtime hosted for free by Google Colab using the dropdown menu in the upper right hand corner of this notebook.

Note that this tutorial will use commands written for Google Colab, and some of those commands may need to be modified to work with other computing setups.

You can check which version of Python is running in the Colab environment with the following command.

!python --version
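The same check can be done from inside the notebook, which helps when the kernel's interpreter differs from the `python` on the shell path. This is a minimal sketch: the 3.11 target comes from the local Conda setup in this tutorial, the helper name `describe_python` is hypothetical, and labeling other versions as "untested" is an assumption rather than a statement about compatibility.

```python
import sys

# Python version used for the local Conda environment in this tutorial.
TARGET = (3, 11)


def describe_python(version_info=sys.version_info, target=TARGET):
    """Return a short message comparing the running interpreter to the target."""
    running = (version_info[0], version_info[1])
    label = f"{running[0]}.{running[1]}"
    if running == target:
        return f"Python {label}: matches the tutorial environment"
    return f"Python {label}: differs from {target[0]}.{target[1]}, which is untested here"


print(describe_python())
```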

To get started with this tutorial, you’ll need to set up the appropriate environment and download the necessary resources.

You can access the notebook for this tutorial in the following location. Additionally, you can download the rare disease dataset we’ll be analyzing from this link.

Step 1A: Local (Alternative)

We recommend setting up the tutorial environment with Conda to manage dependencies easily. If you don’t already have Conda installed, follow the installation instructions on this page. For this tutorial, the Miniconda distribution will work well.

Once Conda is installed, you can create and activate a Python 3.11-based Conda environment using the following bash (terminal) commands:

conda create -n "vcp-env" python=3.11
conda activate vcp-env

You will also need to download the necessary resources to run the tutorial locally. You can access the notebook for this tutorial in the following location, and you can download the rare disease dataset we’ll be analyzing from this link.

Step 2: Install requirements

Run the following command to install the required packages (you may see some errors from pip's dependency resolver, but these should not affect your ability to run the tutorial):

!pip install scvi-tools==1.2.0 tiledbsoma==1.14.4 'cellxgene_census[experimental]'==1.16.2 scanpy==1.10.3 jupyterlab jupyter ipywidgets
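Because pip may report resolver errors, it can be useful to confirm which versions actually ended up installed. The sketch below compares installed versions against the pins from the command above using the standard library's `importlib.metadata`. The helper name `check_versions` is hypothetical, and a mismatch does not necessarily mean the tutorial will fail.

```python
from importlib import metadata

# Versions pinned in the pip command above.
PINNED = {
    "scvi-tools": "1.2.0",
    "tiledbsoma": "1.14.4",
    "cellxgene-census": "1.16.2",
    "scanpy": "1.10.3",
}


def check_versions(pinned=PINNED):
    """Map each package name to (installed_version_or_None, matches_pin)."""
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (installed, installed == wanted)
    return report


for name, (installed, ok) in check_versions().items():
    status = "ok" if ok else "MISMATCH"
    print(f"{name}: installed={installed} pinned={PINNED[name]} [{status}]")
```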

Step 3: Download data

If you aren't running this tutorial in Colab, you can download the AnnData file for this tutorial here. Otherwise, we will download the dataset in the next code cell.

!gdown --fuzzy "https://drive.google.com/file/d/13v6fuGqdZvKeUp-XcWUqA7hWiE8Hiv7y/view?usp=sharing"

In the output of the code above, Colab will report the name and location of the downloaded file. To confirm that the file has been downloaded, you can list the contents of the current working directory.

!ls

# Output:
    sample_data  test_anndata.h5ad
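If you prefer to check from Python (for example, when running locally without shell access in the notebook), a small sketch using `pathlib` is shown here. The helper name `describe_file` is hypothetical, and the file name is the one shown in the listing above.

```python
from pathlib import Path


def describe_file(path):
    """Return (exists, size_in_bytes); size is 0 when the file is missing."""
    p = Path(path)
    if not p.is_file():
        return (False, 0)
    return (True, p.stat().st_size)


exists, size = describe_file("test_anndata.h5ad")
print(f"test_anndata.h5ad present: {exists}, size: {size / 1e6:.1f} MB")
```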

Step 4: Check environment

To verify that your environment is correctly configured:

Launch the notebook (if running locally):

jupyter lab ./vcp-tutorial-scvi.ipynb

Run the import statements below to ensure all dependencies are correctly installed. Note that some optional imports might be necessary depending on your compute environment (for example, if you are running this notebook within an HPC environment).

# utils
import os
import sys
import functools
import gc
import warnings
import pprint
import yaml
import json

warnings.filterwarnings('ignore')

#scvi-tools, CZ CELLxGENE Census and TileDB SOMA
import cellxgene_census
from cellxgene_census.experimental import get_embedding
import tiledbsoma as soma
# import tiledb.cloud  # optional; may be needed in some HPC and cluster environments
import scvi

#single cell
import scanpy as sc
import anndata as ad

#scientific computing
import numpy as np
import pandas as pd
import scipy.sparse as sp
from sklearn.ensemble import RandomForestClassifier

#plotting
import seaborn as sn
import matplotlib.pyplot as plt

# Configure Global Variables

## Set latest Census Version
CENSUS_VERSION = "2024-07-01"

# structure/filter notebook output
pp = pprint.PrettyPrinter(indent=4)
%matplotlib inline

Quickstart

Load the data

We will load the dataset, where the raw gene expression count matrix will be stored in adata.X, and the adata.var index will contain features labeled as Ensembl Gene IDs.

adata = sc.read('./test_anndata.h5ad')  # Replace './test_anndata.h5ad' with the path where your data is saved
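Since the model expects raw counts and Ensembl gene IDs, a quick sanity check before going further can save time. The sketch below is one way to do this under stated assumptions: the helper names are hypothetical, the pattern covers unversioned human Ensembl gene IDs only (`ENSG` followed by 11 digits), and small toy arrays stand in for `adata.X` and `adata.var_names` so that the block runs on its own.

```python
import re

import numpy as np

# Unversioned human Ensembl gene IDs: "ENSG" followed by 11 digits.
ENSEMBL_HUMAN = re.compile(r"^ENSG\d{11}$")


def looks_like_raw_counts(X):
    """True if every stored value is a non-negative integer."""
    # scipy sparse matrices expose .tocoo(); dense arrays do not.
    values = X.tocoo().data if hasattr(X, "tocoo") else np.asarray(X).ravel()
    return bool(np.all(values >= 0) and np.all(values == np.round(values)))


def fraction_ensembl(var_names):
    """Fraction of feature names matching the human Ensembl gene ID pattern."""
    names = list(var_names)
    if not names:
        return 0.0
    return sum(bool(ENSEMBL_HUMAN.match(str(n))) for n in names) / len(names)


# Toy stand-ins for adata.X and adata.var_names.
toy_X = np.array([[0, 3, 1], [2, 0, 0]], dtype=np.float32)
toy_names = ["ENSG00000141510", "ENSG00000012048", "TP53"]
print(looks_like_raw_counts(toy_X), fraction_ensembl(toy_names))
```

With the real object, the equivalent calls would be `looks_like_raw_counts(adata.X)` and `fraction_ensembl(adata.var_names)`. A fraction well below 1.0 suggests the features are labeled with gene symbols rather than Ensembl IDs.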

Adding metadata to the object

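# Total counts per cell; np.asarray(...).ravel() flattens the (n_obs, 1) matrix returned for sparse input
adata.obs["n_counts"] = np.asarray(adata.X.sum(axis=1)).ravel()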
adata.obs["joinid"] = list(range(adata.n_obs))

# initialize the batch to be unassigned. This could be any dummy value.
adata.obs["batch"] = "unassigned"

adata.obs['disease'] = "diffuse midline glioma" # add a general disease annotation to our data

Get model uri and download

# Retrieve scVI model metadata from the CZ CELLxGENE Census API
with cellxgene_census.open_soma(census_version=CENSUS_VERSION) as census:
    scvi_info = cellxgene_census.experimental.get_embedding_metadata_by_name(
        embedding_name="scvi",
        organism="homo_sapiens",
        census_version=CENSUS_VERSION,
    )

scvi_info["model_link"]
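The download step in this tutorial uses an HTTPS URL, while the metadata link may be given as an `s3://` URI. Assuming that is the case, the sketch below converts such a URI into the virtual-hosted HTTPS form. The helper name `s3_uri_to_https` is hypothetical, the example URI is inferred from the download URL used in this tutorial rather than copied from real metadata output, and the `us-west-2` region is likewise taken from that URL.

```python
from urllib.parse import urlparse


def s3_uri_to_https(uri, region="us-west-2"):
    """Convert s3://bucket/key to https://bucket.s3.<region>.amazonaws.com/key."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URI: {uri!r}")
    # netloc is the bucket name; path is the object key with a leading slash.
    return f"https://{parsed.netloc}.s3.{region}.amazonaws.com{parsed.path}"


# Assumed example value, inferred from the tutorial's download URL.
example = "s3://cellxgene-contrib-public/models/scvi/2024-07-01/homo_sapiens/model.pt"
print(s3_uri_to_https(example))
```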

We’ll use the following shell commands to create a directory for the model and download the model file into it. If wget is not installed, you can install it or use curl instead.

## download model
!mkdir -p scvi-human-2024-07-01
!wget --no-check-certificate -q -O scvi-human-2024-07-01/model.pt https://cellxgene-contrib-public.s3.us-west-2.amazonaws.com/models/scvi/2024-07-01/homo_sapiens/model.pt
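If neither wget nor curl is available, the download can also be done from Python with the standard library. This is a sketch rather than part of the original workflow: the helper name `download` is hypothetical, and the call is left commented out so the block does not touch the network when run.

```python
import shutil
import urllib.request
from pathlib import Path

# Same URL as in the wget command above.
MODEL_URL = (
    "https://cellxgene-contrib-public.s3.us-west-2.amazonaws.com"
    "/models/scvi/2024-07-01/homo_sapiens/model.pt"
)


def download(url, dest):
    """Stream url to dest, creating parent directories; return bytes written."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)  # copies in chunks, not all at once
    return dest.stat().st_size


# Uncomment to download the model into the directory used in this tutorial:
# download(MODEL_URL, "scvi-human-2024-07-01/model.pt")
```

Unlike the wget command with `--no-check-certificate`, `urllib.request.urlopen` verifies TLS certificates by default.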

Prepare data and generate embeddings

scvi.model.SCVI.prepare_query_anndata(adata,
                                      "scvi-human-2024-07-01",
                                      inplace=True)

vae_q = scvi.model.SCVI.load_query_data(
    adata,
    "scvi-human-2024-07-01",
)

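# Mark the model as trained so that get_latent_representation() runs a
# simple forward pass with the pretrained weights, without further training
vae_q.is_trained = True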
latent = vae_q.get_latent_representation()
adata.obsm["scvi"] = latent

We can inspect the dimensions of our new embeddings. This representation of the data can be treated as a set of features, allowing it to serve as input for standard clustering and visualization algorithms. Note that the dimensionality of our scVI embedding space is 50, which is much smaller than the original gene expression space.

adata.obsm['scvi'].shape

# Output:
    (4417, 50)
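To illustrate what treating the embedding as a feature matrix looks like, the sketch below finds each cell's nearest neighbors by Euclidean distance in a 50-dimensional space. Random numbers stand in for `adata.obsm["scvi"]`, the helper name `nearest_neighbors` is hypothetical, and the brute-force approach is chosen for clarity rather than speed.

```python
import numpy as np


def nearest_neighbors(embedding, k=5):
    """Indices of the k nearest rows (Euclidean) for every row, excluding itself."""
    sq_norms = (embedding ** 2).sum(axis=1)
    # Squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * embedding @ embedding.T
    np.fill_diagonal(d2, np.inf)  # a cell is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]


rng = np.random.default_rng(0)
toy_embedding = rng.normal(size=(200, 50))  # stand-in for adata.obsm["scvi"]
neighbors = nearest_neighbors(toy_embedding, k=5)
print(neighbors.shape)
```

With the real data, scanpy can build the neighbor graph directly from the embedding with `sc.pp.neighbors(adata, use_rep="scvi")`, after which `sc.tl.umap(adata)` produces a 2D layout for visualization.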

References

Please refer to the following papers for more information about:

Development of the scVI model:

Lopez, R., Regier, J., Cole, M.B. et al. Deep generative modeling for single-cell transcriptomics. Nat Methods 15, 1053–1058 (2018). https://doi.org/10.1038/s41592-018-0229-2

Training data for the scVI model via CZ CELLxGENE Discover and Census API:

CZI Single-Cell Biology, et al. CZ CELLxGENE Discover: A single-cell data platform for scalable exploration, analysis and modeling of aggregated data. bioRxiv 2023.10.30.563174 (2023). https://doi.org/10.1101/2023.10.30.563174

Example Dataset:

Liu, I., Jiang, L., Samuelsson, E.R. et al. The landscape of tumor cell states and spatial organization in H3-K27M mutant diffuse midline glioma across age and location. Nat Genet 54, 1881–1894 (2022). https://doi.org/10.1038/s41588-022-01236-3

Responsible Use

We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when engaging with our services.

Should you have any security or privacy issues or questions related to the services, please reach out to our team at security@chanzuckerberg.com or privacy@chanzuckerberg.com respectively.
