Quick Start: scGPT

This quick start will guide you through using the scGPT model, trained on 33 million cells (including data from the CELLxGENE Census), to generate embeddings for single-cell transcriptomic data analysis.

Learning Goals

By the end of this tutorial, you will understand how to:

  1. Access and prepare the scGPT model for use.
  2. Generate embeddings to analyze and compare your dataset against the CELLxGENE Census.
  3. Visualize the results using a UMAP, colored by cell type.

Pre-requisites and Requirements

Before starting, ensure you are familiar with:

  • Python and AnnData
  • Single-cell data analysis (see this tutorial for a primer on the subject)

You can run this tutorial locally (tested on an M3 MacBook with 32 GiB memory) or in Google Colab using a T4 instance. Environment setup will be covered in a later section.

Overview

This notebook provides a step-by-step guide to:

  1. Setting up your environment
  2. Downloading the necessary model checkpoints and h5ad dataset
  3. Performing model inference to create embeddings
  4. Visualizing the results with UMAP

Setup

Let's start by setting up dependencies. The released version of scGPT requires PyTorch 2.1.2, so we will remove the existing PyTorch installation and replace it with the required one. If you are running this in a different environment (for example, one where the required PyTorch version is already installed), this step might not be necessary.

%%capture --no-stderr

!pip freeze | grep torch
!pip uninstall -y -q torch torchvision
!pip install -q torchvision==0.16.2 torch==2.1.2

!pip install -q scgpt scanpy gdown
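Because the cell above captures its output, a failed or partial install is easy to miss. The following is a minimal sketch of a version check, not part of the original notebook; the helper names installed_version and matches_pin are illustrative, and the pinned versions are copied from the install cell.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(installed: Optional[str], pinned: str) -> bool:
    """True if `installed` equals `pinned`, ignoring a local build suffix such as '+cu121'."""
    if installed is None:
        return False
    return installed.split("+", 1)[0] == pinned


# Pins taken from the install cell above.
PINS = {"torch": "2.1.2", "torchvision": "0.16.2"}

for name, pinned in PINS.items():
    found = installed_version(name)
    status = "ok" if matches_pin(found, pinned) else "mismatch"
    print(f"{name}: installed={found}, expected={pinned} -> {status}")
```

If either line reports a mismatch, re-running the install cell without the capture magic should show the underlying pip error.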

With the dependencies installed, we can import the relevant libraries.

%%capture --no-stderr

# Import libraries

import warnings
import urllib.request
from pathlib import Path

import scgpt as scg
import scanpy as sc
import numpy as np
import pandas as pd

warnings.filterwarnings("ignore")

Download Model Checkpoints and Data

Let's download the checkpoints referenced in the scGPT repository. They are hosted in a Google Drive folder, which we fetch with gdown.

# Filter warnings

warnings.simplefilter("ignore", ResourceWarning)
warnings.filterwarnings("ignore", category=ImportWarning)

# ID of the Google Drive folder that holds the scGPT checkpoint
# (replace it only if you want to download a different checkpoint folder)
folder_id = '1oWh_-ZRdhtoGQ2Fw24HP41FgLoomVo-y'

# Download the folder and its contents recursively
!gdown --folder {folder_id}
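Before moving on, it can help to confirm that the download produced a usable checkpoint folder. The sketch below assumes the folder is named scGPT_human (the path used later in this tutorial) and that it contains args.json, best_model.pt, and vocab.json; these file names are an assumption about the checkpoint layout, and missing_checkpoint_files is a hypothetical helper, not an scGPT function.

```python
from pathlib import Path

# Assumed file names; adjust if your checkpoint folder is laid out differently.
EXPECTED_FILES = ("args.json", "best_model.pt", "vocab.json")


def missing_checkpoint_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from `model_dir`."""
    model_dir = Path(model_dir)
    return [name for name in expected if not (model_dir / name).is_file()]


missing = missing_checkpoint_files("./scGPT_human")
if missing:
    print(f"Missing from ./scGPT_human: {missing}")
else:
    print("All expected checkpoint files are present.")
```

An incomplete folder usually means the Google Drive download was interrupted or rate-limited, in which case re-running the gdown cell is the first thing to try.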

We will now download an H5AD dataset from CELLxGENE. To reduce memory usage, we will also subset it to the 3,000 most highly variable genes using scanpy's highly_variable_genes function.

%%capture --no-stderr

uri = "https://datasets.cellxgene.cziscience.com/f50deffa-43ae-4f12-85ed-33e45040a1fa.h5ad"
source_path = "source.h5ad"
urllib.request.urlretrieve(uri, filename=source_path)
adata = sc.read_h5ad(source_path)

batch_key = "sample"
N_HVG = 3000

sc.pp.highly_variable_genes(adata, n_top_genes=N_HVG, flavor='seurat_v3')
# Copy so that later steps work on a standalone object rather than a view of adata
adata_hvg = adata[:, adata.var['highly_variable']].copy()
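The cell above downloads the dataset every time it runs. If you re-run the notebook often, a small caching wrapper can skip the download when the file is already on disk. This is a sketch under the assumption that a non-empty file at the destination is a complete download; download_if_missing is a hypothetical helper name.

```python
import urllib.request
from pathlib import Path


def download_if_missing(uri, dest):
    """Download `uri` to `dest` unless a non-empty file already exists there.

    Returns True if a download happened, False if the existing file was reused.
    """
    dest = Path(dest)
    if dest.is_file() and dest.stat().st_size > 0:
        return False
    urllib.request.urlretrieve(uri, filename=str(dest))
    return True
```

In the cell above, download_if_missing(uri, source_path) could stand in for the urllib.request.urlretrieve call. Note that an interrupted download can leave a truncated file behind, so delete source.h5ad if sc.read_h5ad fails to open it.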

We can now use embed_data to generate the embeddings. Note that gene_col needs to point to the column that holds the gene symbols (not the Ensembl IDs, which CELLxGENE datasets use as the .var index). For CELLxGENE datasets, the gene symbols are stored in the feature_name column.

%%capture --no-stderr
#warnings.simplefilter("ignore", ResourceWarning)

model_dir = Path("./scGPT_human")

gene_col = "feature_name"
cell_type_key = "cell_type"

ref_embed_adata = scg.tasks.embed_data(
    adata_hvg,
    model_dir,
    gene_col=gene_col,
    obs_to_save=cell_type_key,  # optional arg, only for saving metainfo
    batch_size=64,
    return_new_adata=True,
)
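The choice of gene_col matters because genes are matched by name against the model's vocabulary, and names that do not match cannot contribute to the embedding. The sketch below shows one way to measure that overlap. It assumes the checkpoint's vocab.json is a JSON object keyed by gene name; load_vocab_genes and vocab_overlap are hypothetical helpers, and the toy identifiers starting with ENSG_placeholder are made up for illustration.

```python
import json


def load_vocab_genes(vocab_path):
    """Load gene names from a JSON file mapping gene name -> integer id (assumed layout)."""
    with open(vocab_path) as handle:
        return set(json.load(handle))


def vocab_overlap(gene_names, vocab_genes):
    """Return (number matched, fraction matched) of `gene_names` found in `vocab_genes`."""
    gene_names = list(gene_names)
    if not gene_names:
        return 0, 0.0
    matched = sum(1 for name in gene_names if name in vocab_genes)
    return matched, matched / len(gene_names)


# Toy illustration: symbols match a symbol-keyed vocabulary, other identifiers do not.
toy_vocab = {"MKI67", "LYZ", "CHGA"}
print(vocab_overlap(["MKI67", "LYZ", "CHGA"], toy_vocab))
print(vocab_overlap(["ENSG_placeholder_1", "ENSG_placeholder_2"], toy_vocab))
```

With the objects from this tutorial, the equivalent call would be vocab_overlap(adata_hvg.var[gene_col], load_vocab_genes(model_dir / "vocab.json")). A fraction close to zero is a strong hint that gene_col points at the wrong column.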

Our scGPT embeddings are stored in the .X attribute of the returned AnnData object and have a dimensionality of 512.

ref_embed_adata.X.shape
# Output:
    (11103, 512)
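If you adapt this tutorial to your own data, a small guard can catch a mismatch between the number of cells and the number of embedding rows early. The following is a sketch; check_embedding_shape is a hypothetical helper, and the default of 512 dimensions comes from the output above.

```python
def check_embedding_shape(shape, n_cells, expected_dim=512):
    """Raise ValueError if `shape` is not (n_cells, expected_dim)."""
    if len(shape) != 2:
        raise ValueError(f"expected a 2-D matrix, got shape {shape}")
    rows, cols = shape
    if rows != n_cells:
        raise ValueError(f"{rows} embedding rows for {n_cells} cells")
    if cols != expected_dim:
        raise ValueError(f"embedding dimension {cols}, expected {expected_dim}")


# Values from the output above: 11103 cells, 512 dimensions.
check_embedding_shape((11103, 512), n_cells=11103)
```

In the notebook itself, the call would be check_embedding_shape(ref_embed_adata.X.shape, adata_hvg.n_obs).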

We can now compute the neighborhood graph from the scGPT embeddings and use it to calculate a UMAP.

sc.pp.neighbors(ref_embed_adata, use_rep="X")
sc.tl.umap(ref_embed_adata)

We will now copy the embeddings and the UMAP coordinates into our original adata object, which carries the original annotations.

adata.obsm["X_scgpt"] = ref_embed_adata.X
adata.obsm["X_umap"] = ref_embed_adata.obsm["X_umap"]
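Copying the arrays across like this relies on both objects listing the cells in the same order. The sketch below shows a cheap way to confirm that before assigning; first_mismatch is a hypothetical helper that works on any two sequences of cell names.

```python
def first_mismatch(names_a, names_b):
    """Return the first index where two name sequences differ, or None if identical.

    A difference in length counts as a mismatch at the length of the shorter sequence.
    """
    names_a, names_b = list(names_a), list(names_b)
    for index, (a, b) in enumerate(zip(names_a, names_b)):
        if a != b:
            return index
    if len(names_a) != len(names_b):
        return min(len(names_a), len(names_b))
    return None


# Toy illustration with made-up cell barcodes.
print(first_mismatch(["cell_1", "cell_2"], ["cell_1", "cell_2"]))
print(first_mismatch(["cell_1", "cell_2"], ["cell_2", "cell_1"]))
```

With the objects from this tutorial, first_mismatch(adata.obs_names, ref_embed_adata.obs_names) should return None; any other value means the rows need to be realigned before the assignment.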

We can also switch our .var index, which is currently set to Ensembl IDs, to gene symbols, allowing us to plot gene expression more easily.

# Add the current index ('ensembl_id') as a new column
adata.var['ensembl_id'] = adata.var.index

# Set the new index to the 'feature_name' column
adata.var.set_index('feature_name', inplace=True)
# Add a copy of the gene symbols back to the var dataframe
adata.var['gene_symbol'] = adata.var.index
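Gene symbols are not guaranteed to be unique, and a duplicated index entry makes lookups by gene name ambiguous. The following sketch lists duplicated names so you can decide how to handle them; duplicated_names is a hypothetical helper, and the toy symbols are for illustration only.

```python
from collections import Counter


def duplicated_names(names):
    """Return a dict of name -> count for every name that occurs more than once."""
    counts = Counter(names)
    return {name: count for name, count in counts.items() if count > 1}


# Toy illustration: 'LYZ' appears twice.
print(duplicated_names(["MKI67", "LYZ", "CHGA", "LYZ"]))
```

In the notebook, duplicated_names(adata.var_names) can be run after the index switch. If it returns anything, AnnData's var_names_make_unique method is one way to disambiguate the entries.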

We can now plot a UMAP, coloring it by cell type to visualize our embeddings. Below, we color by both the standard cell type labels provided by CELLxGENE and the original cell type annotations from the authors. The embeddings generated by scGPT effectively capture the structure of the data, closely aligning with the original author annotations.

with warnings.catch_warnings():
    warnings.filterwarnings("ignore")
    sc.pl.umap(adata, color=["cell_type", "annotation_res0.34_new2"], wspace=0.6)

[Figure: UMAP plot colored by cell type and annotation]

We can also take a look at some markers of the major cell types represented in the dataset.

sc.pl.umap(
    adata,
    color=['cell_type', 'MKI67', 'LYZ', 'RBP2', 'MUC2', 'CHGA', 'TAGLN', 'ELAVL3'],
    frameon=False,
    use_raw=False,
    legend_fontsize="xx-small",
    legend_loc="none",
)

[Figure: Major cell types and markers]
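When you reuse this plot on another dataset, some marker genes may be absent from the gene index, which makes the plotting call fail. A minimal sketch for filtering the marker list first is shown below; split_by_presence is a hypothetical helper, and the toy gene index is made up.

```python
def split_by_presence(requested, available):
    """Split `requested` into (present, missing) relative to `available`, keeping order."""
    available = set(available)
    present = [gene for gene in requested if gene in available]
    missing = [gene for gene in requested if gene not in available]
    return present, missing


# Marker genes from the plot above, checked against a toy gene index.
markers = ["MKI67", "LYZ", "RBP2", "MUC2", "CHGA", "TAGLN", "ELAVL3"]
toy_var_names = ["MKI67", "LYZ", "MUC2", "CHGA"]
present, missing = split_by_presence(markers, toy_var_names)
print("present:", present)
print("missing:", missing)
```

In the notebook, split_by_presence(markers, adata.var_names) gives the genes that are safe to plot, and the call becomes sc.pl.umap(adata, color=["cell_type"] + present, ...). The cell_type entry is kept out of the check because it is a column in .obs rather than a gene.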

References

Please refer to the following papers for information about:

scGPT: Toward building a foundation model for single-cell multi-omics using generative AI

Cui, H., Wang, C., Maan, H. et al. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nat Methods 21, 1470–1480 (2024). https://doi.org/10.1038/s41592-024-02201-0

The dataset used in this tutorial

Moerkens, R., Mooiweer, J., Ramírez-Sánchez, A. D., Oelen, R., Franke, L., Wijmenga, C., Barrett, R. J., Jonkers, I. H., & Withoff, S. (2024). An iPSC-derived small intestine-on-chip with self-organizing epithelial, mesenchymal, and neural cells. Cell Reports, 43(7). https://doi.org/10.1016/j.celrep.2024.114247

CELLxGENE Discover and Census

CZI Single-Cell Biology, et al. CZ CELLxGENE Discover: a single-cell data platform for scalable exploration, analysis and modeling of aggregated data. bioRxiv 2023.10.30.563174 (2023). https://doi.org/10.1101/2023.10.30.563174

Responsible Use

We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when engaging with our services.

Should you have any security or privacy issues or questions related to the services, please reach out to our team at security@chanzuckerberg.com or privacy@chanzuckerberg.com respectively.