
Probabilistic modeling and analysis of single-cell omics data in rare disease
Learning Goals
This tutorial will guide you on how to leverage the scVI model and embeddings to analyze a rare disease dataset. Specifically, you will learn how to:
- Access the model: we will use the scVI model, trained on 44 million cells from the CZ CELLxGENE Census.
- Process your data through the model, generating embeddings that allow for comparisons between your dataset and the CZ CELLxGENE Census.
- Search for similar cells within the CZ CELLxGENE data corpus to find reference data for further exploration and validation.
Pre-requisites and requirements
This tutorial assumes a basic understanding of Python, AnnData, and single-cell data analysis (see this tutorial for a primer on the subject). You can run this tutorial locally depending on your hardware. It was originally run on an M3 MacBook with 32 GiB of memory.
If running locally, environment setup will be addressed in a later section.
Overview
Below is a detailed overview of what we will cover:
- Environment Setup and Quick Start
- Rare Disease Case Study
- Vignette: Using scVI models:
- Project rare disease data into the scVI embedding space and visualize it.
- Find similar cells in the CZ CELLxGENE corpus based on scVI embeddings.
- Explore standard cell type annotations for the discovered cells.
Introduction
In this tutorial, we will explore how to use probabilistic modeling (in this example, scVI trained on the 44 million cells from CZ CELLxGENE) to analyze rare disease data. The study of rare diseases presents unique challenges, including limited sample sizes, heterogeneous datasets, and complex biological mechanisms. However, with increasingly performant models and access to large amounts of high-quality, standardized data, we can now dive deeper into these complex datasets than ever before.
Models like scVI leverage large datasets to generate compact, informative representations of gene expression profiles, allowing us to interrogate biology across multiple dimensions while correcting for technical noise. These models enable dataset harmonization at scale, making it possible to compare samples across extensive data corpora composed of datasets from diverse sources and designs.
Quick-Start
If you want to jump directly into the code, please refer to the quick start tutorial for scVI.
Setup
Step 1: Colab (Recommended)
We recommend running this tutorial via colab since most of the environment set up will be taken care of for you by running this notebook.
To start, connect to the T4 GPU
runtime hosted for free by Google Colab using the dropdown menu in the upper right hand corner of this notebook.
Note that this tutorial will use commands written for Google Colab, and some of those commands may need to be modified to work with other computing setups.
You can check what version of python is running in the Colab environment by running the following command.
!python --version
To get started with this tutorial, you’ll need to set up the appropriate environment and download the necessary resources.
You can access the notebook for this tutorial in the following location. Additionally, you can download the rare disease dataset we’ll be analyzing from this link.
Step 1A: Local (Alternative)
We recommend setting up the tutorial environment with Conda to manage dependencies easily. If you don’t already have Conda installed, follow the installation instructions on this page. For this tutorial, the Miniconda distribution will work well.
Once Conda is installed, you can create and activate a Python 3.11-based Conda environment using the following bash (terminal) commands:
conda create -n "vcp-env" python=3.11
conda activate vcp-env
Additionally, you will need to download the necssary resources to run the tutorial locally. You can access the notebook for this tutorial in the following location. Additionally, you can download the rare disease dataset we’ll be analyzing from this link
Step 2: Install requirements
Run the following command to install the required packages (you may see some errors related to pip's dependency resolvers but this should not impact your ability to run the tutorial):
!pip install scvi-tools==1.2.0 tiledbsoma==1.14.4 'cellxgene_census[experimental]'==1.16.2 scanpy==1.10.3 jupyterlab jupyter ipywidgets
Step 3: Download data
If you aren't running this tutorial in colab, you can download the AnnData file for this tutorial here. Otherwise, we will download the dataset in the next code cell.
!gdown --fuzzy https://drive.google.com/file/d/13v6fuGqdZvKeUp-XcWUqA7hWiE8Hiv7y/view?usp=sharing
In the output of the code above, colab will specify the location name of the downloaded file. In case you want to ensure that file has been downloaded, you list out the contents of our current working directory.
!ls
# Output:
sample_data test_anndata.h5ad 'view?usp=sharing'
Step 4: Check environment
To verify that your environment is correctly configured:
- Launch the notebook (if running locally):
jupyter lab ./vcp-tutorial-scvi.ipynb
- Run import statements (below) to ensure all dependencies are correctly installed. Note that there are some optional imports that might be necessary depending on your compute environment (if running this notebook within an HPC environment)
#utils
import os,sys
import functools
import gc
import warnings
import pprint
import yaml
import json
warnings.filterwarnings('ignore')
#scvi-tools, CZ CELLxGENE Census and TileDB SOMA
import cellxgene_census
from cellxgene_census.experimental import get_embedding
import tiledbsoma as soma
#import tiledb.cloud - optional may be need for some HPC and cluster environments
import scvi
#single cell
import scanpy as sc
import anndata as ad
#scientific computing
import numpy as np
import pandas as pd
import scipy.sparse as sp
from sklearn.ensemble import RandomForestClassifier
#plotting
import seaborn as sn
import matplotlib.pylab as plt
# Configure Global Variables
## Set latest Census Version
CENSUS_VERSION = "2024-07-01"
# structure/filter notebook output
pp = pprint.PrettyPrinter(indent=4)
%matplotlib inline
Use Case
Challenges in Studying Rare Disease
Rare diseases present unique challenges—small patient populations, variable disease presentation, and limited research—making diagnosis and treatment difficult. However, recent advances in AI and large-scale biological datasets offer new ways to overcome these barriers.
Key Challenges
1. Limited Data & Expertise
- Small patient pools restrict clinical trials and research and scarcity of data limits AI model performance.
- Many clinicians lack experience, leading to diagnostic delays and inconsistent therapeutic strategy.
2. Variable Clinical Presentations
- Disease presentation, symptoms, and severity differ widely between patients
3. Research Gaps
- Many rare diseases remain understudied with limited treatment options.
- AI models trained on common diseases struggle to generalize due to insufficient rare disease data.
Strategy: Models and Embeddings to Find Similar Diseases
Although generating rare disease data is challenging, AI models have become powerful tools for uncovering patterns and relationships in biological datasets. Let’s build an intuition for how these models work and what they can achieve in rare disease research.
What Are Models and Embeddings?

Models
Think of a model as a tool for recognizing patterns and relationships in complex biological data. During training, the model looks at data-like gene expression levels across different cells-and learns which features (e.g., specific genes) are important for distinguishing cell types. Once trained, the model acts as a translator, converting new raw data into a structured format, called an embedding, that reflects the relationships it has learned.
This means that even new datasets, like rare disease samples, can be fed through the model to align them within the same embedding space as the original data. This alignment makes it easier to compare and discover biological similarities between rare disease samples and well-studied reference datasets.
Embeddings
Embeddings are like coordinates on a map. The model uses them to represent each cell or patient as a point in a simplified space, where similar cell types or patients end up closer together. For example, if two cells are biologically similar, their embeddings will be closer in this space, just as two nearby cities are close on a map.
Why Use Models and Embeddings for Rare Diseases?
Embeddings simplify complex biological data, such as gene expression profiles, into a format where relationships between cells, patients, or conditions are easier to spot. Here's how that helps:
- Comparing Rare Disease Data: Even with limited patient populations, embeddings allow us to compare patients with similar biological profiles across datasets, even if they have different diagnoses.
- Discovering Shared Mechanisms: By grouping patients or cells with similar embeddings, we can uncover shared disease mechanisms and identify subgroups that might respond to similar treatments.
- Transferring Knowledge Across Datasets: Since trained models “remember” patterns from the data they were trained on, we can apply them to new datasets. This allows researchers to efficiently compare rare disease data with larger reference datasets to find biological similarities and potential treatments.
Overview of Diffuse Midline Glioma (DMG)
In this tutorial, we will use models and embeddings to study Diffuse Midline Gliomas (DMGs), aggressive brain tumors that predominantly affect children. These tumors are often linked to a replacement of lysine with methionine at position 27 in histone 3 (H3-K27M), which disrupts normal brain cell development.
DMGs typically arise in midline brain structures, with tumor locations varying by age:
- Pons: Most common in younger children (3-7 years).
- Thalamus: More frequent in older children (7-12 years).
The H3-K27M mutation is thought to interfere with oligodendrocyte development. leading to the accumulation of oligodendrocyte precursor cells (OPCs) within the tumor. Unfortunately, DMGs have limited treatment options and are associated with a poor prognosis.
Challenges in Studying DMG and the Role of Single-Cell Sequencing
Studying DMGs presents unique challenges due to small patient populations and limited data, but single-cell sequencing provides critical insights:
- Identify cell types: Distinguishes between malignant and non-malignant cells and reveals the tumor’s heterogeneity.
- Track developmental programs: Uncovers how disrupted developmental pathways drive tumor progression.
- Pinpoint therapeutic targets: Highlights specific genes and pathways for potential treatments.
Run Model Inference
Load Rare Disease Data
Let’s put everything into practice! We’ll begin by inspecting our data, which comes from the study 'The landscape of tumor cell states and spatial organization in H3-K27M mutant diffuse midline glioma across age and location' by Liu et al., 2022. This dataset contains single-cell transcriptomic data (SMART-seq) from tumors of patients with diffuse midline glioma.
We’ll begin by loading our data into an AnnData object. In our dataset, named adata, the raw gene expression count matrix will be stored in adata.X, and the adata.var index will contain features labeled as Ensembl Gene IDs.
adata = sc.read('./raw-data/filbin-2022/test_anndata.h5ad') # Replace './raw-data/...' with the path where your data is saved
#display the adata object
adata
# Output:
AnnData object with n_obs × n_vars = 4417 × 20348
obs: 'sample', 'age', 'location', 'annotation', 'original_id', 'batch'
var: 'modality', 'gene_symbol'
Key Information in Our AnnData Object
Here’s a quick breakdown of the important information contained in the AnnData object:
- AnnData Structure
- X: Gene expression matrix (4417 cells × 20,348 genes) - note that we need to run the raw counts through the scVI model
- obs: Cell-level metadata (4417 observations/cells)
- Key annotations in the dataset:
- sample: Donor ID
- age: Pediatric or adult
- location: Pontine or thalamic
- annotation: Cell type or state
- Key annotations in the dataset:
Preparing the Data for the scVI Model
Next, we’ll prepare our AnnData object to interact with the scVI model. This includes calculating the number of detected counts per cell and making some structural adjustments to ensure compatibility with the model. Additionally, the model requires a batch field, as it was trained on the CZ CELLxGENE Data Corpus with batch conditioning. To proceed, we’ll set the batch field to unassigned for the forward pass through the model.
adata.obs["n_counts"] = adata.X.sum(axis=1)
adata.obs["joinid"] = list(range(adata.n_obs))
# initialize the batch to be unassigned. This could be any dummy value.
adata.obs["batch"] = "unassigned"
adata.obs['disease'] = "diffuse midline glioma" # add a general disease annotation to our data
Calling the scVI Model and Generating Embeddings
To load our model, we’ll use the CZ CELLxGENE Census API to locate and download the trained scVI model. Before making an API call, we can specify the Census version.
The CZ CELLxGENE Census is continuously updated, with long-term stable (LTS) releases published every six months. Each LTS release includes an updated version of the scVI model, retrained on newly added data since the previous release. For this tutorial, we’ll set the Census version to the latest LTS release available at the time of writing.
# We previously set this variable with our import statements
print(CENSUS_VERSION)
# Output:
2024-07-01
We can retrieve the URI where the model is stored using the get_embedding_metadata_by_name function from the experimental release of the Census.
We can retrieve the URI where the model is stored using the get_embedding_metadata_by_name method from the experimental release of the Census.
# Retrieve scVI model from CZ CELLxGENE Census API
with cellxgene_census.open_soma(census_version=CENSUS_VERSION) as census:
census = cellxgene_census.open_soma(census_version=CENSUS_VERSION)
scvi_info = cellxgene_census.experimental.get_embedding_metadata_by_name(
embedding_name="scvi",
organism="homo_sapiens",
census_version=CENSUS_VERSION,
)
scvi_info["model_link"]
Output:
's3://cellxgene-contrib-public/models/scvi/2024-07-01/homo_sapiens/model.pt'
We’ll use the following magic commands to create a directory for storing the model object and download it. If wget is not already installed, you can install it or use curl as an alternative.
## download model
!mkdir -p scvi-human-2024-07-01
!wget --no-check-certificate -q -O scvi-human-2024-07-01/model.pt https://cellxgene-contrib-public.s3.us-west-2.amazonaws.com/models/scvi/2024-07-01/homo_sapiens/model.pt
Once the model has been downloaded, we can retrieve the list of features used to train it. It is important to note that Ensembl IDs were used as features instead of their corresponding gene symbols. Therefore, you should ensure that the index of the var DataFrame consists of Ensembl IDs rather than gene symbols. (Note: You can verify this by running: adata.var.head
or adata.var.index
)
scvi.model.SCVI.prepare_query_anndata(adata,
"scvi-human-2024-07-01",
return_reference_var_names=True)
# Output:
Index([
'ENSG00000000005', 'ENSG00000000971', 'ENSG00000001167',
'ENSG00000001497', 'ENSG00000001626', 'ENSG00000001629',
'ENSG00000002726', 'ENSG00000002746', 'ENSG00000002933',
'ENSG00000003137',
...
'ENSG00000288553', 'ENSG00000285793', 'ENSG00000285837',
'ENSG00000285859', 'ENSG00000288703', 'ENSG00000288637',
'ENSG00000288696', 'ENSG00000288714', 'ENSG00000288694',
'ENSG00000285968'
], dtype='object', length=8000)
adata.var.head()
Output:
modality gene_symbol
ensembl_id
ENSG00000121410 Gene Expression A1BG
ENSG00000268895 Gene Expression A1BG-AS1
ENSG00000148584 Gene Expression A1CF
ENSG00000175899 Gene Expression A2M
ENSG00000245105 Gene Expression A2M-AS1
In this step, we’ll prepare and register our data with the scVI model. Specifically, we’ll use the scVI function prepare_query_anndata to align the feature space of our query AnnData with the one used to train the model. The model’s feature space is based on highly variable genes identified across the CZ CELLxGENE Data Corpus.
Once the alignment is complete, we’ll perform a forward pass through the model to generate the latent representation (also known as embeddings). This step gives us a new representation of our query dataset, now “informed” by the entire CELLxGENE corpus.
After obtaining the embeddings, we’ll store them in our data object for further analysis and visualization. Importantly, even if the feature spaces between the model and the query AnnData don’t fully align, we can still run the model and generate meaningful embeddings.
scvi.model.SCVI.prepare_query_anndata(adata,
"scvi-human-2024-07-01",
inplace=True)
vae_q = scvi.model.SCVI.load_query_data(
adata,
"scvi-human-2024-07-01",
)
# This allows for a simple forward pass
vae_q.is_trained = True
latent = vae_q.get_latent_representation()
adata.obsm["scvi"] = latent
We can inspect the dimensions of our new embeddings. This representation of the data can be treated as a set of features, analogous to a PCA representation, allowing it to serve as input for standard clustering and visualization algorithms. Note that the dimensionality of our scVI embedding space is 50, which is much smaller than the original gene expression space.
adata.obsm['scvi'].shape
Output:
(4417, 50)
Analysis of Model Outputs
We can clean up our object a little and normalize the raw counts in our object. Will perform this normalization so that we can visualize gene expression in relation to our embedding space (remember that we ran our raw count data through the scVI model to generate)
# filter out missing features
adata = adata[:, adata.var["gene_symbol"].notnull().values].copy()
adata.var.set_index("gene_symbol", inplace=True)
#normalize
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
Here we can calculate neighbors and create a visualization based on the scVI
embedding that we have calculated. We will store that as an additional layer in the obsm
slot of our AnnData object.
# Calculate neighbors based on scvi space
sc.pp.neighbors(adata, n_neighbors=20, use_rep="scvi")
# Calculate UMAP
umap_scvi = sc.tl.umap(adata, copy = True, n_components = 2)
# Store embedding in our original object
adata.obsm['X_umap_scvi'] = umap_scvi.obsm['X_umap']
We can plot the UMAP based on our scVI
embeddings and color it by the author’s cell type annotations to gain a qualitative sense of how well our embedding captures variation within this dataset. The major cell types are well-separated (e.g., microglia and oligodendrocytes are clearly distinct from the malignant cells, which have been annotated with different cancer-driving programs). Even within the malignant cells, we observe good separation among the major oncogenic programs identified previously by the authors.
sc.pl.umap(umap_scvi, color="annotation")

We can plot the scVI
embedding and overlay additional categorical metadata fields to explore potential relationships between cell-level annotations and other covariates in the data. It is important to consider the sample distribution in our visualization. Notably, there is good mixing between the different donors in this dataset, which suggests that the scVI
representation captures relevant biological variation without overemphasizing technical variation.
umap_scvi.obs.columns # We can see what categorical metadata are available to overlay
Output:
Index(['sample', 'age', 'location', 'annotation', 'original_id', 'batch',
'n_counts', 'joinid', 'disease', '_scvi_batch', '_scvi_labels'],
dtype='object')
sc.pl.umap(umap_scvi, color = ["sample", "location", "age", "annotation"])

Visualizing Gene Expression Signatures of Distinct Oncogenic Programs Identified by Filbin et al., 2019
Using scVI based UMAP visualization, we can also plot gene expression across the embeddings to confirm that the organization of cell subpopulations in embedding space reflects their transcriptional identities. These subpopulations align with distinct oncogenic programs identified in malignant cells by the authors of the original study. In that paper, the authors identified that disregulation of OPC development resulted in four different cancer driving programs. These were transcriptionally identified by the presence of marker genes that made the malignant cells similar to OPCs, mature oligodendrocytes, mature astrocytes, and proliferating cells. You can refer to the original study in the references section below for more information. Without any fine-tuning of the model, we can directly leverage these embedded representations to uncover nuanced structures within this rare disease dataset.
with warnings.catch_warnings():
warnings.filterwarnings("ignore")
sc.pl.umap(umap_scvi, color=["PDGFRA", "CCND1", "ASCL1"]) # oligodendrocyte precursor cell like oncogeneic program
sc.pl.umap(umap_scvi, color=["CDK1", "RRM2", "TOP2A"]) # cell cyle oncogenic program
sc.pl.umap(umap_scvi, color=["VIM", "GFAP", "APOE"]) # astrocytic differentiation like oncogenic program
sc.pl.umap(umap_scvi, color=["PLP1", "MBP", "TF"]) # oligodendrocytic differentiation like oncogenic program




Search For Similar Cells in the CZ CELLxGENE Census.
Now that we have generated embeddings using our scVI model, we can leverage this representation to find similar cells within the entire CZ CELLxGENE Census. Every cell in the CZ CELLxGENE Census Corpus has an associated scVI embedding, as all cells have been processed through the trained model. By calculating distances in this embedding space, we can identify similar cells (neighbors) across the CZ CELLxGENE data corpus. Note that in the function below, we can specific the number of neighbors that want to identify from the census for each of our query cells. Note that this search is a stochastic process, the neighbors that are found for each of our query cells will slightly vary between different runs. This may result in a slightly different number of total neighbors identified between runs (lists of neighbors between query cells may overlap and contain some of the same neighbor IDs which will be slightly different between runs).
%%time
neighbors = cellxgene_census.experimental.find_nearest_obs(
"scvi", "homo_sapiens", CENSUS_VERSION, query=adata, k=30, memory_GiB=8, nprobe=20
)
After identifying the neighbors, we will receive an object containing: (1) a list of k neighbors for each query cell, and (2) a set of k distances corresponding to each query cell and its neighbors.
neighbors
# Output:
NeighborObs(
distances=array([
[1.1933558, 1.2045343, 1.2085202, ..., 1.3784499, 1.3905808, 1.3944349],
[0.41559666, 0.45637673, 0.4766558, ..., 0.88151574, 0.904114, 0.91997015],
[1.6880848, 1.6880848, 1.6880848, ..., 2.508525, 2.5415115, 2.5415115],
...,
[3.41161, 3.9026787, 4.0605674, ..., 5.122192, 5.1230273, 5.1458583],
[1.9467123, 2.1676705, 2.5587058, ..., 3.0402641, 3.052857, 3.056239],
[4.4184947, 4.513351, 4.539205, ..., 5.621504, 5.6443577, 5.6443577]
], dtype=float32),
neighbor_ids=array([
[52848883, 65447037, 42721641, ..., 33924365, 52830607, 52856140],
[52359398, 52364496, 45183091, ..., 52753928, 45403307, 45370154],
[1054738, 18590796, 69795503, ..., 18590345, 18590834, 1055340],
...,
[22806387, 52345854, 13749428, ..., 13345170, 52348758, 28068487],
[39312362, 28808726, 17383987, ..., 28612799, 39420984, 28674769],
[52480202, 33066985, 3587018, ..., 41861861, 23302045, 50091838]
], dtype=uint64)
)
Retrieve Expression Data of Similar Cells from CZ CELLxGENE Census
With our list of similar cells from the Census, we can construct an AnnData object containing their expression data. This dataset can serve as a reference for comparison or help identify complete datasets that could be used as a reference for our DMG data.
In the query below, we can also specify which categorical metadata to include in the downloaded dataset.
%%time
with cellxgene_census.open_soma(census_version=CENSUS_VERSION) as census:
neighbors_adata = cellxgene_census.get_anndata(
census,
"homo_sapiens",
"RNA",
obs_coords=sorted(neighbors.neighbor_ids[:, 0].tolist()),
obs_embeddings=["scvi"],
X_name="normalized",
column_names={"obs": ["soma_joinid", "tissue", "tissue_general", "cell_type", "disease", "dataset_id"]},
)
neighbors_adata.var_names = neighbors_adata.var["feature_id"]
We can inspect the returned AnnData object and observe the following information:
.X
: Gene expression raw counts for 2,764 cells × 60,530 genes (the number of genes corresponds to the full feature set across the entire CZ CELLxGENE corpus; filtering on expression would remove many of these).obs
: A set of standard metadata from the CZ CELLxGENE data corpus.obsm
: Contains the scVI embeddings.var
: Contains our feature annotations.
neighbors_adata
Output:
AnnData object with n_obs × n_vars = 2762 × 60530
obs: 'soma_joinid', 'tissue', 'tissue_general', 'cell_type', 'disease', 'dataset_id'
var: 'soma_joinid', 'feature_id', 'feature_name', 'feature_length', 'nnz', 'n_measured_obs'
obsm: 'scvi'
Let's calculate a UMAP visualization based on the pre-generated scVI embeddings
# Calculate neighbors based on scvi space
sc.pp.neighbors(neighbors_adata, n_neighbors=10, use_rep="scvi")
# Calculate UMAP
neighbors_adata = sc.tl.umap(neighbors_adata, copy = True, n_components = 2)
# Removing unused categories from categorical metadata for plotting convenience
neighbors_adata.obs["cell_type"] = neighbors_adata.obs["cell_type"].cat.remove_unused_categories()
neighbors_adata.obs["tissue"] = neighbors_adata.obs["tissue"].cat.remove_unused_categories()
neighbors_adata.obs["tissue"] = neighbors_adata.obs["tissue_general"].cat.remove_unused_categories()
Let's visualize the tissue of origin for our neighbors. It looks like the majority of our data is derived from the brain, which makes sense given that DMG is a brainstem tumor. We can list out the additional tissues represented and indeed see that despite a long tail, most of the neighbors returned are from the brain. The embryo data might be interesting too as DMG is a disease which has a developmental component to it.
sc.pl.umap(neighbors_adata, color="tissue_general")

Figure: UMAP visualization of scVI embeddings.
neighbors_adata.obs['tissue'].value_counts()
Output:
count
tissue
brain 1897
embryo 131
heart 123
breast 85
small intestine 83
reproductive system 56
colon 38
kidney 38
eye 37
lung 35
skin of body 29
adipose tissue 22
placenta 19
uterus 18
blood 16
bone marrow 14
liver 11
digestive system 9
pancreas 9
spinal cord 9
large intestine 8
spleen 8
lymph node 7
fallopian tube 7
esophagus 6
ovary 6
pleural fluid 5
stomach 5
exocrine gland 4
endocrine gland 4
musculature 4
central nervous system 3
intestine 3
respiratory system 3
nose 2
mucosa 2
prostate gland 2
pleura 2
tongue 1
vasculature 1
dtype: int64
We can also list the cell types of the identified neighbors. Interestingly, it appears that we’ve identified a group of malignant cells as neighbors to our DMG sample. Additionally, we’ve identified some oligodendrocytes and oligodendrocyte precursor cells among the neighbors. This is a promising result, as it indicates that our search is capturing relevant cell types. DMG is associated with the dysregulation of the development from oligodendrocyte precursor cells to mature oligodendrocytes.
neighbors_adata.obs['cell_type'].value_counts()
Output:
count
cell_type
malignant cell 972
oligodendrocyte precursor cell 637
oligodendrocyte 119
fibroblast 82
neuron 78
... ...
basal cell of prostate epithelium 1
ovarian surface epithelial cell 1
pancreatic stellate cell 1
parietal epithelial cell 1
secretory cell 1
128 rows × 1 columns
dtype: int64
We can examine how many datasets these cells originate from using the dataset_id
metadata field. It appears that a specific dataset matches well with ours, so let’s proceed and download that dataset.
neighbors_adata.obs['dataset_id'].value_counts()
Output:
count
dataset_id
56c4912d-2bae-4b64-98f2-af8a84389208 1124
ae4f8ddd-cac9-4172-9681-2175da462f2e 189
c888b684-6c51-431f-972a-6c963044cef0 150
b62f728b-a7eb-4889-8162-b3d457de93b9 131
1a38e762-2465-418f-b81c-6a4bce261c34 82
... ...
2e5a9b5d-d31b-4e9f-a179-d5d70ba459fb 0
8b2e5453-faf7-46ea-9073-aea69b283cb7 0
8a8aedcb-5bb3-453d-a9f0-f37951ae1515 0
8a554710-08bc-4005-87cd-da9675bdc2e7 0
2c820d53-cbd7-4e0a-be7a-a0ad1989a98f 0
678 rows × 1 columns
dtype: int64
cxg_data_id = neighbors_adata.obs['dataset_id'].value_counts().index[0]
We can download the original dataset utilizing the download_source_h5ad
fuction. Heads Up! The file we are about to download is 10.5 Gb.
#function below not run for this tutorial
#cellxgene_census.download_source_h5ad(cxg_data_id, to_path="./ref_data.h5ad") - large file 10.5GiB
Alternatively we can also simply grab the metadata for this dataset ID which will be faster to download and we can explore some of the metadata in the dataset.
# Grabbing only metadata
with cellxgene_census.open_soma() as census:
# Reads SOMADataFrame as a slice
cell_metadata = census["census_data"]["homo_sapiens"].obs.read(
value_filter = "dataset_id == '56c4912d-2bae-4b64-98f2-af8a84389208'",
column_names = ["assay", "cell_type", "tissue", "tissue_general", "suspension_type", "disease"]
)
# Concatenates results to pyarrow.Table
cell_metadata = cell_metadata.concat()
# Converts to pandas.DataFrame
cell_metadata = cell_metadata.to_pandas()
print(cell_metadata.head())
Output:
The "stable" release is currently 2024-07-01. Specify 'census_version="2024-07-01"' in future calls to open_soma() to ensure data consistency.
INFO:cellxgene_census:The "stable" release is currently 2024-07-01. Specify 'census_version="2024-07-01"' in future calls to open_soma() to ensure data consistency.
assay cell_type tissue tissue_general \
0 microwell-seq malignant cell left temporal lobe brain
1 microwell-seq malignant cell left temporal lobe brain
2 microwell-seq macrophage left temporal lobe brain
3 microwell-seq malignant cell left temporal lobe brain
4 microwell-seq malignant cell left temporal lobe brain
suspension_type disease dataset_id
0 cell glioblastoma 56c4912d-2bae-4b64-98f2-af8a84389208
1 cell glioblastoma 56c4912d-2bae-4b64-98f2-af8a84389208
2 cell glioblastoma 56c4912d-2bae-4b64-98f2-af8a84389208
3 cell glioblastoma 56c4912d-2bae-4b64-98f2-af8a84389208
4 cell glioblastoma 56c4912d-2bae-4b64-98f2-af8a84389208
Super exciting! All of the cells in this dataset are annotated as coming from glioblastoma samples. This is fairly interesting because a similar mechanism of pathogenesis, dependent upon incorrect development of oligodendrocytes precursor cells, has been suggested to be the a root cause of glioblastoma. This dataset actually ended up being the Harmonized single-cell landscape, intercellular crosstalk and tumor architecture of glioblastoma[Harmonized single-cell landscape, intercellular crosstalk and tumor architecture of glioblastoma] created by the CZI Neurodegeneration Challenge Network.
cell_metadata['disease'].value_counts()
Output:
count
disease
glioblastoma 1135677
non-specific interstitial pneumonia 0
non-small cell lung carcinoma 0
non-compaction cardiomyopathy 0
neuroendocrine carcinoma 0
... ...
cataract 0
cardiomyopathy 0
breast carcinoma 0
breast cancer 0
congenital heart disease 0
109 rows × 1 columns
dtype: int64
Summary
In this tutorial, we successfully leveraged a pretrained scVI model trained on ~44 million cells from the CZ CELLxGENE corpus. We passed our own data through the model to generate embeddings that accurately recapitulated previously reported biological structure. Through similarity search, we identified a relevant reference dataset, which opens the door for further exploration and analysis.
As potential next steps, you could:
- Train a classifier using the identified reference dataset.
- Perform data integration to combine your dataset with other relevant datasets.
- Fine-tune another foundation model to tackle specific tasks or new objectives.
This workflow demonstrates the power of foundation models in single-cell analysis, allowing for rapid insights and seamless integration of new datasets with previously established biological knowledge.
Contact and acknowledgements
We acknowledge the developers of the scVI model Can Ergen and Nir Yosef, and the CZ CELLxGENE team for training this specific version of the model. For questions about the model, you can contact the CZ CELLxGENE team via cellxgene@chanzuckerberg.com.
References
Please refer to the following papers for more information about:
Development of the scVI model:
Lopez, R., Regier, J., Cole, M.B. et al. Deep generative modeling for single-cell transcriptomics. Nat Methods 15, 1053–1058 (2018). https://doi.org/10.1038/s41592-018-0229-2
Training data for the scVI model via CZ CELLxGENE Discover and Census API
CZ CELLxGENE Discover: A single-cell data platform for scalable exploration, analysis and modeling of aggregated data CZI Single-Cell Biology, et al. bioRxiv 2023.10.30; doi: https://doi.org/10.1101/2023.10.30.563174
Diffuse midline gliomas and the example dataset that was used in this tutorial:
Liu, I., Jiang, L., Samuelsson, E.R. et al. The landscape of tumor cell states and spatial organization in H3-K27M mutant diffuse midline glioma across age and location. Nat Genet 54, 1881–1894 (2022). https://doi.org/10.1038/s41588-022-01236-3
An explanation of the oncogenic programs discovered to contribute to diffuse midline glioma:
Filbin, M. G., Tirosh, I., Hovestadt, V., Shaw, M. L., Escalante, L. E., Mathewson, N. D., Neftel, C., Frank, N., Pelton, K., Hebert, C. M., Haberler, C., Yizhak, K., Gojo, J., Egervari, K., Mount, C., van Galen, P., Bonal, D. M., Nguyen, Q. D., Beck, A., Sinai, C., … Suvà, M. L. (2018). Developmental and oncogenic programs in H3K27M gliomas dissected by single-cell RNA-seq. Science (New York, N.Y.), 360(6386), 331–335. https://doi.org/10.1126/science.aao4750
Neighbor search based on TileDB Vector Search
Norton, I. (2023, August 1). TileDB 101: Vector search. TileDB. Retrieved November 14, 2024, from https://tiledb.com/blog/tiledb-101-vector-search
Responsible Use
We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when engaging with our services.
Should you have any security or privacy issues or questions related to the services, please reach out to our team at security@chanzuckerberg.com or privacy@chanzuckerberg.com respectively.