Comparison Workflow

Last Updated: January 27, 2025

The Comparison Workflow integrates your data with the CELLxGENE Census reference atlas.

Overview

The Comparison Workflow enhances your dataset by:

Reference Selection - Filter CELLxGENE Census by tissue, disease, etc.
Similarity Matching - Find reference cells similar to your data
Data Integration - Combine your cells with selected reference cells
Joint Analysis - Perform dimensionality reduction on combined dataset
Visualization - Generate UMAP with query and reference cells labeled
Classification - Predict cell types using reference annotations

Requirements

Human or mouse data only - Other organisms not supported
scVI or TranscriptFormer model - PCA not compatible with comparison
Gene overlap - Sufficient genes must match CELLxGENE Census
Compatible format - Standard H5AD requirements apply
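
The requirements above can be checked before a run is submitted. The following is a minimal sketch, not the workflow's actual validation code: the function names and the 50% overlap threshold are hypothetical, since the source only says that "sufficient" genes must match.

```python
# Hypothetical pre-flight check; names and threshold are illustrative assumptions.
SUPPORTED_ORGANISMS = {"human", "mouse"}
SUPPORTED_MODELS = {"scVI", "TranscriptFormer"}


def gene_overlap_fraction(query_genes, reference_genes):
    """Return the fraction of query genes that also appear in the reference."""
    query = set(query_genes)
    if not query:
        return 0.0
    return len(query & set(reference_genes)) / len(query)


def check_requirements(organism, model, query_genes, reference_genes, min_overlap=0.5):
    """Collect human-readable problems; an empty list means every check passed."""
    problems = []
    if organism.lower() not in SUPPORTED_ORGANISMS:
        problems.append(f"unsupported organism: {organism}")
    if model not in SUPPORTED_MODELS:
        problems.append(f"unsupported model: {model}")
    overlap = gene_overlap_fraction(query_genes, reference_genes)
    if overlap < min_overlap:  # min_overlap is an assumed cutoff, not a documented one
        problems.append(f"gene overlap {overlap:.0%} is below {min_overlap:.0%}")
    return problems
```

Returning a list of problems rather than raising on the first failure lets all issues with a dataset be reported at once.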

CELLxGENE Census Integration

What is CELLxGENE Census?

The CELLxGENE Census is a comprehensive reference atlas containing:

Millions of cells from healthy and diseased tissues
Standardized annotations with consistent cell type labels
Quality controlled data with uniform processing
Pre-trained models for embedding generation
Diverse tissues across human and mouse organisms

Reference Data Selection

The workflow automatically:

Filters by organism (human/mouse) based on your data
Applies tissue filters if specified (brain, blood, lung, etc.)
Samples proportionally to maintain cell type diversity
Limits the number of reference cells to fit computational constraints (up to 250k without similarity search, up to 1M with it)
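
Proportional sampling under a cell cap could look roughly like the sketch below. It is an illustration under stated assumptions, not the workflow's implementation: the function name, the fixed seed, and the rule of keeping at least one cell per type are all hypothetical.

```python
import random
from collections import defaultdict


def sample_proportionally(cell_types, max_cells, seed=0):
    """Return sorted indices of sampled cells, keeping each cell type's share roughly intact."""
    n = len(cell_types)
    if n <= max_cells:
        return list(range(n))  # nothing to drop

    # Group cell indices by their cell type label.
    groups = defaultdict(list)
    for idx, cell_type in enumerate(cell_types):
        groups[cell_type].append(idx)

    rng = random.Random(seed)
    chosen = []
    for members in groups.values():
        # Quota proportional to the type's share; at least one cell (assumed rule)
        # so that rare cell types are not lost entirely.
        quota = max(1, len(members) * max_cells // n)
        chosen.extend(rng.sample(members, min(quota, len(members))))

    # The one-cell minimum can push the total over the cap; trim at random if so.
    if len(chosen) > max_cells:
        chosen = rng.sample(chosen, max_cells)
    return sorted(chosen)
```

For example, sampling 10 cells from 80 T cells and 20 B cells keeps 8 T cells and 2 B cells.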

Computational Steps

1. Data Preparation

Your H5AD File
↓
- Load and validate dataset structure
- Extract organism from metadata
- Generate model embeddings (scVI/TranscriptFormer)
- Prepare for Census integration

2. Reference Data Query

CELLxGENE Census
↓
- Query based on organism (human/mouse)
- Apply tissue/condition filters if specified
- Sample cells proportionally by cell type
- Download reference metadata and embeddings
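
As an illustration of the query step, the sketch below builds a filter string in the value-filter syntax used by the Census Python API. The helper name and the primary-data clause are assumptions; the filters the workflow actually applies are not shown in this document. `tissue_general` and `is_primary_data` are column names from the Census cell metadata schema.

```python
# Census organism names corresponding to the two supported organisms.
ORGANISM_NAMES = {"human": "Homo sapiens", "mouse": "Mus musculus"}


def build_obs_filter(tissues=None, primary_only=True):
    """Build a cell metadata filter string, or return None when no filter applies."""
    clauses = []
    if primary_only:
        # Assumption: skip cells duplicated across Census datasets.
        clauses.append("is_primary_data == True")
    if tissues:
        # Tissue names are assumed not to contain quote characters.
        quoted = ", ".join(f"'{t}'" for t in tissues)
        clauses.append(f"tissue_general in [{quoted}]")
    return " and ".join(clauses) if clauses else None


# Possible use with the cellxgene_census package (not executed here):
#
#   import cellxgene_census
#   with cellxgene_census.open_soma() as census:
#       adata = cellxgene_census.get_anndata(
#           census,
#           organism=ORGANISM_NAMES["human"],
#           obs_value_filter=build_obs_filter(["brain", "blood"]),
#       )
```

Keeping the filter construction in a pure function makes it easy to inspect the exact query before any data is downloaded.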

3. Similarity Search (Optional)

Query Embeddings + Reference Embeddings
↓
- Build nearest neighbor index (HNSW algorithm)
- Find k=30 nearest neighbors for each query cell
- Select unique reference cells (up to 1M total)
- Filter reference data to similar cells only
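
The selection logic in this step can be sketched as follows. This version uses an exact brute-force search with Euclidean distance as a stand-in for the HNSW index, which is approximate and far faster on large references; the function name, the distance metric, and the way the cap is applied are assumptions.

```python
import math


def nearest_reference_cells(query, reference, k=30, max_cells=1_000_000):
    """Return sorted indices of the unique reference cells that are among the
    k nearest neighbors of at least one query cell, capped at max_cells."""
    selected = set()
    for q in query:
        # Rank every reference cell by distance to this query cell (exact search).
        ranked = sorted(range(len(reference)), key=lambda j: math.dist(q, reference[j]))
        selected.update(ranked[:k])
    # Many query cells share neighbors, so the set is usually much smaller than
    # len(query) * k. The cap here simply keeps the lowest indices (assumed rule).
    return sorted(selected)[:max_cells]
```

Usage on toy two-dimensional embeddings:

```python
reference = [(0, 0), (1, 0), (10, 10), (11, 10), (50, 50)]
query = [(0.1, 0), (10.5, 10)]
print(nearest_reference_cells(query, reference, k=2))  # [0, 1, 2, 3]
```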

4. Data Integration

Query Data + Reference Data
↓
- Align gene spaces between datasets
- Handle genes missing from either dataset with an outer join
- Combine observation metadata
- Label cells as "My data" vs "Reference"
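
The integration step can be illustrated with plain Python lists. This is a minimal sketch, not the workflow's code: the function names are hypothetical, and filling genes absent from one dataset with zero is an assumption about how the outer join handles missing values.

```python
def align(counts, genes, all_genes):
    """Reorder the columns of counts to all_genes, filling absent genes with 0."""
    column = {gene: i for i, gene in enumerate(genes)}
    return [
        [row[column[g]] if g in column else 0 for g in all_genes]
        for row in counts
    ]


def integrate(query_genes, query_counts, ref_genes, ref_counts):
    """Outer-join two cell-by-gene matrices and label each cell's origin."""
    seen = set(query_genes)
    # Union of both gene spaces: query genes first, then reference-only genes.
    all_genes = list(query_genes) + [g for g in ref_genes if g not in seen]
    matrix = align(query_counts, query_genes, all_genes) + align(ref_counts, ref_genes, all_genes)
    origin = ["My data"] * len(query_counts) + ["Reference"] * len(ref_counts)
    return all_genes, matrix, origin
```

Usage with one query cell and one reference cell that share only gene B:

```python
genes, matrix, origin = integrate(["A", "B"], [[1, 2]], ["B", "C"], [[3, 4]])
print(genes)   # ['A', 'B', 'C']
print(matrix)  # [[1, 2, 0], [0, 3, 4]]
print(origin)  # ['My data', 'Reference']
```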

Similarity Search Options

Enabled

Uses nearest neighbor search to find most similar reference cells
Includes up to 1M reference cells most similar to your data
More computationally intensive processing

Disabled

Random sampling of all matching reference cells
Includes up to 250k reference cells from filtered Census data
Faster processing

Configuration Options

Reference Filters

Available tissue filters include:

General tissue: Brain, blood, lung, heart, liver, kidney, etc.
All tissues: No tissue filtering (full organism reference)
Multiple selection: Choose multiple tissues simultaneously

Workflow Parameters

Model selection: scVI or TranscriptFormer required
Similarity search: Enable/disable nearest neighbor matching
Tissue filters: Select relevant tissues from dropdown
TranscriptFormer variant: Choose appropriate model variant
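
These parameters could be captured in a small configuration object. The sketch below is hypothetical: the class and field names are illustrative, and only the model restriction and the 1M / 250k reference cell limits come from this document.

```python
from dataclasses import dataclass, field
from typing import List, Optional

ALLOWED_MODELS = ("scVI", "TranscriptFormer")


@dataclass
class ComparisonConfig:
    """Illustrative container for the workflow parameters (names are assumed)."""
    model: str
    similarity_search: bool
    tissues: List[str] = field(default_factory=list)  # empty list = all tissues
    transcriptformer_variant: Optional[str] = None    # variant names are not listed here

    def __post_init__(self):
        # PCA and other embeddings are not compatible with comparison.
        if self.model not in ALLOWED_MODELS:
            raise ValueError(f"model must be one of {ALLOWED_MODELS}, got {self.model!r}")

    @property
    def max_reference_cells(self):
        """Reference cell limit: 1M with similarity search, 250k without."""
        return 1_000_000 if self.similarity_search else 250_000
```

Validating the model at construction time surfaces an incompatible choice before any data is prepared or downloaded.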

Expected Results

Processing Time

Data preparation: 5-10 minutes
Reference download: 10-20 minutes
Integration: 15-30 minutes
Total time: 30-60 minutes depending on dataset size

Visualization Output

Your completed comparison provides:

Interactive CELLxGENE Explorer

Combined dataset: Your cells + reference cells in same embedding space
Cell origin labels: Distinguish "My data" from "Reference" cells
Cell type annotations: Reference cell type labels for context
Gene expression: Compare expression patterns between datasets
Neighborhood analysis: See which reference cells are most similar
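
One common way to turn neighboring reference cells into a predicted cell type is a majority vote over their labels. The sketch below assumes that approach; this document does not state how the workflow performs classification, and the function name is hypothetical.

```python
from collections import Counter


def predict_cell_type(neighbor_labels):
    """Return (label, vote_fraction) by majority vote over reference neighbor labels.

    Returns (None, 0.0) when there are no neighbors to vote.
    """
    if not neighbor_labels:
        return None, 0.0
    counts = Counter(neighbor_labels)
    label, votes = counts.most_common(1)[0]
    # The vote fraction is a rough indicator of how unanimous the neighbors are.
    return label, votes / len(neighbor_labels)
```

A low vote fraction suggests that a query cell sits between reference populations and that its predicted label deserves a closer look in the explorer.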