Comparison Workflow

Last Updated: January 27, 2025

The Comparison Workflow integrates your data with the CELLxGENE Census reference atlas.

Overview

The Comparison Workflow enhances your dataset through six steps:

  1. Reference Selection - Filter CELLxGENE Census by tissue, disease, etc.
  2. Similarity Matching - Find reference cells similar to your data
  3. Data Integration - Combine your cells with selected reference cells
  4. Joint Analysis - Perform dimensionality reduction on the combined dataset
  5. Visualization - Generate a UMAP with query and reference cells labeled
  6. Classification - Predict cell types using reference annotations

Requirements

  • Human or mouse data only - Other organisms not supported
  • scVI or TranscriptFormer model - PCA not compatible with comparison
  • Gene overlap - Sufficient genes must match CELLxGENE Census
  • Compatible format - Standard H5AD requirements apply
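
The gene overlap requirement can be checked before submitting a dataset. The following is a minimal sketch, assuming genes are matched by identifier; the function name and the `MIN_OVERLAP` threshold are hypothetical, since the cutoff the workflow actually applies is not stated here.

```python
def gene_overlap_fraction(query_genes, reference_genes):
    """Return the fraction of query genes that also appear in the reference."""
    query = set(query_genes)
    if not query:
        return 0.0
    return len(query & set(reference_genes)) / len(query)


# Hypothetical threshold; the workflow's real cutoff is not documented here.
MIN_OVERLAP = 0.5

query_genes = ["ENSG00000141510", "ENSG00000012048", "ENSG00000139618", "ENSG_UNKNOWN"]
reference_genes = ["ENSG00000141510", "ENSG00000012048", "ENSG00000139618", "ENSG00000157764"]

fraction = gene_overlap_fraction(query_genes, reference_genes)
print(f"{fraction:.0%} of query genes match the reference")
print("OK" if fraction >= MIN_OVERLAP else "Too few shared genes")
```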

CELLxGENE Census Integration

What is CELLxGENE Census?

The CELLxGENE Census is a comprehensive reference atlas containing:

  • Millions of cells from healthy and diseased tissues
  • Standardized annotations with consistent cell type labels
  • Quality controlled data with uniform processing
  • Pre-trained models for embedding generation
  • Diverse tissues across human and mouse organisms

Reference Data Selection

The workflow automatically:

  1. Filters by organism (human/mouse) based on your data
  2. Applies tissue filters if specified (brain, blood, lung, etc.)
  3. Samples proportionally to maintain cell type diversity
  4. Caps the total number of reference cells to fit computational constraints (250k with random sampling, up to 1M with similarity search)
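
Proportional sampling under a cell cap can be illustrated as follows. This is a sketch of one possible scheme, not the workflow's actual sampler: `sample_proportionally` is a hypothetical helper, and giving every cell type at least one cell is an assumption made here to preserve diversity.

```python
import random
from collections import defaultdict


def sample_proportionally(cell_types, max_cells, seed=0):
    """Pick at most max_cells indices, keeping each cell type's share roughly intact."""
    total = len(cell_types)
    if total <= max_cells:
        return list(range(total))

    by_type = defaultdict(list)
    for index, cell_type in enumerate(cell_types):
        by_type[cell_type].append(index)

    # Quota proportional to frequency, with at least one cell per type (assumption)
    quotas = {t: max(1, max_cells * len(ix) // total) for t, ix in by_type.items()}
    # If the minimum-of-one rule pushed the sum over the cap, trim the largest types
    while sum(quotas.values()) > max_cells:
        largest = max(quotas, key=quotas.get)
        quotas[largest] -= 1

    rng = random.Random(seed)
    selected = []
    for cell_type, indices in by_type.items():
        selected.extend(rng.sample(indices, quotas[cell_type]))
    return sorted(selected)


cell_types = ["T cell"] * 80 + ["B cell"] * 15 + ["mast cell"] * 5
sample = sample_proportionally(cell_types, max_cells=20)
print(len(sample))
```

With the toy labels above, the 20 sampled cells split 16 / 3 / 1 across the three types, so the rare type is still represented.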

Computational Steps

1. Data Preparation

Your H5AD File
- Load and validate dataset structure
- Extract organism from metadata
- Generate model embeddings (scVI/TranscriptFormer)
- Prepare for Census integration

2. Reference Data Query

CELLxGENE Census
- Query based on organism (human/mouse)
- Apply tissue/condition filters if specified
- Sample cells proportionally by cell type
- Download reference metadata and embeddings
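
A reference query of this kind could be written against the public `cellxgene_census` Python package. The sketch below covers metadata only and is not the workflow's own code: the helper names are hypothetical, and restricting to `is_primary_data` (to avoid counting the same cell twice) is an assumption. Running `fetch_reference_obs` requires the package and network access; `build_value_filter` works on its own.

```python
def build_value_filter(tissues=None, primary_only=True):
    """Build a Census obs filter string from optional tissue names."""
    clauses = []
    if tissues:
        quoted = ", ".join(f"'{tissue}'" for tissue in tissues)
        clauses.append(f"tissue_general in [{quoted}]")
    if primary_only:
        # is_primary_data marks the primary copy of a cell that appears in several datasets
        clauses.append("is_primary_data == True")
    return " and ".join(clauses)


def fetch_reference_obs(organism, tissues=None):
    """Download reference cell metadata as a DataFrame (needs cellxgene-census)."""
    import cellxgene_census  # imported lazily so the filter helper works without it

    with cellxgene_census.open_soma(census_version="stable") as census:
        return cellxgene_census.get_obs(
            census,
            organism,  # "homo_sapiens" or "mus_musculus"
            value_filter=build_value_filter(tissues) or None,
            column_names=["cell_type", "tissue_general", "disease"],
        )


print(build_value_filter(["lung", "blood"]))
```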

3. Similarity Search (Optional)

Query Embeddings + Reference Embeddings
- Build nearest neighbor index (HNSW algorithm)
- Find k=30 nearest neighbors for each query cell
- Select unique reference cells (up to 1M total)
- Filter reference data to similar cells only
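
The selection logic in this step can be sketched with an exact brute-force search. Note the substitution: the workflow uses an HNSW approximate index, while the code below compares every query cell against every reference cell, which is only practical for small inputs. The function name is hypothetical, and how the workflow picks cells when the cap is exceeded is not documented, so the truncation here is arbitrary.

```python
import heapq
import math


def select_similar_reference_cells(query_embeddings, reference_embeddings,
                                   k=30, max_cells=1_000_000):
    """Return sorted indices of reference cells among the k nearest to any query cell."""
    selected = set()
    for query in query_embeddings:
        # Exact k-nearest neighbors by Euclidean distance (stand-in for HNSW)
        nearest = heapq.nsmallest(
            k,
            range(len(reference_embeddings)),
            key=lambda i: math.dist(query, reference_embeddings[i]),
        )
        selected.update(nearest)  # the set keeps each reference cell only once
    # Arbitrary truncation by index if the cap is exceeded (assumption)
    return sorted(selected)[:max_cells]


reference = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (9.0, 9.0)]
query = [(0.0, 0.2), (5.0, 5.4)]
print(select_similar_reference_cells(query, reference, k=2))
```

In the toy example the far-away reference cell at index 4 is never a neighbor of any query cell, so it is dropped from the reference set.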

4. Data Integration

Query Data + Reference Data
- Align gene spaces between datasets
- Handle missing genes with outer join
- Combine observation metadata
- Label cells as "My data" vs "Reference"
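
The outer join on genes can be illustrated with plain lists. This is a toy sketch rather than the workflow's implementation: `integrate` is a hypothetical helper, and filling genes that are absent from one dataset with 0.0 is an assumption. On real H5AD data the same idea is usually expressed with a library call such as `anndata.concat` with `join="outer"`.

```python
def integrate(query_genes, query_counts, ref_genes, ref_counts):
    """Outer-join two count matrices on gene name and label each row's origin."""
    # Union of genes: query order first, then reference-only genes
    all_genes = list(dict.fromkeys(list(query_genes) + list(ref_genes)))

    def realign(genes, rows):
        column = {gene: j for j, gene in enumerate(genes)}
        # Genes absent from a dataset are filled with 0.0 (an assumption of this sketch)
        return [
            [row[column[gene]] if gene in column else 0.0 for gene in all_genes]
            for row in rows
        ]

    matrix = realign(query_genes, query_counts) + realign(ref_genes, ref_counts)
    origin = ["My data"] * len(query_counts) + ["Reference"] * len(ref_counts)
    return all_genes, matrix, origin


genes, matrix, origin = integrate(
    ["CD3E", "CD19"], [[5.0, 0.0]],
    ["CD19", "MS4A1"], [[2.0, 7.0], [0.0, 1.0]],
)
print(genes)
print(origin)
```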

Similarity Search Options

Enabled

  • Uses nearest neighbor search to find most similar reference cells
  • Includes up to 1M reference cells most similar to your data
  • More computationally intensive processing

Disabled

  • Random sampling of all matching reference cells
  • Includes up to 250k reference cells from filtered Census data
  • Faster processing

Configuration Options

Reference Filters

Available tissue filters include:

  • General tissue: Brain, blood, lung, heart, liver, kidney, etc.
  • All tissues: No tissue filtering (full organism reference)
  • Multiple selection: Choose multiple tissues simultaneously

Workflow Parameters

  • Model selection: scVI or TranscriptFormer required
  • Similarity search: Enable/disable nearest neighbor matching
  • Tissue filters: Select relevant tissues from dropdown
  • TranscriptFormer variant: Choose appropriate model variant
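
These parameters and their constraints can be summarized in a small configuration object. The class and field names below are hypothetical and do not correspond to a documented API; the sketch only encodes rules stated on this page (scVI or TranscriptFormer required, 1M reference cells with similarity search, 250k without).

```python
from dataclasses import dataclass, field
from typing import List, Optional

SUPPORTED_MODELS = {"scvi", "transcriptformer"}


@dataclass
class ComparisonConfig:
    model: str
    similarity_search: bool = True
    tissues: List[str] = field(default_factory=list)  # empty list means all tissues
    transcriptformer_variant: Optional[str] = None

    def __post_init__(self):
        # PCA and other models are rejected because comparison needs a shared embedding
        if self.model.lower() not in SUPPORTED_MODELS:
            raise ValueError(
                f"Model {self.model!r} is not supported for comparison; "
                "use scVI or TranscriptFormer"
            )

    @property
    def max_reference_cells(self):
        """Reference cell cap implied by the similarity search setting."""
        return 1_000_000 if self.similarity_search else 250_000


config = ComparisonConfig(model="scVI", similarity_search=False, tissues=["lung"])
print(config.max_reference_cells)
```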

Expected Results

Processing Time

  • Data preparation: 5-10 minutes
  • Reference download: 10-20 minutes
  • Integration: 15-30 minutes
  • Total time: 30-60 minutes depending on dataset size

Visualization Output

Your completed comparison provides:

Interactive CELLxGENE Explorer

  • Combined dataset: Your cells + reference cells in same embedding space
  • Cell origin labels: Distinguish "My data" from "Reference" cells
  • Cell type annotations: Reference cell type labels for context
  • Gene expression: Compare expression patterns between datasets
  • Neighborhood analysis: See which reference cells are most similar