Tutorial: VariantFormer

VCF2Expression: Individual-Level Gene Expression Prediction with VariantFormer

Estimated time to complete: 15-20 minutes | Model: VariantFormer (1.2B parameters)

Learning Goals

  • Predict tissue-specific gene expression from individual genetic variants
  • Understand how mutations affect gene regulation across 63 human tissues and cell lines
  • Visualize sample-specific expression patterns on interactive anatomograms
  • Interpret the biological impact of genetic variation using a state-of-the-art foundation model

Prerequisites

For this browser tutorial: No setup required - fully interactive playground experience.

To run on your own compute:

  • Hardware: GPU with 40GB+ VRAM (NVIDIA H100 recommended)
  • Inference timing: ~30 seconds per gene across 63 tissues
  • Input Data: VCF file with genetic variants (GRCh38 reference genome)
  • Model: Pre-trained VariantFormer checkpoint (14GB)
  • Software: VariantFormer package with anatomogram visualization components

Setup

To run VariantFormer on your own compute, you'll need to set up the environment and install dependencies.

Complete setup instructions: VariantFormer GitHub Setup Guide

This includes:

  • Installing the VariantFormer package and dependencies
  • Downloading model checkpoints from public S3 bucket
  • Setting up reference genome files (GRCh38)

The browser playground below doesn't require any setup.

Understanding VariantFormer

VariantFormer is a 1.2-billion-parameter transformer foundation model that predicts how an individual's unique combination of genetic variants affects gene expression across all major tissues of the human body. This tutorial shows how to use it to analyze variant-to-expression relationships at the individual level. Its authors describe it as the first model capable of cross-gene, cross-tissue expression prediction from individual whole genomes.

Key Innovations

Mutation-Aware Architecture

  • Two-stage hierarchical design processes both regulatory regions and gene sequences
  • Pre-trained encoders capture variant effects on cis-regulatory elements (cCREs) and gene bodies
  • 25-layer CRE modulator and 25-layer gene modulator transformer stacks model complex regulatory interactions
  • Cross-attention mechanisms link distal regulatory elements to target genes

Tissue Specificity

  • Tissue context embeddings enable predictions across 63 GTEx tissues and cell lines
  • Captures tissue-specific regulatory effects (e.g., brain vs. liver expression differences)
  • Trained on paired whole-genome sequencing and RNA-seq data from GTEx v8

Scientific Validation

  • Model attention patterns correlate with experimental chromatin accessibility data
  • Validated against independent eQTL effect sizes across diverse populations
  • Predictions align with known genotype-phenotype associations in disease cohorts

Analysis Workflow

  1. Load VCF → Input genetic variants (SNPs, indels) in standard VCF format
  2. Select Genes & Tissues → Choose genes and tissues to analyze across organ systems
  3. VariantFormer Processing → Model predicts individual-level expression values
  4. Interactive Visualization → Explore results using anatomogram visualizations

Let's begin by loading a VCF file for analysis.

Running Inference on Your VCF

Load a VCF File

Note: This tutorial uses precomputed predictions so that it can run in the browser. The VCF loading step below is shown to illustrate the full workflow in a research setting with GPU access.

Below, we import our VCFProcessor class and specify the location of our VCF (Variant Call Format) file containing an individual's genetic variants.

from processors.vcfprocessor import VCFProcessor  # VariantFormer inference engine

# Initialize processor for VariantFormer PCG model
vcf_processor = VCFProcessor(model_class='v4_pcg')

# Path to your VCF file (GRCh38)
vcf_path = 'path/to/your_sample.vcf.gz'
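Before handing a VCF to the model, it can help to check that the file contains the variant types the workflow expects (SNPs and indels). The following is a minimal sketch using only the Python standard library; it is not part of the VariantFormer package, and the helper names (`classify_variant`, `summarize_vcf_lines`, `summarize_vcf`) and the toy records are assumptions made for illustration.

```python
import gzip
from collections import Counter


def classify_variant(ref: str, alt: str) -> str:
    """Label one REF/ALT pair as 'snp', 'indel', or 'other'."""
    if alt.startswith("<") or alt in {"*", "."}:
        return "other"  # symbolic, spanning-deletion, or missing allele
    if len(ref) == 1 and len(alt) == 1:
        return "snp"
    if len(ref) != len(alt):
        return "indel"
    return "other"  # e.g. multi-nucleotide substitutions


def summarize_vcf_lines(lines):
    """Count variant types over an iterable of VCF text lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4]
        for alt in alts.split(","):  # multi-allelic sites list several ALTs
            counts[classify_variant(ref, alt)] += 1
    return counts


def summarize_vcf(path):
    """Open a plain or gzip-compressed VCF and summarize it."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return summarize_vcf_lines(handle)


# Toy records for illustration only
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr19\t44908684\t.\tT\tC\t.\tPASS\t.\n",
    "chr19\t44908822\t.\tC\tT,CA\t.\tPASS\t.\n",
    "chr19\t44909000\t.\tGAT\tG\t.\tPASS\t.\n",
]
summary = summarize_vcf_lines(example)
print(dict(summary))
```

On a real file, `summarize_vcf(vcf_path)` would read the compressed VCF line by line, so memory use stays small even for whole-genome inputs.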

Configure Analysis Parameters

Selecting Genes and Tissues

VariantFormer can analyze expression for any protein-coding gene across 63 biological contexts: 55 GTEx tissues and 8 cell lines. This comprehensive analysis reveals how genetic variants affect gene regulation differently across organ systems.

Why tissue-specific analysis matters:

  • The same genetic variant can increase expression in one tissue while decreasing it in another
  • Tissue-specific regulatory effects are critical for understanding genotype-phenotype relationships
  • VariantFormer's tissue embeddings capture context-dependent gene regulation mechanisms

Tissue Coverage:

Analysis includes all major organ systems: 16 brain regions, cardiovascular (heart, blood, arteries), digestive (liver, pancreas, stomach), respiratory (lung), urinary (kidney), musculoskeletal, endocrine, and immune tissues, plus 8 cancer cell lines commonly used in genomics research.

Explore Available Tissues and Genes

Before querying data, let's explore what tissues and genes are available in the precomputed dataset. This will help us understand the scope of available predictions and choose appropriate targets for our analysis.

# Get available genes and tissues from VariantFormer
genes_df = vcf_processor.get_genes()  # ~18,439 protein-coding genes
tissues = vcf_processor.get_tissues()  # 63 tissues (55 GTEx + 8 cell lines)

# Select genes for analysis
selected_genes = ['ENSG00000130203']  # APOE gene ID example
selected_tissues = list(tissues)  # All 63 tissues

Prepare Query Data

Now we'll prepare our query data, which specifies:

  • gene_id: Which gene we want to retrieve predictions for
  • tissues: Which tissues/cell types we're interested in

For comprehensive anatomogram visualization, we'll query multiple tissues.

import pandas as pd

# Create query DataFrame
query_df = pd.DataFrame({
    'gene_id': selected_genes,
    'tissues': [','.join(selected_tissues)]
})

# Create PyTorch dataset and dataloader for VariantFormer
vcf_dataset, dataloader = vcf_processor.create_data(vcf_path, query_df)

print(f"Dataset: {len(vcf_dataset)} samples, {len(dataloader)} batches")
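The query above covers a single gene. A multi-gene query could plausibly follow the same layout, with one row per gene and the tissues joined into one comma-separated string. The sketch below builds such rows as plain dictionaries; whether `create_data` accepts several rows in exactly this form is an assumption, and `build_query_rows` and the tissue names are hypothetical.

```python
def build_query_rows(gene_ids, tissues):
    """Return one row per gene, with tissues joined by commas."""
    bad = [t for t in tissues if "," in t]
    if bad:
        # A comma inside a name would corrupt the comma-separated field
        raise ValueError(f"tissue names must not contain commas: {bad}")
    unique_genes = list(dict.fromkeys(gene_ids))  # drop duplicates, keep order
    tissue_field = ",".join(tissues)
    return [{"gene_id": g, "tissues": tissue_field} for g in unique_genes]


rows = build_query_rows(
    ["ENSG00000130203", "ENSG00000142192", "ENSG00000130203"],  # second ID is a placeholder
    ["whole blood", "liver"],  # placeholder tissue names
)
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame(rows)` yields a frame with the same `gene_id` and `tissues` columns as the single-gene example.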

Running VariantFormer Predictions

In research environments with GPU infrastructure, VariantFormer processes the individual's genetic variants through its neural architecture:

1. Model Loading

  • Loading 1.2 billion parameters including tissue-specific embedding modules
  • Initializing mutation-aware encoders and 50-layer transformer architecture
  • Setting up GPU acceleration for efficient inference

2. Variant Processing

  • Parsing genetic variants from VCF (SNPs, indels)
  • Mapping variants to cis-regulatory elements and gene bodies
  • Tokenizing sequences with BPE vocabulary

3. Expression Prediction

  • CRE modulator analyzes regulatory perturbations
  • Gene modulator integrates sequence context
  • Tissue-specific embeddings produce predictions

4. Output Generation

  • Individual-level expression values per gene-tissue pair
  • Results formatted for visualization

# Load VariantFormer model (~2-3 minutes, one-time)
model, checkpoint_path, trainer = vcf_processor.load_model()

# Run predictions (~30 seconds per gene on H100)
expression_predictions = vcf_processor.predict(
    model, checkpoint_path, trainer, dataloader, vcf_dataset
)

# Results DataFrame: predicted expression per gene-tissue pair
print(f"Predictions shape: {expression_predictions.shape}")
print(expression_predictions.head())
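The variant processing step described above maps variants to cis-regulatory elements and gene bodies. The model's own mapping code is not shown in this tutorial, so the following is only a sketch of the general idea: a binary search over sorted, non-overlapping intervals. The function name, the half-open coordinate convention, and the toy regions are all assumptions.

```python
from bisect import bisect_right


def map_variants_to_regions(variant_positions, regions):
    """Assign each position to the region that contains it, if any.

    regions: list of (start, end, name) tuples, half-open [start, end),
    assumed to be non-overlapping and on a single chromosome.
    """
    ordered = sorted(regions)
    starts = [region[0] for region in ordered]
    hits = {}
    for pos in variant_positions:
        idx = bisect_right(starts, pos) - 1  # last region starting at or before pos
        if idx >= 0 and pos < ordered[idx][1]:
            hits.setdefault(ordered[idx][2], []).append(pos)
    return hits


# Toy coordinates for illustration only
regions = [(100, 200, "cCRE_1"), (500, 650, "cCRE_2"), (1000, 3000, "gene_body")]
hits = map_variants_to_regions([150, 199, 200, 640, 2500, 5000], regions)
print(hits)
```

Positions 200 and 5000 fall outside every interval and are dropped, which mirrors how variants outside the modeled regulatory landscape would not contribute to a prediction in this simplified picture.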

The playground below uses precomputed predictions for instant results:

Analyze Results

Let's examine the retrieved prediction results in detail. The output contains:

  • Original query information (gene_id, tissue_name, tissue_id)
  • predicted_expression: Precomputed gene expression level predictions
  • embeddings: High-dimensional representations capturing regulatory context

We'll explore the structure of the results and provide some basic analysis.
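One simple first look at the results is to rank tissues by predicted expression for a gene. The sketch below works on plain dictionaries that reuse the column names listed above (`gene_id`, `tissue_name`, `predicted_expression`); the numeric values are invented placeholders, not model output, and `top_tissues` is a hypothetical helper.

```python
def top_tissues(records, gene_id, n=3):
    """Return the n tissues with the highest predicted expression for a gene."""
    rows = [r for r in records if r["gene_id"] == gene_id]
    rows.sort(key=lambda r: r["predicted_expression"], reverse=True)
    return [(r["tissue_name"], r["predicted_expression"]) for r in rows[:n]]


# Placeholder values for illustration only
records = [
    {"gene_id": "ENSG00000130203", "tissue_name": "liver", "predicted_expression": 9.1},
    {"gene_id": "ENSG00000130203", "tissue_name": "lung", "predicted_expression": 6.4},
    {"gene_id": "ENSG00000130203", "tissue_name": "brain cortex", "predicted_expression": 8.2},
    {"gene_id": "ENSG00000130203", "tissue_name": "whole blood", "predicted_expression": 3.0},
]
ranking = top_tissues(records, "ENSG00000130203", n=2)
print(ranking)
```

If the results are held in a pandas DataFrame, `sort_values("predicted_expression", ascending=False)` gives the same ordering.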

Interactive Expression Visualization

Understanding Individual-Level Expression Predictions

The anatomogram displays predicted gene expression levels across human tissues conditioned on the individual's variant profile. This visualization shows how variant-specific regulatory effects manifest across different tissue contexts.

How to Interpret the Results:

Color Intensity

  • Warmer colors (red/yellow) = Higher predicted expression
  • Cooler colors (blue/purple) = Lower predicted expression
  • Gray = No data or very low expression
  • Values represent log2-transformed RNA abundance predictions
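To make the color mapping concrete, here is a minimal sketch of how per-tissue values could be scaled linearly onto a 0 to 1 range before a color palette is applied, with missing values kept separate so they can be drawn gray. This is an illustration of the general technique, not the widget's actual implementation; `normalize_for_colormap` is a hypothetical helper.

```python
def normalize_for_colormap(values):
    """Min-max scale values to [0, 1]; None is preserved for 'no data'."""
    present = [v for v in values.values() if v is not None]
    if not present:
        return {tissue: None for tissue in values}
    lo, hi = min(present), max(present)
    span = hi - lo
    scaled = {}
    for tissue, value in values.items():
        if value is None:
            scaled[tissue] = None      # rendered gray
        elif span == 0:
            scaled[tissue] = 0.5       # all values equal: use the palette midpoint
        else:
            scaled[tissue] = (value - lo) / span
    return scaled


scaled = normalize_for_colormap(
    {"liver": 8.0, "lung": 4.0, "heart": 6.0, "skin": None}
)
print(scaled)
```

Because the scaling is relative to the minimum and maximum of the displayed values, the same color can correspond to different absolute expression levels for different genes.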

Interactive Features

  • Hover over tissues to see expression values and tissue annotations
  • Click to view UBERON tissue ontology details
  • Switch tabs for male, female, and brain-specific anatomical views
  • Enhanced tooltips provide tissue system classifications

Scientific Interpretation

  • Expression levels reflect the combined regulatory effects of all variants affecting the gene
  • Tissue-specific patterns reveal differential regulatory logic across cell types
  • Cross-tissue comparison identifies genes with ubiquitous vs. restricted expression
  • Results can be compared to GTEx population distributions to assess variant rarity

Note: These are model predictions conditioned on the individual's variant genotype. They represent the model's learned mapping from variant profiles to expression phenotypes, based on GTEx training data.

# Initialize converter for anatomogram visualization
enhanced_converter = EnhancedVCFExpressionConverter(
    aggregation_strategy='mean'
)

# Convert predictions to anatomogram format
anatomagram_data, enhanced_metadata = enhanced_converter.convert_predictions_to_anatomogram(
    expression_predictions,
    gene_name='APOE'
)

# Create interactive multi-view widget
multi_widget = AnatomagramMultiViewWidget(
    visualization_data=anatomagram_data,
    selected_item='APOE',
    available_views=["male", "female", "brain"],
    color_palette='viridis',
    scale_type='linear',
    uberon_names=enhanced_metadata['uberon_names'],
    enhanced_tooltips=enhanced_metadata['enhanced_tooltips']
)

# Display widget (works in Jupyter, Marimo, or Streamlit)
multi_widget
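The converter above is configured with `aggregation_strategy='mean'`. Its internals are not shown in this tutorial, but one plausible reading is that several predicted tissues can map to a single anatomical region in the diagram, and their values are averaged. The sketch below illustrates that idea only; the function name, the region labels, and the tissue-to-region mapping are assumptions.

```python
from collections import defaultdict
from statistics import fmean


def aggregate_by_region(predictions, tissue_to_region):
    """Average predicted values over all tissues mapped to the same region."""
    grouped = defaultdict(list)
    for tissue, value in predictions.items():
        region = tissue_to_region.get(tissue)
        if region is not None:  # tissues without a drawable region are skipped
            grouped[region].append(value)
    return {region: fmean(vals) for region, vals in grouped.items()}


# Placeholder mapping and values for illustration only
tissue_to_region = {
    "brain cortex": "region_brain",
    "brain cerebellum": "region_brain",
    "liver": "region_liver",
}
predictions = {
    "brain cortex": 5.0,
    "brain cerebellum": 7.0,
    "liver": 9.0,
    "cell line X": 4.0,  # no anatomical region, so it is not drawn
}
region_values = aggregate_by_region(predictions, tissue_to_region)
print(region_values)
```

Averaging hides within-region differences, so a dedicated view (such as the brain tab) remains the better place to compare closely related tissues.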

Genome-Wide Expression Analysis

Tissue Hierarchical Clustering Dendrogram

Understanding Tissue Relationships

The dendrogram below shows hierarchical clustering of tissues based on their gene expression correlation patterns. Tissues with similar predicted expression profiles cluster together, revealing biological relationships across organ systems.

Interpretation:

  • Vertical distance: Height indicates dissimilarity between clusters (larger = more different)
  • Branch structure: Tissues that merge at lower heights have more similar expression profiles
  • Biological insight: Clustering often groups tissues by developmental origin or physiological function

Expected patterns:

  • Brain regions cluster together (shared neural regulatory programs)
  • Metabolically active tissues group by function (liver, muscle, adipose)
  • Hormone-responsive tissues may cluster (breast, ovary, prostate)
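The clustering idea can be sketched in a few lines of standard-library Python. The example below computes correlation distance (1 minus the Pearson correlation) between tissue profiles and merges clusters with average linkage. Average linkage is used here only to keep the sketch short; it is a stand-in, not necessarily the linkage used to draw the dendrogram, and the profiles are toy values.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_distance(x, y):
    return 1.0 - pearson(x, y)


def average_linkage(profiles):
    """Naive agglomerative clustering; returns merges as (a, b, height)."""
    clusters = [(name,) for name in profiles]
    merges = []

    def dist(a, b):
        pairs = [(i, j) for i in a for j in b]
        total = sum(correlation_distance(profiles[i], profiles[j]) for i, j in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        # Pick the closest pair of clusters; ties resolve by index order
        height, ia, ib = min(
            (dist(a, b), ia, ib)
            for ia, a in enumerate(clusters)
            for ib, b in enumerate(clusters)
            if ia < ib
        )
        a, b = clusters[ia], clusters[ib]
        merges.append((a, b, height))
        clusters = [c for k, c in enumerate(clusters) if k not in (ia, ib)] + [a + b]
    return merges


# Toy expression profiles (four genes per tissue), for illustration only
profiles = {
    "brain_cortex": [9, 8, 1, 2],
    "brain_cerebellum": [8, 9, 2, 1],
    "liver": [1, 2, 9, 8],
    "muscle": [2, 1, 8, 9],
}
merges = average_linkage(profiles)
for a, b, height in merges:
    print(a, b, round(height, 3))
```

The two brain profiles merge first at a low height, and the brain and non-brain groups only join at the top of the tree, which is the pattern the interpretation notes above describe.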

Interactive Heatmap Visualization

This section creates an interactive clustered heatmap showing expression patterns across ~18,000 genes and 63 tissues. The heatmap uses hierarchical clustering to reveal biological relationships between genes and tissues.

Key Features:

  • Full Matrix: ~18k genes × 63 tissues (~1.1M cells)
  • WebGL Rendering: Efficient pan/zoom for large matrices
  • Hierarchical Clustering: Correlation distance + Ward linkage
  • Transformations: log1p, z-score normalization, outlier clipping
  • Drill-Down: Function to explore specific gene subsets

How to Interpret:

  • Color intensity shows relative expression (z-scored)
  • Clustered ordering groups similar genes/tissues together
  • Use pan/zoom to explore regions of interest
  • Hover for detailed gene/tissue/value information

💡 Tip: Use pan/zoom to explore. Plotly automatically uses WebGL for efficient rendering of large matrices.

Understanding Model Predictions

Scientific Interpretation of Expression Outputs

Expression predictions represent the model's learned associations between variant profiles and tissue-specific gene regulation. Key interpretation principles:

What the Values Represent

  • Expression values are log2-scale RNA abundance predictions trained on GTEx data
  • Higher values indicate stronger predicted transcriptional activity in that tissue context
  • Predictions are individual-level (conditioned on variant genotype), not population averages
  • Values should be interpreted relative to the gene's typical expression range across tissues
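Because the values are on a log2 scale, a difference of one unit corresponds to roughly a two-fold difference in abundance. The sketch below shows that arithmetic and one way to express each tissue relative to the gene's median across tissues. The helper names and numbers are illustrative, and the fold-change reading is exact only if no pseudocount was added before the log transform, which this tutorial does not specify.

```python
from statistics import median


def fold_change(log2_a, log2_b):
    """Fold change of a over b, given two log2-scale values."""
    return 2.0 ** (log2_a - log2_b)


def relative_to_median(log2_by_tissue):
    """Offset of each tissue from the gene's median log2 value."""
    center = median(log2_by_tissue.values())
    return {tissue: value - center for tissue, value in log2_by_tissue.items()}


# Placeholder values for illustration only
example_gene = {"liver": 9.0, "lung": 6.0, "brain": 7.0}
offsets = relative_to_median(example_gene)
print(offsets)
print(fold_change(example_gene["liver"], example_gene["brain"]))
```

Here the liver value sits two log2 units above the median, which corresponds to about four times the median abundance.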

Scientific Context

VariantFormer learns variant-to-expression mappings from examples:

  • Regulatory variants can increase or decrease expression depending on their effect on TF binding, chromatin accessibility, or RNA processing
  • Tissue specificity arises from tissue-specific enhancers, epigenetic states, and regulatory factor expression
  • Compound effects: Multiple variants across a gene's regulatory landscape contribute additively or non-additively
  • Model validation: Predictions correlate with eQTL effect sizes in held-out populations

Next Steps

Research Applications:

  1. Multi-gene analysis - Analyze co-regulated gene sets or pathways
  2. Cohort comparison - Compare predictions across individuals with different variant profiles
  3. eQTL validation - Compare predictions to experimental eQTL effect estimates
  4. Downstream modeling - Use predictions as input for disease risk models or functional scoring
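As a starting point for cohort comparison, the sketch below contrasts predicted log2 expression for the same gene in two individuals and keeps only tissues where the difference exceeds a threshold. The function name, the threshold of 0.5 log2 units, and the values are assumptions chosen for illustration.

```python
def compare_individuals(pred_a, pred_b, min_abs_diff=0.5):
    """Per-tissue difference (b minus a) for tissues present in both inputs."""
    shared = sorted(set(pred_a) & set(pred_b))
    diffs = {tissue: pred_b[tissue] - pred_a[tissue] for tissue in shared}
    return {t: d for t, d in diffs.items() if abs(d) >= min_abs_diff}


# Placeholder predictions for one gene in two individuals
individual_a = {"liver": 8.0, "lung": 5.0, "brain": 6.0}
individual_b = {"liver": 9.0, "lung": 5.1, "brain": 5.0, "skin": 3.0}
notable = compare_individuals(individual_a, individual_b)
print(notable)
```

In this toy case the difference is positive in liver and negative in brain, the kind of opposite-direction pattern that would be worth following up with eQTL data or experiments.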

Acknowledgments & Citations

Anatomogram Visualizations

The anatomical diagrams used in this tutorial are derived from the Expression Atlas project and are licensed under Creative Commons Attribution 4.0 International License.

Citation: Moreno P, Fexova S, George N, et al. Expression Atlas update: gene and protein expression in multiple species. Nucleic Acids Research. 2022;50(D1):D129-D140. doi:10.1093/nar/gkab1030

License: Creative Commons Attribution 4.0 International (CC BY 4.0)

Source: Expression Atlas, EMBL-EBI

The anatomogram SVG assets have been integrated into the VariantFormer visualization framework to provide interactive tissue-specific expression mapping.

VariantFormer Model

Citation: Sayan Ghosal, Youssef Barhomi, Tejaswini Ganapathi, Amy Krystosik, Lakshmi Krishnan, Sashidhar Guntury, Donghui Li, Alzheimer's Disease Neuroimaging Initiative, Francesco Paolo Casale, and Theofanis Karaletsos. VariantFormer: A hierarchical transformer integrating DNA sequences with genetic variation and regulatory landscapes for personalized gene expression prediction (2025). bioRxiv. doi:10.1101/2025.10.31.685862

Training Data: GTEx v10 paired whole-genome sequencing and RNA-seq data

Responsible Use Statement

This tool is provided exclusively for research and educational purposes. Important considerations:

Research Tool Disclaimer

  • VariantFormer is a research model, not a clinical diagnostic tool
  • Predictions should not be used for medical decision-making without appropriate validation
  • This tool does not provide medical advice, diagnosis, or treatment recommendations
  • Consult qualified healthcare professionals for any health-related questions

Scientific Limitations

  • Predictions are based on GTEx training data and may not generalize to all populations
  • Rare or novel variants may have uncertain predicted effects
  • Model does not account for environmental factors, epigenetic variation, or post-transcriptional regulation
  • Expression predictions are probabilistic and should be validated experimentally when possible

Data Privacy

  • VCF data is processed locally and is not uploaded to external servers
  • Users are responsible for ensuring compliance with relevant data governance policies
  • Handle genetic data according to institutional IRB protocols and privacy regulations

Acceptable Use

Please follow the CZI Acceptable Use Policy when using this tool. This tool is intended for:

  • Academic research and genomics education
  • Exploratory analysis of variant-to-expression relationships
  • Hypothesis generation for experimental validation

Not intended for:

  • Clinical diagnosis or treatment decisions
  • Direct-to-consumer genetic interpretation
  • Insurance or employment decisions

Thank you for using VCF2Expression with VariantFormer. We hope this research tool advances scientific understanding of variant regulatory effects and gene expression biology.
