Tutorial: VariantFormer
VCF2Expression: Individual-Level Gene Expression Prediction with VariantFormer
Estimated time to complete: 15-20 minutes | Model: VariantFormer (1.2B parameters)
Learning Goals
- Predict tissue-specific gene expression from individual genetic variants
- Understand how mutations affect gene regulation across 63 human tissues and cell lines
- Visualize sample-specific expression patterns on interactive anatomograms
- Interpret the biological impact of genetic variation using a state-of-the-art foundation model
Prerequisites
For this browser tutorial: No setup required - fully interactive playground experience.
To run on your own compute:
- Hardware: GPU with 40GB+ VRAM (NVIDIA H100 recommended)
- Inference timing: ~30 seconds per gene across 63 tissues
- Input Data: VCF file with genetic variants (GRCh38 reference genome)
- Model: Pre-trained VariantFormer checkpoint (14GB)
- Software: VariantFormer package with anatomogram visualization components
Setup
To run VariantFormer on your own compute, you'll need to set up the environment and install dependencies.
Complete setup instructions: VariantFormer GitHub Setup Guide
This includes:
- Installing the VariantFormer package and dependencies
- Downloading model checkpoints from the public S3 bucket
- Setting up reference genome files (GRCh38)
The browser playground below doesn't require any setup.
Understanding VariantFormer
VariantFormer is a 1.2-billion-parameter transformer foundation model that predicts how an individual's unique combination of genetic variants affects gene expression across all major tissues of the human body. It is the first model capable of cross-gene, cross-tissue expression prediction from individual whole genomes. This tutorial shows how to use it to analyze variant-to-expression relationships at the individual level.
Key Innovations
Mutation-Aware Architecture
- Two-stage hierarchical design processes both regulatory regions and gene sequences
- Pre-trained encoders capture variant effects on cis-regulatory elements (cCREs) and gene bodies
- 25-layer CRE modulator and 25-layer gene modulator transformer stacks model complex regulatory interactions
- Cross-attention mechanisms link distal regulatory elements to target genes
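To make the cross-attention idea concrete, here is a minimal NumPy sketch of generic single-head scaled dot-product cross-attention, in which gene tokens act as queries and regulatory-element tokens act as keys and values. It is a textbook form shown for intuition only; the dimensions, weight matrices, and function name are illustrative assumptions and do not reflect VariantFormer's actual implementation.

```python
import numpy as np

def cross_attention(gene_tokens, cre_tokens, w_q, w_k, w_v):
    """Generic single-head scaled dot-product cross-attention (illustrative only)."""
    q = gene_tokens @ w_q                     # queries come from gene tokens
    k = cre_tokens @ w_k                      # keys come from regulatory tokens
    v = cre_tokens @ w_v                      # values come from regulatory tokens
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (n_gene, n_cre) similarity scores
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over CRE tokens
    return weights @ v, weights

# Random toy inputs: 4 gene tokens and 6 regulatory tokens of width 8 (all hypothetical)
rng = np.random.default_rng(0)
d_model = 8
gene_tokens = rng.normal(size=(4, d_model))
cre_tokens = rng.normal(size=(6, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

attended, attn_weights = cross_attention(gene_tokens, cre_tokens, w_q, w_k, w_v)
print(attended.shape, attn_weights.shape)
```

Each row of `attn_weights` sums to 1 and can be read as how strongly one gene token draws on each regulatory token.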
Tissue Specificity
- Tissue context embeddings enable predictions across 63 GTEx tissues and cell lines
- Captures tissue-specific regulatory effects (e.g., brain vs. liver expression differences)
- Trained on paired whole-genome sequencing and RNA-seq data from GTEx v8
Scientific Validation
- Model attention patterns correlate with experimental chromatin accessibility data
- Validated against independent eQTL effect sizes across diverse populations
- Predictions align with known genotype-phenotype associations in disease cohorts
Analysis Workflow
- Load VCF → Input genetic variants (SNPs, indels) in standard VCF format
- Select Genes & Tissues → Choose genes and tissues to analyze across organ systems
- VariantFormer Processing → Model predicts individual-level expression values
- Interactive Visualization → Explore results using anatomogram visualizations
Let's begin by loading a VCF file for analysis.
Running Inference on Your VCF
Load a VCF File
Note: This tutorial uses precomputed predictions so that it can run in the browser. The VCF loading step below is shown to illustrate the full workflow in a research setting with GPU access.
Below, we import our VCFProcessor class and specify the location of our VCF (Variant Call Format) file containing an individual's genetic variants.
from processors.vcfprocessor import VCFProcessor # VariantFormer inference engine
# Initialize processor for VariantFormer PCG model
vcf_processor = VCFProcessor(model_class='v4_pcg')
# Path to your VCF file (GRCh38)
vcf_path = 'path/to/your_sample.vcf.gz'
Configure Analysis Parameters
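Before configuring an analysis on your own data, it can help to confirm that the VCF looks as expected (reference build, sample names, a few records). The sketch below uses only the Python standard library and is independent of the VariantFormer package. The helper name `summarize_vcf` and the tiny demo file are made up for illustration.

```python
import gzip
import os
import tempfile

def summarize_vcf(path, max_records=3):
    """Collect reference/contig header lines, sample names, and the first few records."""
    opener = gzip.open if str(path).endswith(".gz") else open
    build_lines, samples, records = [], [], []
    with opener(path, "rt") as handle:
        for raw in handle:
            line = raw.rstrip("\n")
            if line.startswith("##"):
                # Header metadata; the reference build is usually recorded here
                if line.startswith(("##reference", "##contig")):
                    build_lines.append(line)
            elif line.startswith("#CHROM"):
                # The first nine columns are fixed; the rest are sample names
                samples = line.split("\t")[9:]
            elif line:
                chrom, pos, _id, ref, alt = line.split("\t")[:5]
                records.append((chrom, int(pos), ref, alt))
                if len(records) >= max_records:
                    break
    return build_lines, samples, records

# Tiny made-up VCF so the sketch runs without real data
demo_text = (
    "##fileformat=VCFv4.2\n"
    "##reference=GRCh38\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1\n"
    "chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/1\n"
    "chr1\t250\t.\tCT\tC\t.\tPASS\t.\tGT\t1/1\n"
)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.vcf.gz")
with gzip.open(demo_path, "wt") as out:
    out.write(demo_text)

build_lines, samples, records = summarize_vcf(demo_path)
print(build_lines, samples, records)
```

Pointing `summarize_vcf` at your own file is a quick way to spot a header that does not mention GRCh38 before spending GPU time on inference.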
Selecting Genes and Tissues
VariantFormer can predict expression for any protein-coding gene across 63 contexts: 55 GTEx tissues and 8 cell lines. This breadth shows how genetic variants affect gene regulation differently across organ systems.
Why tissue-specific analysis matters:
- The same genetic variant can increase expression in one tissue while decreasing it in another
- Tissue-specific regulatory effects are critical for understanding genotype-phenotype relationships
- VariantFormer's tissue embeddings capture context-dependent gene regulation mechanisms
Tissue Coverage:
Analysis includes all major organ systems: 16 brain regions, cardiovascular (heart, blood, arteries), digestive (liver, pancreas, stomach), respiratory (lung), urinary (kidney), musculoskeletal, endocrine, and immune tissues, plus 8 cancer cell lines commonly used in genomics research.
Explore Available Tissues and Genes
Before querying data, let's explore what tissues and genes are available in the precomputed dataset. This will help us understand the scope of available predictions and choose appropriate targets for our analysis.
# Get available genes and tissues from VariantFormer
genes_df = vcf_processor.get_genes() # ~18,439 protein-coding genes
tissues = vcf_processor.get_tissues() # 63 tissues (55 GTEx + 8 cell lines)
# Select genes for analysis
selected_genes = ['ENSG00000130203'] # APOE gene ID example
selected_tissues = list(tissues) # All 63 tissues
Prepare Query Data
Now we'll prepare our query data, which specifies:
- gene_id: Which gene we want to retrieve predictions for
- tissues: Which tissues/cell types we're interested in
For comprehensive anatomogram visualization, we'll query multiple tissues.
import pandas as pd
# Create query DataFrame: one row per gene, tissues as a comma-separated string
query_df = pd.DataFrame({
    'gene_id': selected_genes,
    'tissues': [','.join(selected_tissues)]
})
# Create PyTorch dataset and dataloader for VariantFormer
vcf_dataset, dataloader = vcf_processor.create_data(vcf_path, query_df)
print(f"Dataset: {len(vcf_dataset)} samples, {len(dataloader)} batches")
Running VariantFormer Predictions
In research environments with GPU infrastructure, VariantFormer processes the individual's genetic variants through its neural architecture:
1. Model Loading
- Loading 1.2 billion parameters including tissue-specific embedding modules
- Initializing mutation-aware encoders and 50-layer transformer architecture
- Setting up GPU acceleration for efficient inference
2. Variant Processing
- Parsing genetic variants from VCF (SNPs, indels)
- Mapping variants to cis-regulatory elements and gene bodies
- Tokenizing sequences with BPE vocabulary
3. Expression Prediction
- CRE modulator analyzes regulatory perturbations
- Gene modulator integrates sequence context
- Tissue-specific embeddings produce predictions
4. Output Generation
- Individual-level expression values per gene-tissue pair
- Results formatted for visualization
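As a toy illustration of the variant mapping step, the sketch below assigns variant positions to the annotated intervals that contain them using binary search. The interval names and coordinates are invented, and the real pipeline's data structures are not shown in this tutorial, so treat this as a conceptual sketch rather than VariantFormer's own code.

```python
from bisect import bisect_right

def map_variants_to_intervals(variant_positions, intervals):
    """Assign each 1-based position to the interval containing it.

    `intervals` must be sorted, non-overlapping (start, end, name) tuples
    with 1-based inclusive coordinates. Positions outside every interval
    are dropped.
    """
    starts = [start for start, _, _ in intervals]
    hits = {}
    for pos in variant_positions:
        idx = bisect_right(starts, pos) - 1   # last interval starting at or before pos
        if idx >= 0 and pos <= intervals[idx][1]:
            hits.setdefault(intervals[idx][2], []).append(pos)
    return hits

# Hypothetical regulatory elements and a gene body on one chromosome
intervals = [
    (100, 200, "enhancer_A"),
    (500, 650, "promoter_B"),
    (900, 1200, "gene_body_C"),
]
variants = [150, 300, 500, 650, 1201, 950]

hits = map_variants_to_intervals(variants, intervals)
print(hits)
```

Positions 300 and 1201 fall outside every interval and are dropped, which mirrors the idea that only variants overlapping modeled regions contribute to a prediction.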
# Load VariantFormer model (~2-3 minutes, one-time)
model, checkpoint_path, trainer = vcf_processor.load_model()
# Run predictions (~30 seconds per gene on H100)
expression_predictions = vcf_processor.predict(
    model, checkpoint_path, trainer, dataloader, vcf_dataset
)
# Results DataFrame: predicted expression per gene-tissue pair
print(f"Predictions shape: {expression_predictions.shape}")
print(expression_predictions.head())
The playground below uses precomputed predictions for instant results:
Analyze Results
Let's examine the retrieved prediction results in detail. The output contains:
- Original query information (gene_id, tissue_name, tissue_id)
- predicted_expression: Precomputed gene expression level predictions
- embeddings: High-dimensional representations capturing regulatory context
We'll explore the structure of the results and provide some basic analysis.
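A basic analysis of this kind might look like the pandas sketch below. It assumes a long-format table with one row per gene-tissue pair and the column names listed above. The tissue labels and expression values are placeholders, not model output.

```python
import pandas as pd

# Placeholder table in the assumed long format (values are NOT real predictions)
toy_predictions = pd.DataFrame({
    "gene_id": ["ENSG00000130203"] * 4,
    "tissue_name": ["liver", "brain cortex", "lung", "whole blood"],
    "tissue_id": [0, 1, 2, 3],
    "predicted_expression": [9.1, 8.4, 6.2, 3.0],
})

def rank_tissues(predictions, gene_id, top_n=3):
    """Return the top_n tissues for a gene, ordered by predicted expression."""
    rows = predictions[predictions["gene_id"] == gene_id]
    ranked = rows.sort_values("predicted_expression", ascending=False)
    return ranked.head(top_n)[["tissue_name", "predicted_expression"]].reset_index(drop=True)

top = rank_tissues(toy_predictions, "ENSG00000130203")
print(top)

# Per-gene summary of the predicted range across tissues
summary = toy_predictions.groupby("gene_id")["predicted_expression"].agg(["min", "median", "max"])
print(summary)
```

The same two calls work unchanged on a larger table covering many genes, as long as the column names match.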
Interactive Expression Visualization
Understanding Individual-Level Expression Predictions
The anatomogram displays predicted gene expression levels across human tissues conditioned on the individual's variant profile. This visualization shows how variant-specific regulatory effects manifest across different tissue contexts.
How to Interpret the Results:
Color Intensity
- Warmer colors (red/yellow) = Higher predicted expression
- Cooler colors (blue/purple) = Lower predicted expression
- Gray = No data or very low expression
- Values represent log2-transformed RNA abundance predictions
Interactive Features
- Hover over tissues to see expression values and tissue annotations
- Click to view UBERON tissue ontology details
- Switch tabs for male, female, and brain-specific anatomical views
- Enhanced tooltips provide tissue system classifications
Scientific Interpretation
- Expression levels reflect the combined regulatory effects of all variants affecting the gene
- Tissue-specific patterns reveal differential regulatory logic across cell types
- Cross-tissue comparison identifies genes with ubiquitous vs. restricted expression
- Results can be compared to GTEx population distributions to assess how unusual an individual's predicted expression is
Note: These are model predictions conditioned on the individual's variant genotype. They represent the model's learned mapping from variant profiles to expression phenotypes based on GTEx training data.
# EnhancedVCFExpressionConverter and AnatomagramMultiViewWidget are part of the
# VariantFormer anatomogram visualization components (import path not shown here)
# Initialize converter for anatomogram visualization
enhanced_converter = EnhancedVCFExpressionConverter(
    aggregation_strategy='mean'
)
# Convert predictions to anatomogram format
anatomagram_data, enhanced_metadata = enhanced_converter.convert_predictions_to_anatomogram(
    expression_predictions,
    gene_name='APOE'
)
# Create interactive multi-view widget
multi_widget = AnatomagramMultiViewWidget(
    visualization_data=anatomagram_data,
    selected_item='APOE',
    available_views=["male", "female", "brain"],
    color_palette='viridis',
    scale_type='linear',
    uberon_names=enhanced_metadata['uberon_names'],
    enhanced_tooltips=enhanced_metadata['enhanced_tooltips']
)
# Display widget (works in Jupyter, Marimo, or Streamlit)
multi_widget
Genome-Wide Expression Analysis
Tissue Hierarchical Clustering Dendrogram
Understanding Tissue Relationships
The dendrogram below shows hierarchical clustering of tissues based on their gene expression correlation patterns. Tissues with similar predicted expression profiles cluster together, revealing biological relationships across organ systems.
Interpretation:
- Vertical distance: Height indicates dissimilarity between clusters (larger = more different)
- Branch structure: Tissues that merge at lower heights have more similar expression profiles
- Biological insight: Clustering often groups tissues by developmental origin or physiological function
Expected patterns:
- Brain regions cluster together (shared neural regulatory programs)
- Metabolically active tissues group by function (liver, muscle, adipose)
- Hormone-responsive tissues may cluster (breast, ovary, prostate)
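The clustering behind such a dendrogram can be sketched with SciPy on synthetic data. In the example below, two "brain-like" and two "liver-like" profiles are generated so that related tissues share a base signal. The tissue names, the noise level, and the choice of average linkage are assumptions made for illustration and may differ from what the tutorial's dendrogram uses.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n_genes = 200

# Synthetic profiles: tissues of the same type share a base signal plus small noise
base_brain = rng.normal(size=n_genes)
base_liver = rng.normal(size=n_genes)
tissue_names = ["brain_1", "brain_2", "liver_1", "liver_2"]
profiles = np.vstack([
    base_brain + 0.1 * rng.normal(size=n_genes),
    base_brain + 0.1 * rng.normal(size=n_genes),
    base_liver + 0.1 * rng.normal(size=n_genes),
    base_liver + 0.1 * rng.normal(size=n_genes),
])

# Correlation distance is 1 - Pearson r between tissue profiles
distances = pdist(profiles, metric="correlation")
tree = linkage(distances, method="average")

# no_plot=True returns the layout (including leaf order) without drawing anything
layout = dendrogram(tree, labels=tissue_names, no_plot=True)
print(layout["ivl"])
```

The third column of `tree` holds merge heights: the two within-type merges happen at small heights, and the final brain-versus-liver merge happens much higher, which is the pattern described in the interpretation notes above.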
Interactive Heatmap Visualization
This section creates an interactive clustered heatmap showing expression patterns across ~18,000 genes and 63 tissues. The heatmap uses hierarchical clustering to reveal biological relationships between genes and tissues.
Key Features:
- Full Matrix: ~18k genes × 63 tissues (~1.1M cells)
- WebGL Rendering: Efficient pan/zoom for large matrices
- Hierarchical Clustering: Correlation distance + Ward linkage
- Transformations: log1p, z-score normalization, outlier clipping
- Drill-Down: Function to explore specific gene subsets
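The transformations listed above can be sketched in a few lines of NumPy. This version assumes a non-negative genes × tissues matrix, z-scores each gene (row) across tissues, and clips at ±3. The axis, the clip threshold, and the function name are assumptions, since the tutorial does not specify them.

```python
import numpy as np

def prepare_heatmap_matrix(values, clip=3.0):
    """log1p-transform, z-score each row, then clip extreme values."""
    logged = np.log1p(values)                      # compress large values
    mean = logged.mean(axis=1, keepdims=True)
    std = logged.std(axis=1, keepdims=True)
    std[std == 0] = 1.0                            # constant rows become all zeros
    z = (logged - mean) / std                      # per-gene z-score across tissues
    return np.clip(z, -clip, clip)                 # limit the influence of outliers

# Synthetic non-negative matrix: 5 genes x 8 tissues (not real predictions)
rng = np.random.default_rng(1)
toy = rng.gamma(shape=2.0, scale=5.0, size=(5, 8))
toy[0, :] = 4.0                                    # a gene with identical values everywhere

heat = prepare_heatmap_matrix(toy)
print(heat.shape, float(np.abs(heat).max()))
```

Z-scoring per gene puts highly and lowly expressed genes on a shared color scale, so the heatmap highlights where each gene is relatively high or low rather than its absolute level.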
How to Interpret:
- Color intensity shows relative expression (z-scored)
- Clustered ordering groups similar genes/tissues together
- Use pan/zoom to explore regions of interest
- Hover for detailed gene/tissue/value information
💡 Tip: Use pan/zoom to explore. Plotly automatically uses WebGL for efficient rendering of large matrices.
Understanding Model Predictions
Scientific Interpretation of Expression Outputs
Expression predictions represent the model's learned associations between variant profiles and tissue-specific gene regulation. Key interpretation principles:
What the Values Represent
- Expression values are log2-scale RNA abundance predictions trained on GTEx data
- Higher values indicate stronger predicted transcriptional activity in that tissue context
- Predictions are individual-level (conditioned on variant genotype), not population averages
- Values should be interpreted relative to the gene's typical expression range across tissues
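Two small helpers illustrate how log2-scale values can be read. The first converts a difference between two log2 values into a fold change; the second places a value within a gene's range across tissues. Both assume a plain log2 scale. If the underlying transform includes a pseudocount, the fold change is only approximate. The function names are hypothetical.

```python
def log2_to_fold_change(log2_a, log2_b):
    """Fold change of A over B, given two log2-scale values."""
    return 2.0 ** (log2_a - log2_b)

def relative_position(value, tissue_values):
    """Position of `value` within the min-max range of a gene's values (0 to 1)."""
    low, high = min(tissue_values), max(tissue_values)
    if high == low:
        return 0.5  # no spread across tissues; return the midpoint by convention
    return (value - low) / (high - low)

# Placeholder values for one gene across three tissues (not model output)
gene_values = [3.0, 6.0, 9.0]

fold = log2_to_fold_change(9.0, 6.0)
position = relative_position(6.0, gene_values)
print(fold, position)
```

A difference of 3 on the log2 scale corresponds to an 8-fold difference in abundance, which is why small-looking gaps between tissues can be biologically large.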
Scientific Context
VariantFormer learns variant-to-expression mappings from examples:
- Regulatory variants can increase or decrease expression depending on their effect on TF binding, chromatin accessibility, or RNA processing
- Tissue specificity arises from tissue-specific enhancers, epigenetic states, and regulatory factor expression
- Compound effects: Multiple variants across a gene's regulatory landscape contribute additively or non-additively
- Model validation: Predictions correlate with eQTL effect sizes in held-out populations
Next Steps
Research Applications:
- Multi-gene analysis - Analyze co-regulated gene sets or pathways
- Cohort comparison - Compare predictions across individuals with different variant profiles
- eQTL validation - Compare predictions to experimental eQTL effect estimates
- Downstream modeling - Use predictions as input for disease risk models or functional scoring
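For multi-gene analysis, the query table used earlier can be extended to one row per gene. The sketch below builds such a table with pandas. It assumes that `create_data` accepts multiple rows in the same `gene_id` / `tissues` layout, which this tutorial does not confirm. The helper name and the tissue labels are placeholders, and real tissue names should come from `get_tissues()`.

```python
import pandas as pd

def build_query(gene_ids, tissues):
    """Build a query table with one row per gene and a shared comma-separated tissue list."""
    gene_ids = list(gene_ids)
    tissue_field = ",".join(tissues)
    return pd.DataFrame({
        "gene_id": gene_ids,
        "tissues": [tissue_field] * len(gene_ids),  # repeat so column lengths match
    })

# APOE and APP Ensembl IDs; tissue labels are placeholders
multi_gene_query = build_query(
    ["ENSG00000130203", "ENSG00000142192"],
    ["liver", "lung"],
)
print(multi_gene_query)
```

Repeating the tissue string per row keeps the two columns the same length, which a single-element list would not do once more than one gene is queried.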
Acknowledgments & Citations
Anatomogram Visualizations
The anatomical diagrams used in this tutorial are derived from the Expression Atlas project and are licensed under Creative Commons Attribution 4.0 International License.
Citation: Moreno P, Fexova S, George N, et al. Expression Atlas update: gene and protein expression in multiple species. Nucleic Acids Research. 2022;50(D1):D129-D140. doi:10.1093/nar/gkab1030
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Source: Expression Atlas, EMBL-EBI
The anatomogram SVG assets have been integrated into the VariantFormer visualization framework to provide interactive tissue-specific expression mapping.
VariantFormer Model
Citation: Sayan Ghosal, Youssef Barhomi, Tejaswini Ganapathi, Amy Krystosik, Lakshmi Krishnan, Sashidhar Guntury, Donghui Li, Alzheimer's Disease Neuroimaging Initiative, Francesco Paolo Casale, and Theofanis Karaletsos. VariantFormer: A hierarchical transformer integrating DNA sequences with genetic variation and regulatory landscapes for personalized gene expression prediction (2025). bioRxiv. doi:10.1101/2025.10.31.685862
Training Data: GTEx v10 paired whole-genome sequencing and RNA-seq data
Additional Resources
- VariantFormer GitHub Repository
- GTEx Portal - Population expression data
- gnomAD - Population variant frequencies
Responsible Use Statement
This tool is provided exclusively for research and educational purposes. Important considerations:
Research Tool Disclaimer
- VariantFormer is a research model, not a clinical diagnostic tool
- Predictions should not be used for medical decision-making without appropriate validation
- This tool does not provide medical advice, diagnosis, or treatment recommendations
- Consult qualified healthcare professionals for any health-related questions
Scientific Limitations
- Predictions are based on GTEx training data and may not generalize to all populations
- Rare or novel variants may have uncertain predicted effects
- Model does not account for environmental factors, epigenetic variation, or post-transcriptional regulation
- Expression predictions are probabilistic and should be validated experimentally when possible
Data Privacy
- VCF data is processed locally and is not uploaded to external servers
- Users are responsible for ensuring compliance with relevant data governance policies
- Handle genetic data according to institutional IRB protocols and privacy regulations
Acceptable Use
Please follow the CZI Acceptable Use Policy when using this tool. This tool is intended for:
- Academic research and genomics education
- Exploratory analysis of variant-to-expression relationships
- Hypothesis generation for experimental validation
Not intended for:
- Clinical diagnosis or treatment decisions
- Direct-to-consumer genetic interpretation
- Insurance or employment decisions
Thank you for using VCF2Expression with VariantFormer. We hope this research tool advances scientific understanding of variant regulatory effects and gene expression biology.