Quickstart: VariantFormer
VCF2Risk: Alzheimer's Disease Risk Prediction
Estimated time to complete: ~10 minutes
Learning Goals
- Learn how to predict tissue-specific Alzheimer's disease risk from genetic variants
- Understand the VCF2Risk pipeline: variants → expression → embeddings → disease risk
- Explore gene-specific AD risk patterns across different tissues using interactive visualizations
- Interpret AD risk scores and expression predictions in biological context
Prerequisites
For this browser quickstart: No setup required - fully interactive playground experience.
To run on your own compute:
- Hardware: GPU with 40GB+ VRAM (NVIDIA H100 recommended)
- Processing time: ~3-4 minutes for 45 tissues
- Input Data: VCF file with genetic variants (GRCh38 reference genome)
- Model: Pre-trained VariantFormer checkpoint (14GB) + AD risk predictors
- Software: VariantFormer package with AD risk prediction components
Setup
To run VCF2Risk on your own compute, you'll need to set up the environment and install dependencies.
Complete setup instructions: VariantFormer GitHub Setup Guide
This includes:
- Installing the VariantFormer package and dependencies
- Downloading model checkpoints and AD risk predictors from the public S3 bucket
- Setting up reference genome files (GRCh38)
The browser playground below doesn't require any setup.
Introduction
VCF2Risk predicts how genetic variants in a specific gene contribute to Alzheimer's disease risk across different tissues.
Model Architecture
The pipeline combines two AI components:
1. VariantFormer Model (Seq2Gene + Seq2Reg transformers):
- Input: DNA sequence with variants from VCF file
- Output: Tissue-specific gene expression predictions + 1536-dimensional embeddings
- Purpose: Captures how genetic variants affect gene regulation in each tissue
- Size: 14GB checkpoint, ~1.2B parameters
2. AD Risk Predictors (Gradient-boosted decision trees):
- Input: Gene-tissue embeddings from the VariantFormer model
- Output: Alzheimer's disease risk probability (0-1 scale)
- Training: Separate models for each gene-tissue pair (~16,400 genes × 45 tissues)
- Format: Treelite .tlmodel files stored in S3
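To make the second component concrete, here is a minimal sketch of a gradient-boosted classifier that maps a 1536-dimensional embedding to a probability between 0 and 1. It is only an illustration: it uses scikit-learn's GradientBoostingClassifier as a stand-in (the shipped predictors are Treelite models, and their training code is not shown here), and the embeddings and labels are random synthetic data, not VariantFormer outputs.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
EMBEDDING_DIM = 1536  # embedding width stated in the architecture description

# Synthetic stand-in data (assumption): random "embeddings" with a label
# that depends weakly on the first dimension, so there is signal to learn.
n_samples = 120
embeddings = rng.normal(size=(n_samples, EMBEDDING_DIM))
labels = (embeddings[:, 0] + 0.5 * rng.normal(size=n_samples) > 0).astype(int)

# One small boosted-tree model, standing in for a single gene-tissue predictor.
model = GradientBoostingClassifier(n_estimators=20, max_depth=2, random_state=0)
model.fit(embeddings, labels)

# Score a new embedding: column 1 of predict_proba is the positive class.
new_embedding = rng.normal(size=(1, EMBEDDING_DIM))
risk = model.predict_proba(new_embedding)[0, 1]
print(f"Predicted risk: {risk:.3f}")
```

In the real pipeline there is one such model per gene-tissue pair, which is why the predictor for the selected gene and each tissue is loaded separately.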
Pipeline Flow
VCF Variants → VariantFormer Model → [Expression + Embedding] → AD Predictor → Risk Score
                                          ↑ intermediate                       ↑ primary output
Input Data Requirements
VCF File:
- Standard VCF format (v4.2 or later)
- Reference genome: GRCh38/hg38 (critical - must match training data)
- Can be bgzipped (.vcf.gz) or uncompressed
- Must contain variants for the selected gene region
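Before running the pipeline, it can help to sanity-check the VCF header against the requirements above. The sketch below uses only the Python standard library; the helper names (read_vcf_header, check_vcf_header) are hypothetical and not part of the VariantFormer package. The GRCh38 check is a heuristic, because the VCF specification does not require the header to name the reference build.

```python
import gzip
import os
import tempfile


def read_vcf_header(path):
    """Return the '##' meta-information lines of a plain or gzip/bgzip VCF."""
    opener = gzip.open if path.endswith(".gz") else open
    header = []
    with opener(path, "rt") as handle:
        for line in handle:
            if not line.startswith("##"):
                break  # reached the #CHROM line or the first record
            header.append(line.rstrip("\n"))
    return header


def check_vcf_header(header):
    """Heuristic checks: VCF version >= 4.2 and any mention of GRCh38/hg38."""
    fileformat = next((h for h in header if h.startswith("##fileformat=VCFv")), None)
    version = None
    if fileformat is not None:
        parts = fileformat.split("VCFv", 1)[1].split(".")
        version = tuple(int(p) for p in parts[:2])
    mentions_grch38 = any(
        tag in h.lower() for h in header for tag in ("grch38", "hg38")
    )
    return {
        "version": version,
        "version_ok": version is not None and version >= (4, 2),
        "mentions_grch38": mentions_grch38,
    }


# Tiny illustrative VCF written to a temp file so the sketch runs end to end.
demo_lines = [
    "##fileformat=VCFv4.2",
    "##reference=GRCh38",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr19\t44908684\t.\tT\tC\t.\tPASS\t.",
]
demo_path = os.path.join(tempfile.mkdtemp(), "demo.vcf.gz")
with gzip.open(demo_path, "wt") as handle:
    handle.write("\n".join(demo_lines) + "\n")

report = check_vcf_header(read_vcf_header(demo_path))
print(report)
```

A header that fails the GRCh38 check is not necessarily wrong, but it is worth confirming the build by other means before trusting the predictions.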
Gene Selection:
- Choose one gene per analysis
- Only genes with trained AD predictors available (~16,400 genes)
- Dropdown auto-filters to available genes
Tissue Selection:
- 45 out of 63 GTEx tissues have AD risk models
- Can analyze all tissues or focus on specific organ systems
- Default: All 45 tissues for comprehensive analysis
Expected Outputs
For each gene-tissue combination:
- Predicted Expression (intermediate output):
  - How variants alter gene expression in that tissue
  - Log-scale expression values
  - Provides biological context for risk scores
- AD Risk Score (primary output):
  - Probability (0-1) that gene contributes to AD in this tissue
  - Trained from AD case-control gene expression datasets
  - Higher scores = greater predicted disease contribution
  - Tissue-specific: same gene can have different risk across tissues
The playground below uses precomputed predictions so you can explore the outputs.
Select Gene for Analysis
Choose one gene to analyze for AD risk contribution. The dropdown shows only genes that have trained AD risk predictors available.
Recommended genes for Alzheimer's disease analysis:
- APOE (Apolipoprotein E): Strongest genetic risk factor for late-onset AD
- APP (Amyloid Precursor Protein): Mutations cause early-onset familial AD
- PSEN1 (Presenilin 1): Familial AD gene, affects amyloid processing
- PSEN2 (Presenilin 2): Another familial AD gene
- MAPT (Microtubule Associated Protein Tau): Associated with tauopathies
- TREM2 (Triggering Receptor on Myeloid Cells 2): Immune gene linked to AD
Why gene-specific predictors?
Each AD risk predictor is trained for a specific gene-tissue combination, learning how that gene's regulatory patterns (captured in the embedding) relate to AD pathology in that particular tissue context.
import pandas as pd
from processors import ad_risk  # AD risk prediction engine

# Initialize VariantFormer AD risk predictor
adrisk = ad_risk.ADriskFromVCF()

# VCF path (GRCh38 reference genome required)
vcf_path = 'path/to/your_sample.vcf.gz'

# Get available genes with AD predictors
genes_df = adrisk.genes_map.reset_index()
available_ad_genes = adrisk.ad_preds.get_unique('gene_id')
genes_with_ad = genes_df[genes_df['gene_id'].isin(available_ad_genes)]

# Select gene for analysis
selected_gene_id = 'ENSG00000130203'  # APOE example
print(f"{len(genes_with_ad)} genes with AD predictors available")

Select Tissues for Analysis
Choose which tissues to analyze for AD risk. By default, all 45 tissues with trained AD risk predictors are selected for comprehensive analysis.
Tissue Coverage:
- 45 out of 63 GTEx tissues have AD risk models trained
- Includes major organ systems: nervous, cardiovascular, digestive, respiratory, etc.
- 13 brain regions available for CNS-focused analysis
Analysis Strategies:
- Comprehensive (default): All 45 tissues to see complete risk landscape
- Brain-focused: Select only CNS tissues for neurological analysis
- Comparative: Choose a few key tissues for targeted comparison
- System-specific: Focus on one organ system (e.g., cardiovascular)
# Get tissues with AD predictors
available_tissue_ids = adrisk.ad_preds.get_unique('tissue_id')
tissue_ids = list(available_tissue_ids)  # All 45 tissues
print(f"{len(tissue_ids)} tissues with AD predictors")

Run AD Risk Predictions
The prediction pipeline executes: VCF parsing → VariantFormer inference → AD risk computation from embeddings.
# Run AD risk prediction pipeline
# Steps:
# 1. Load VCF variants for the selected gene region
# 2. Predict gene expression across tissues using VariantFormer
# 3. Generate 1536-dim embeddings (regulatory state representations)
# 4. Download AD predictors from S3 (one model per tissue)
# 5. Compute AD risk scores for each gene-tissue combination
# The gene list repeats the selected gene once per tissue, so the two lists pair up.
predictions_df = adrisk(vcf_path, [selected_gene_id] * len(tissue_ids), tissue_ids)

# Results DataFrame with columns:
# - gene_name, gene_id, tissue_name, tissue_id
# - predicted_expression (intermediate output)
# - ad_risk (primary output, 0-1 scale)
print(f"Predictions: {predictions_df.shape}")
print(f"Mean AD risk: {predictions_df['ad_risk'].mean():.4f}")

Understanding the Results
Each row represents predictions for one tissue. The table shows both intermediate and final outputs from the pipeline.
How to interpret:
- AD Risk Score (0-1):
  - 0.0: Low predicted risk
  - 1.0: High predicted risk
- Predicted Expression (context):
  - Shows whether variants increase or decrease gene activity
  - Helps explain why risk might be high (e.g., overexpression of risk gene)
  - Intermediate output from VariantFormer model
- Tissue Specificity:
  - Same gene can have different risk scores across tissues
  - Reflects tissue-specific biology and disease mechanisms
  - Brain tissues might show distinct patterns for neurological disease genes
Visualize Risk Distribution
This bar chart displays AD risk scores for all analyzed tissues, sorted and color-coded by risk level.
import plotly.express as px

# Sort tissues from highest to lowest risk so the bars appear in order
sorted_df = predictions_df.sort_values('ad_risk', ascending=False)

fig = px.bar(
    sorted_df,
    x='tissue_name',
    y='ad_risk',
    title=f'AD Risk: {sorted_df.iloc[0]["gene_name"]} across Tissues',
    color='ad_risk',
    color_continuous_scale='viridis',
    labels={'ad_risk': 'AD Risk Score', 'tissue_name': 'Tissue'}
)
fig.update_xaxes(tickangle=45)
fig.update_layout(height=500)
fig.show()

What to look for:
- High-risk tissues: Brighter colors (yellow), taller bars
- Tissue patterns: Do certain organ systems cluster together in risk?
- Outliers: Tissues with unusually high or low risk compared to others
- Brain regions: AD genes may show elevated risk in CNS tissues
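Outliers can also be flagged numerically instead of by eye. The sketch below computes a z-score for each tissue's risk and keeps tissues beyond a cutoff. The tissue names, risk values, and the 1.5 cutoff are made up for illustration; only the column names (tissue_name, ad_risk) come from the results table described earlier.

```python
import pandas as pd

# Made-up example values; real values come from predictions_df.
toy_df = pd.DataFrame({
    "tissue_name": ["Liver", "Lung", "Spleen", "Pancreas", "Whole Blood", "Brain - Cortex"],
    "ad_risk": [0.30, 0.32, 0.35, 0.31, 0.33, 0.80],
})

# z-score: how many sample standard deviations a tissue sits from the mean risk.
mean_risk = toy_df["ad_risk"].mean()
std_risk = toy_df["ad_risk"].std()  # pandas uses the sample std (ddof=1)
toy_df["z_score"] = (toy_df["ad_risk"] - mean_risk) / std_risk

# Illustrative cutoff (assumption): with few tissues, z-scores cannot get large,
# so a modest threshold is used here.
Z_CUTOFF = 1.5
outliers = toy_df[toy_df["z_score"].abs() > Z_CUTOFF]
print(outliers[["tissue_name", "ad_risk", "z_score"]])
```

With 45 tissues a stricter cutoff may be more appropriate; the right value depends on how spread out the scores are for the gene in question.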
Anatomical Risk Mapping
The anatomogram displays AD risk scores spatially mapped onto human body diagrams, providing intuitive visualization of tissue-specific disease risk patterns.
# Assumes EnhancedVCFRiskConverter and AnatomagramMultiViewWidget are already
# imported; this quickstart does not show their module paths.

# Convert to anatomogram format
enhanced_converter = EnhancedVCFRiskConverter(aggregation_strategy='mean')
anatomagram_data, enhanced_metadata = (
    enhanced_converter.convert_predictions_to_anatomagram(predictions_df)
)

# Create multi-view widget
multi_widget = AnatomagramMultiViewWidget(
    visualization_data=anatomagram_data,
    selected_item="AD_RISK",
    available_views=["male", "female", "brain"],
    color_palette="viridis",
    scale_type="linear",
    uberon_names=enhanced_metadata['uberon_names'],
    enhanced_tooltips=enhanced_metadata['enhanced_tooltips']
)

# Display widget (works in Jupyter, Marimo, or Streamlit)
multi_widget

Features:
- Three anatomical views: Male, female, and brain-focused anatomies
- Color-coded risk levels: Viridis palette (purple = low risk, yellow = high risk)
- Interactive tooltips: Hover over colored regions for detailed information
- Hierarchical mapping: Related tissues are aggregated into shared anatomical structures
How to use:
- Switch between tabs to see different anatomical perspectives
- Hover over tissues to see exact risk values and tissue names
- Compare patterns across different body systems visually
- Identify risk hotspots where disease contribution is concentrated
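The hierarchical mapping listed under Features can be pictured as a group-and-average step: several tissues that belong to one anatomical structure are collapsed into a single value. The sketch below shows that idea with pandas, assuming mean aggregation as in the converter call above. The tissue-to-structure mapping and the risk values are hypothetical, and the converter's actual internals are not shown in this quickstart.

```python
import pandas as pd

# Hypothetical mapping from tissue names to drawable anatomical structures.
tissue_to_structure = {
    "Brain - Cortex": "brain",
    "Brain - Hippocampus": "brain",
    "Heart - Left Ventricle": "heart",
    "Heart - Atrial Appendage": "heart",
    "Liver": "liver",
}

# Made-up risk scores, one per tissue.
toy_df = pd.DataFrame({
    "tissue_name": list(tissue_to_structure),
    "ad_risk": [0.70, 0.60, 0.30, 0.40, 0.25],
})

# Attach the structure label, then average risk within each structure.
toy_df["structure"] = toy_df["tissue_name"].map(tissue_to_structure)
structure_risk = toy_df.groupby("structure")["ad_risk"].mean()
print(structure_risk)
```

Averaging hides within-structure differences, so the per-tissue table remains the place to look when two regions of the same organ disagree.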
Summary Statistics
View the tissues with highest and lowest predicted AD risk for your selected gene.
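A minimal sketch of this summary, assuming a results table with the tissue_name and ad_risk columns described earlier. The values below are made up so the example runs on its own; in practice the same two calls would be applied to predictions_df.

```python
import pandas as pd

# Made-up example values standing in for predictions_df.
toy_predictions = pd.DataFrame({
    "tissue_name": ["Brain - Cortex", "Liver", "Lung", "Whole Blood", "Spleen"],
    "ad_risk": [0.72, 0.28, 0.35, 0.41, 0.19],
})

N = 2  # how many tissues to show at each end

# nlargest/nsmallest return rows already ordered by the chosen column.
top_tissues = toy_predictions.nlargest(N, "ad_risk")[["tissue_name", "ad_risk"]]
bottom_tissues = toy_predictions.nsmallest(N, "ad_risk")[["tissue_name", "ad_risk"]]

print("Highest predicted AD risk:")
print(top_tissues.to_string(index=False))
print("Lowest predicted AD risk:")
print(bottom_tissues.to_string(index=False))
```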
Interpreting Your Results
What do AD risk scores mean?
The risk scores (0-1) represent the predicted probability that this gene's regulatory state contributes to Alzheimer's disease in each tissue, based on:
- Expression patterns learned from AD case-control cohorts
- Regulatory signatures captured in gene-tissue embeddings
- Variant effects on gene expression in your VCF file
Clinical Context
These are research predictions, not clinical diagnoses. They indicate:
- Genes and tissues where variants may influence AD biology
- Tissue-specific mechanisms of genetic risk
- Hypotheses for follow-up experimental validation
Limitations
- Predictions based on population-level training data
- Individual AD risk depends on many factors beyond single genes
- Some tissues may lack sufficient AD training data
- Scores reflect correlation, not necessarily causation
- Model does not account for environmental factors, epigenetics, or post-transcriptional regulation
Next Steps
Analyze Your Own Data
To run VCF2Risk on your own genetic data:
- Prepare VCF file: Ensure it uses GRCh38 reference genome
- Update VCF path: Edit the vcf_path variable in example notebooks
- Select gene: Choose gene(s) of interest from available genes with AD predictors
- Select tissues: Choose relevant tissues for your research question
- Export results: Save predictions with predictions_df.to_csv('my_results.csv')
Further Exploration
Comparative analysis:
- Run analysis multiple times with different AD-associated genes (APOE, APP, PSEN1, etc.)
- Compare risk patterns across genes to identify common vs. gene-specific tissue effects
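One way to compare genes is to stack the results of separate runs and pivot them into a tissue-by-gene matrix. The sketch below assumes each run yields a table with gene_name, tissue_name, and ad_risk columns, as described for predictions_df. The two tables and their values are made up for illustration.

```python
import pandas as pd

# Made-up results standing in for two separate runs (one gene each).
tissues = ["Brain - Cortex", "Liver", "Whole Blood"]
apoe_df = pd.DataFrame({"gene_name": "APOE", "tissue_name": tissues,
                        "ad_risk": [0.70, 0.30, 0.40]})
app_df = pd.DataFrame({"gene_name": "APP", "tissue_name": tissues,
                       "ad_risk": [0.60, 0.35, 0.20]})

# Stack the runs, then reshape to one row per tissue and one column per gene.
combined = pd.concat([apoe_df, app_df], ignore_index=True)
risk_matrix = combined.pivot(index="tissue_name", columns="gene_name", values="ad_risk")

# Spread across genes: small values suggest a shared tissue effect,
# large values suggest a gene-specific one.
gene_columns = list(risk_matrix.columns)
risk_matrix["spread"] = (
    risk_matrix[gene_columns].max(axis=1) - risk_matrix[gene_columns].min(axis=1)
)
print(risk_matrix)
```

Sorting risk_matrix by the spread column is a quick way to surface the tissues where the genes disagree most.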
Focused analysis:
- Select only brain tissues for CNS-specific AD mechanisms
- Focus on peripheral tissues to explore systemic disease contributions
References
Anatomogram Visualizations
The anatomical diagrams used in this quickstart are derived from the Expression Atlas project and are licensed under Creative Commons Attribution 4.0 International License.
Citation:
Moreno P, Fexova S, George N, et al. Expression Atlas update: gene and protein expression in multiple species. Nucleic Acids Research. 2022;50(D1):D129-D140. doi:10.1093/nar/gkab1030
Source: Expression Atlas, EMBL-EBI
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
VariantFormer Model
Citation:
Ghosal S, et al. VariantFormer: A hierarchical transformer integrating DNA sequences with genetic variation and regulatory landscapes for personalized gene expression prediction. bioRxiv. 2025;2025.10.31.685862. doi:10.1101/2025.10.31.685862
Training Data
- GTEx v8: Tissue-specific gene expression reference data
- AD cohort datasets: Case-control data for risk predictor training
Additional Resources
- VariantFormer GitHub Repository
- GTEx Portal - Population gene expression data
- gnomAD - Population variant frequencies
Responsible Use
This tool is for research purposes only.
Research Tool Disclaimer
- VCF2Risk is a research model, not a clinical diagnostic tool
- Predictions should not be used for medical decision-making without appropriate validation
- This tool does not provide medical advice, diagnosis, or treatment recommendations
- Consult qualified healthcare professionals for any health-related questions
Scientific Limitations
- Predictions are based on GTEx and AD cohort training data - may not generalize to all populations
- Rare or novel variants may have uncertain predicted effects
- Model does not account for environmental factors, epigenetic variation, or post-transcriptional regulation
- AD risk scores are probabilistic and should be validated experimentally when possible
Data Privacy
- VCF data is processed locally and not uploaded to external servers
- Users are responsible for ensuring compliance with relevant data governance policies
- Handle genetic data according to institutional IRB protocols and privacy regulations
Acceptable Use
Follow the Acceptable Use Policy.
This tool is intended for:
- Academic research and genomics education
- Exploratory analysis of variant-to-disease relationships
- Hypothesis generation for experimental validation
- Understanding tissue-specific AD mechanisms
Not intended for:
- Clinical diagnosis or treatment decisions
- Direct-to-consumer genetic interpretation
- Medical advice or health recommendations