Quickstart: VariantFormer

VCF2Risk: Alzheimer's Disease Risk Prediction

Estimated time to complete: ~10 minutes

Learning Goals

Learn how to predict tissue-specific Alzheimer's disease risk from genetic variants
Understand the VCF2Risk pipeline: variants → expression → embeddings → disease risk
Explore gene-specific AD risk patterns across different tissues using interactive visualizations
Interpret AD risk scores and expression predictions in biological context

Prerequisites

For this browser quickstart: No setup required - fully interactive playground experience.

To run on your own compute:

Hardware: GPU with 40GB+ VRAM (NVIDIA H100 recommended)
Processing time: ~3-4 minutes for 45 tissues
Input Data: VCF file with genetic variants (GRCh38 reference genome)
Model: Pre-trained VariantFormer checkpoint (14GB) + AD risk predictors
Software: VariantFormer package with AD risk prediction components

About This Quickstart

This is an interactive playground experience that runs entirely in your browser. The Marimo blocks below use precomputed predictions for demonstration purposes.

To run VCF2Risk on your own data with your own compute, use these notebooks:

Jupyter Notebook: vcf2risk.ipynb
Marimo Notebook: vcf2risk.py

The code examples below show you how to run real inference. The playground demonstrates results interactively.

Note: The playground shows 100 sample genes for demonstration. The full VariantFormer package supports ~16,400 genes with AD risk predictors.

Setup

To run VCF2Risk on your own compute, you'll need to set up the environment and install dependencies.

Complete setup instructions: VariantFormer GitHub Setup Guide

This includes:

Installing the VariantFormer package and dependencies
Downloading model checkpoints and AD risk predictors from public S3 bucket
Setting up reference genome files (GRCh38)

The browser playground below doesn't require any setup.

Introduction

VCF2Risk predicts how genetic variants in a specific gene contribute to Alzheimer's disease risk across different tissues.

Model Architecture

The pipeline combines two AI components:

1. VariantFormer Model (Seq2Gene + Seq2Reg transformers):

Input: DNA sequence with variants from VCF file
Output: Tissue-specific gene expression predictions + 1536-dimensional embeddings
Purpose: Captures how genetic variants affect gene regulation in each tissue
Size: 14GB checkpoint, ~1.2B parameters

2. AD Risk Predictors (Gradient-boosted decision trees):

Input: Gene-tissue embeddings from the VariantFormer model
Output: Alzheimer's disease risk probability (0-1 scale)
Training: Separate models for each gene-tissue pair (~16,400 genes × 45 tissues)
Format: Treelite .tl model files stored in S3

Pipeline Flow

VCF Variants → VariantFormer Model → [Expression + Embedding] → AD Predictor → Risk Score
                                      ↑ intermediate            ↑ primary output

Input Data Requirements

VCF File:

Standard VCF format (v4.2 or later)
Reference genome: GRCh38/hg38 (critical - must match training data)
Can be bgzipped (.vcf.gz) or uncompressed
Must contain variants for the selected gene region
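
The VCF requirements above can be partly checked before running inference. The following is a minimal sketch, not part of the VariantFormer package: the helper names `open_vcf` and `read_vcf_header` are hypothetical, and the demo file is an illustrative record written to a temporary directory. It reads only the header lines and reports the declared VCF version, the contig IDs, and whether the header mentions GRCh38 or hg38 anywhere.

```python
import gzip
import os
import tempfile


def open_vcf(path):
    """Open a plain-text or gzip/bgzip-compressed VCF in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def read_vcf_header(path):
    """Collect the VCF version, contig IDs, and any GRCh38/hg38 mention."""
    info = {"fileformat": None, "contigs": [], "mentions_grch38": False}
    with open_vcf(path) as handle:
        for line in handle:
            if not line.startswith("##"):
                break  # meta-information lines are over
            line = line.strip()
            if line.startswith("##fileformat="):
                info["fileformat"] = line.split("=", 1)[1]
            elif line.startswith("##contig=<") and line.endswith(">"):
                for field in line[len("##contig=<"):-1].split(","):
                    if field.startswith("ID="):
                        info["contigs"].append(field[3:])
            if "grch38" in line.lower() or "hg38" in line.lower():
                info["mentions_grch38"] = True
    return info


# Illustrative file written to a temporary directory (not real sample data).
demo_text = (
    "##fileformat=VCFv4.2\n"
    "##reference=GRCh38\n"
    "##contig=<ID=chr19,length=58617616>\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr19\t44908684\t.\tT\tC\t.\tPASS\t.\n"
)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.vcf.gz")
    with gzip.open(demo_path, "wt") as out:
        out.write(demo_text)
    header = read_vcf_header(demo_path)

print(header)
```

A header that never mentions GRCh38 does not prove the file uses a different build, since many tools omit the reference line. Treat a missing mention as a prompt to confirm the build another way, not as a failure.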

Gene Selection:

Choose one gene per analysis
Only genes with trained AD predictors are available (~16,400 genes)
Dropdown auto-filters to available genes

Tissue Selection:

45 out of 63 GTEx tissues have AD risk models
Can analyze all tissues or focus on specific organ systems
Default: All 45 tissues for comprehensive analysis

Expected Outputs

For each gene-tissue combination:

Predicted Expression (intermediate output):
- How variants alter gene expression in that tissue
- Log-scale expression values
- Provides biological context for risk scores

AD Risk Score (primary output):
- Probability (0-1) that the gene contributes to AD in this tissue
- Trained from AD case-control gene expression datasets
- Higher scores = greater predicted disease contribution
- Tissue-specific: the same gene can have different risk across tissues

The playground below uses precomputed predictions so you can explore these outputs:

Select Gene for Analysis

Choose one gene to analyze for AD risk contribution. The dropdown shows only genes that have trained AD risk predictors available.

Recommended genes for Alzheimer's disease analysis:

APOE (Apolipoprotein E): Strongest genetic risk factor for late-onset AD
APP (Amyloid Precursor Protein): Mutations cause early-onset familial AD
PSEN1 (Presenilin 1): Familial AD gene, affects amyloid processing
PSEN2 (Presenilin 2): Another familial AD gene
MAPT (Microtubule Associated Protein Tau): Associated with tauopathies
TREM2 (Triggering Receptor on Myeloid Cells 2): Immune gene linked to AD

Why gene-specific predictors?

Each AD risk predictor is trained for a specific gene-tissue combination, learning how that gene's regulatory patterns (captured in the embedding) relate to AD pathology in that particular tissue context.

import pandas as pd
from processors import ad_risk  # AD risk prediction engine

# Initialize VariantFormer AD risk predictor
adrisk = ad_risk.ADriskFromVCF()

# VCF path (GRCh38 reference genome required)
vcf_path = 'path/to/your_sample.vcf.gz'

# Get available genes with AD predictors
genes_df = adrisk.genes_map.reset_index()
available_ad_genes = adrisk.ad_preds.get_unique('gene_id')
genes_with_ad = genes_df[genes_df['gene_id'].isin(available_ad_genes)]

# Select gene for analysis
selected_gene_id = 'ENSG00000130203'  # APOE example

print(f"{len(genes_with_ad)} genes with AD predictors available")
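
If you prefer to start from a gene symbol rather than an Ensembl ID, a small lookup helper can resolve it against the table of genes with AD predictors. This is a sketch under assumptions: `gene_id_for_symbol` is a hypothetical helper, and the `gene_name` column is assumed to exist in the genes table (the source only confirms `gene_id` there). A toy table stands in for `genes_with_ad`.

```python
import pandas as pd

# Toy stand-in for genes_with_ad; the 'gene_name' column is an assumption.
toy_genes = pd.DataFrame({
    "gene_id": ["ENSG00000130203", "ENSG00000142192"],
    "gene_name": ["APOE", "APP"],
})


def gene_id_for_symbol(genes, symbol):
    """Return the single gene_id matching a symbol, or raise if absent or ambiguous."""
    is_match = genes["gene_name"].str.upper() == symbol.upper()
    matches = genes.loc[is_match, "gene_id"].unique()
    if len(matches) != 1:
        raise KeyError(f"{symbol!r} matched {len(matches)} genes with AD predictors")
    return matches[0]


selected = gene_id_for_symbol(toy_genes, "apoe")
print(selected)
```

Raising on zero or multiple matches avoids silently analyzing the wrong gene when a symbol is missing from the predictor set or maps to more than one ID.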

Select Tissues for Analysis

Choose which tissues to analyze for AD risk. By default, all 45 tissues with trained AD risk predictors are selected for comprehensive analysis.

Tissue Coverage:

45 out of 63 GTEx tissues have AD risk models trained
Includes major organ systems: nervous, cardiovascular, digestive, respiratory, etc.
13 brain regions available for CNS-focused analysis

Analysis Strategies:

Comprehensive (default): All 45 tissues to see complete risk landscape
Brain-focused: Select only CNS tissues for neurological analysis
Comparative: Choose a few key tissues for targeted comparison
System-specific: Focus on one organ system (e.g., cardiovascular)

# Get tissues with AD predictors
available_tissue_ids = adrisk.ad_preds.get_unique('tissue_id')
tissue_ids = list(available_tissue_ids)  # All 45 tissues

print(f"{len(tissue_ids)} tissues with AD predictors")

Run AD Risk Predictions

The prediction pipeline executes: VCF parsing → VariantFormer inference → AD risk computation from embeddings.

# Run AD risk prediction pipeline
# Steps:
#  1. Load VCF variants for the selected gene region
#  2. Predict gene expression across tissues using VariantFormer
#  3. Generate 1536-dim embeddings (regulatory state representations)
#  4. Download AD predictors from S3 (one model per tissue)
#  5. Compute AD risk scores for each gene-tissue combination

predictions_df = adrisk(vcf_path, [selected_gene_id] * len(tissue_ids), tissue_ids)

# Results DataFrame with columns:
# - gene_name, gene_id, tissue_name, tissue_id
# - predicted_expression (intermediate output)
# - ad_risk (primary output, 0-1 scale)

print(f"Predictions: {predictions_df.shape}")
print(f"Mean AD risk: {predictions_df['ad_risk'].mean():.4f}")

Understanding the Results

Each row represents predictions for one tissue. The table shows both intermediate and final outputs from the pipeline.

How to interpret:

AD Risk Score (0-1):
- Values near 0.0: low predicted risk
- Values near 1.0: high predicted risk

Predicted Expression (context):
- Shows whether variants increase or decrease gene activity
- Helps explain why risk might be high (e.g., overexpression of a risk gene)
- Intermediate output from the VariantFormer model

Tissue Specificity:
- The same gene can have different risk scores across tissues
- Reflects tissue-specific biology and disease mechanisms
- Brain tissues might show distinct patterns for neurological disease genes
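
One way to check whether brain tissues stand apart is to group the results table by a brain/other label and compare summary statistics. The sketch below uses a toy table with made-up values in place of `predictions_df`. It assumes that brain tissue names contain the word "brain", which matches GTEx naming but should be verified against your own `tissue_name` values. `label_brain_tissues` is a hypothetical helper.

```python
import pandas as pd

# Toy stand-in for predictions_df; all values are made up for illustration.
toy_predictions = pd.DataFrame({
    "tissue_name": ["brain - cortex", "brain - hippocampus", "liver", "whole blood"],
    "predicted_expression": [5.1, 4.8, 6.3, 2.0],
    "ad_risk": [0.70, 0.60, 0.30, 0.20],
})


def label_brain_tissues(df):
    """Add a 'group' column: 'brain' if the tissue name contains 'brain', else 'other'."""
    is_brain = df["tissue_name"].str.contains("brain", case=False)
    return df.assign(group=is_brain.map({True: "brain", False: "other"}))


# Mean, maximum, and number of tissues per group.
group_summary = (
    label_brain_tissues(toy_predictions)
    .groupby("group")["ad_risk"]
    .agg(["mean", "max", "count"])
)
print(group_summary)
```

With only a handful of tissues per group, treat any difference as descriptive rather than as a statistical test.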

Visualize Risk Distribution

This bar chart displays AD risk scores for all analyzed tissues, sorted and color-coded by risk level.

import plotly.express as px

# Sort so bars run from highest to lowest risk
fig = px.bar(
    predictions_df.sort_values('ad_risk', ascending=False),
    x='tissue_name',
    y='ad_risk',
    title=f'AD Risk: {predictions_df.iloc[0]["gene_name"]} across Tissues',
    color='ad_risk',
    color_continuous_scale='viridis',
    labels={'ad_risk': 'AD Risk Score', 'tissue_name': 'Tissue'}
)
fig.update_xaxes(tickangle=45)
fig.update_layout(height=500)
fig.show()

What to look for:

High-risk tissues: Brighter colors (yellow), taller bars
Tissue patterns: Do certain organ systems cluster together in risk?
Outliers: Tissues with unusually high or low risk compared to others
Brain regions: AD genes may show elevated risk in CNS tissues

Anatomical Risk Mapping

The anatomogram displays AD risk scores spatially mapped onto human body diagrams, providing intuitive visualization of tissue-specific disease risk patterns.

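# Convert to anatomogram format
# (EnhancedVCFRiskConverter and AnatomagramMultiViewWidget are not imported in this
#  snippet; import them as shown in the vcf2risk notebooks before running it.)
enhanced_converter = EnhancedVCFRiskConverter(aggregation_strategy='mean')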
anatomagram_data, enhanced_metadata = \
    enhanced_converter.convert_predictions_to_anatomagram(predictions_df)

# Create multi-view widget
multi_widget = AnatomagramMultiViewWidget(
    visualization_data=anatomagram_data,
    selected_item="AD_RISK",
    available_views=["male", "female", "brain"],
    color_palette="viridis",
    scale_type="linear",
    uberon_names=enhanced_metadata['uberon_names'],
    enhanced_tooltips=enhanced_metadata['enhanced_tooltips']
)

# Display widget (works in Jupyter, Marimo, or Streamlit)
multi_widget

Features:

Three anatomical views: Male, female, and brain-focused anatomies
Color-coded risk levels: Viridis palette (purple = low risk, yellow = high risk)
Interactive tooltips: Hover over colored regions for detailed information
Hierarchical mapping: Related tissues intelligently aggregated to anatomical structures
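
To make the hierarchical mapping concrete, the sketch below shows what a mean aggregation looks like when several tissues roll up to one anatomical structure. It is not the converter's actual implementation: the tissue-to-structure mapping, the helper `aggregate_to_structures`, and the risk values are all hypothetical and exist only to illustrate the idea behind `aggregation_strategy='mean'`.

```python
import pandas as pd

# Hypothetical mapping from tissue names to anatomical structures.
tissue_to_structure = {
    "brain - cortex": "brain",
    "brain - hippocampus": "brain",
    "heart - left ventricle": "heart",
    "heart - atrial appendage": "heart",
    "liver": "liver",
}

# Toy risk values, made up for illustration.
toy_risk = pd.DataFrame({
    "tissue_name": list(tissue_to_structure),
    "ad_risk": [0.8, 0.6, 0.3, 0.1, 0.4],
})


def aggregate_to_structures(df, mapping, strategy="mean"):
    """Collapse tissue-level risk to one value per anatomical structure."""
    mapped = df.assign(structure=df["tissue_name"].map(mapping))
    mapped = mapped.dropna(subset=["structure"])  # drop tissues with no mapping
    return mapped.groupby("structure")["ad_risk"].agg(strategy)


structure_risk = aggregate_to_structures(toy_risk, tissue_to_structure)
print(structure_risk)
```

A mean smooths over differences between sub-regions. Passing `strategy="max"` to this sketch would instead highlight the highest-risk tissue within each structure.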

How to use:

Switch between tabs to see different anatomical perspectives
Hover over tissues to see exact risk values and tissue names
Compare patterns across different body systems visually
Identify risk hotspots where disease contribution is concentrated

Summary Statistics

View the tissues with highest and lowest predicted AD risk for your selected gene.
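
Outside the playground, the same summary can be produced from the results table with pandas. This is a minimal sketch using a toy table with made-up values; `risk_extremes` is a hypothetical helper, and the column names `tissue_name` and `ad_risk` follow the results columns listed earlier.

```python
import pandas as pd

# Toy stand-in for predictions_df; values are made up for illustration.
toy_results = pd.DataFrame({
    "tissue_name": ["brain - cortex", "liver", "lung", "whole blood", "spleen"],
    "ad_risk": [0.9, 0.4, 0.2, 0.1, 0.6],
})


def risk_extremes(df, n=2):
    """Return the n highest-risk and n lowest-risk tissues as two DataFrames."""
    columns = ["tissue_name", "ad_risk"]
    highest = df.nlargest(n, "ad_risk")[columns]
    lowest = df.nsmallest(n, "ad_risk")[columns]
    return highest, lowest


highest, lowest = risk_extremes(toy_results, n=2)
print("Highest predicted AD risk:")
print(highest.to_string(index=False))
print("Lowest predicted AD risk:")
print(lowest.to_string(index=False))
```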

Interpreting Your Results

What do AD risk scores mean?

The risk scores (0-1) represent the predicted probability that this gene's regulatory state contributes to Alzheimer's disease in each tissue, based on:

Expression patterns learned from AD case-control cohorts
Regulatory signatures captured in gene-tissue embeddings
Variant effects on gene expression in your VCF file

Clinical Context

These are research predictions, not clinical diagnoses. They indicate:

Genes and tissues where variants may influence AD biology
Tissue-specific mechanisms of genetic risk
Hypotheses for follow-up experimental validation

Limitations

Predictions based on population-level training data
Individual AD risk depends on many factors beyond single genes
Some tissues may lack sufficient AD training data
Scores reflect correlation, not necessarily causation
Model does not account for environmental factors, epigenetics, or post-transcriptional regulation

Next Steps

Analyze Your Own Data

To run VCF2Risk on your own genetic data:

Prepare VCF file: Ensure it uses GRCh38 reference genome
Update VCF path: Edit the vcf_path variable in example notebooks
Select gene: Choose gene(s) of interest from available genes with AD predictors
Select tissues: Choose relevant tissues for your research question
Export results: Save predictions with predictions_df.to_csv('my_results.csv')

Further Exploration

Comparative analysis:

Run analysis multiple times with different AD-associated genes (APOE, APP, PSEN1, etc.)
Compare risk patterns across genes to identify common vs. gene-specific tissue effects
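
A comparative run can be organized as a loop over genes whose results are pivoted into a gene-by-tissue matrix. The sketch below takes the predictor as a `predict` argument that follows the call pattern shown earlier (`adrisk(vcf_path, gene_ids, tissue_ids)`), so it can be tried without the model. `compare_genes` and `fake_predict` are hypothetical; `fake_predict` returns deterministic placeholder numbers, not real predictions, and the integer tissue IDs are an assumption.

```python
import pandas as pd


def compare_genes(predict, vcf_path, gene_ids, tissue_ids):
    """Run one prediction call per gene and return a gene x tissue risk matrix."""
    frames = [
        predict(vcf_path, [gene_id] * len(tissue_ids), tissue_ids)
        for gene_id in gene_ids
    ]
    combined = pd.concat(frames, ignore_index=True)
    return combined.pivot(index="gene_id", columns="tissue_id", values="ad_risk")


def fake_predict(vcf_path, gene_ids, tissue_ids):
    """Placeholder predictor: deterministic numbers derived from the IDs."""
    risks = [((int(g[-3:]) + t) % 10) / 10 for g, t in zip(gene_ids, tissue_ids)]
    return pd.DataFrame({"gene_id": gene_ids, "tissue_id": tissue_ids, "ad_risk": risks})


risk_matrix = compare_genes(
    fake_predict,
    "path/to/your_sample.vcf.gz",
    ["ENSG00000130203", "ENSG00000142192"],  # APOE, APP
    [0, 1, 2],  # placeholder tissue IDs
)
print(risk_matrix)
```

To use the real model, pass the `adrisk` object in place of `fake_predict`. Column means of the matrix then point to tissues that score high across genes, while row-wise differences point to gene-specific effects.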

Focused analysis:

Select only brain tissues for CNS-specific AD mechanisms
Focus on peripheral tissues to explore systemic disease contributions

References

Anatomogram Visualizations

The anatomical diagrams used in this quickstart are derived from the Expression Atlas project and are licensed under Creative Commons Attribution 4.0 International License.

Citation:

Moreno P, Fexova S, George N, et al. Expression Atlas update: gene and protein expression in multiple species. Nucleic Acids Research. 2022;50(D1):D129-D140. doi:10.1093/nar/gkab1030

Source: Expression Atlas, EMBL-EBI

License: Creative Commons Attribution 4.0 International (CC BY 4.0)

VariantFormer Model

Citation:

Ghosal, S., et al. (2025). VariantFormer: A hierarchical transformer integrating DNA sequences with genetic variation and regulatory landscapes for personalized gene expression prediction. bioRxiv 2025.10.31.685862. DOI: 10.1101/2025.10.31.685862

Training Data

GTEx v8: Tissue-specific gene expression reference data
AD cohort datasets: Case-control data for risk predictor training

Additional Resources

VariantFormer GitHub Repository
GTEx Portal - Population gene expression data
gnomAD - Population variant frequencies

Responsible Use

This tool is for research purposes only.

Research Tool Disclaimer

VCF2Risk is a research model, not a clinical diagnostic tool
Predictions should not be used for medical decision-making without appropriate validation
This tool does not provide medical advice, diagnosis, or treatment recommendations
Consult qualified healthcare professionals for any health-related questions

Scientific Limitations

Predictions are based on GTEx and AD cohort training data - may not generalize to all populations
Rare or novel variants may have uncertain predicted effects
Model does not account for environmental factors, epigenetic variation, or post-transcriptional regulation
AD risk scores are probabilistic and should be validated experimentally when possible

Data Privacy

VCF data is processed locally and not uploaded to external servers
Users are responsible for ensuring compliance with relevant data governance policies
Handle genetic data according to institutional IRB protocols and privacy regulations

Acceptable Use

Follow the Acceptable Use Policy.

This tool is intended for:

Academic research and genomics education
Exploratory analysis of variant-to-disease relationships
Hypothesis generation for experimental validation
Understanding tissue-specific AD mechanisms

Not intended for:

Clinical diagnosis or treatment decisions
Direct-to-consumer genetic interpretation
Medical advice or health recommendations
