Quickstart: VariantFormer

VCF2Risk: Alzheimer's Disease Risk Prediction

Estimated time to complete: ~10 minutes

Learning Goals

  • Learn how to predict tissue-specific Alzheimer's disease risk from genetic variants
  • Understand the VCF2Risk pipeline: variants → expression → embeddings → disease risk
  • Explore gene-specific AD risk patterns across different tissues using interactive visualizations
  • Interpret AD risk scores and expression predictions in biological context

Prerequisites

For this browser quickstart: No setup required - fully interactive playground experience.

To run on your own compute:

  • Hardware: GPU with 40GB+ VRAM (NVIDIA H100 recommended)
  • Processing time: ~3-4 minutes for 45 tissues
  • Input Data: VCF file with genetic variants (GRCh38 reference genome)
  • Model: Pre-trained VariantFormer checkpoint (14GB) + AD risk predictors
  • Software: VariantFormer package with AD risk prediction components

Setup

To run VCF2Risk on your own compute, you'll need to set up the environment and install dependencies.

Complete setup instructions: VariantFormer GitHub Setup Guide

This includes:

  • Installing the VariantFormer package and dependencies
  • Downloading model checkpoints and AD risk predictors from the public S3 bucket
  • Setting up reference genome files (GRCh38)

The browser playground below doesn't require any setup.

Introduction

VCF2Risk predicts how genetic variants in a specific gene contribute to Alzheimer's disease risk across different tissues.

Model Architecture

The pipeline combines two AI components:

1. VariantFormer Model (Seq2Gene + Seq2Reg transformers):

  • Input: DNA sequence with variants from VCF file
  • Output: Tissue-specific gene expression predictions + 1536-dimensional embeddings
  • Purpose: Captures how genetic variants affect gene regulation in each tissue
  • Size: 14GB checkpoint, ~1.2B parameters

2. AD Risk Predictors (Gradient-boosted decision trees):

  • Input: Gene-tissue embeddings from the VariantFormer model
  • Output: Alzheimer's disease risk probability (0-1 scale)
  • Training: Separate models for each gene-tissue pair (~16,400 genes × 45 tissues)
  • Format: Treelite .tl model files stored in S3

Pipeline Flow

VCF Variants → VariantFormer Model → [Expression + Embedding] → AD Predictor → Risk Score
                                      ↑ intermediate            ↑ primary output

Input Data Requirements

VCF File:

  • Standard VCF format (v4.2 or later)
  • Reference genome: GRCh38/hg38 (critical - must match training data)
  • Can be bgzipped (.vcf.gz) or uncompressed
  • Must contain variants for the selected gene region
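
Some of the VCF requirements above can be checked before running inference. The following is a minimal sketch, using only the Python standard library, that reads the VCF meta-information lines and reports the declared file format and whether any `##reference` or `##contig` line mentions GRCh38/hg38. The helper names (`open_vcf`, `check_vcf_header`) are hypothetical and not part of the VariantFormer package. The check is a heuristic: many VCFs omit the reference header, so a negative result only means the build could not be confirmed from the header.

```python
import gzip
import os
import tempfile
from pathlib import Path


def open_vcf(path):
    """Open a plain or bgzipped VCF as text (bgzip output is gzip-compatible)."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def check_vcf_header(path):
    """Return (fileformat, mentions_grch38) from the '##' meta-information lines."""
    fileformat = None
    mentions_grch38 = False
    with open_vcf(path) as handle:
        for line in handle:
            if not line.startswith("##"):
                break  # meta-information lines have ended
            if line.startswith("##fileformat="):
                fileformat = line.strip().split("=", 1)[1]
            lowered = line.lower()
            if lowered.startswith(("##reference=", "##contig=")) and (
                "grch38" in lowered or "hg38" in lowered
            ):
                mentions_grch38 = True
    return fileformat, mentions_grch38


# Demo on a tiny synthetic VCF written to a temporary directory.
demo_text = (
    "##fileformat=VCFv4.2\n"
    "##reference=GRCh38\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.vcf.gz")
    with gzip.open(demo_path, "wt") as out:
        out.write(demo_text)
    demo_result = check_vcf_header(demo_path)

print(demo_result)
```

A header check like this cannot confirm that the file contains variants in the selected gene region; that still has to be verified against the gene's coordinates.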

Gene Selection:

  • Choose one gene per analysis
  • Only genes with trained AD predictors available (~16,400 genes)
  • Dropdown auto-filters to available genes

Tissue Selection:

  • 45 out of 63 GTEx tissues have AD risk models
  • Can analyze all tissues or focus on specific organ systems
  • Default: All 45 tissues for comprehensive analysis

Expected Outputs

For each gene-tissue combination:

  1. Predicted Expression (intermediate output):

    • How variants alter gene expression in that tissue
    • Log-scale expression values
    • Provides biological context for risk scores
  2. AD Risk Score (primary output):

    • Probability (0-1) that gene contributes to AD in this tissue
    • Trained from AD case-control gene expression datasets
    • Higher scores = greater predicted disease contribution
    • Tissue-specific: same gene can have different risk across tissues

The playground below uses precomputed predictions for output exploration:

Select Gene for Analysis

Choose one gene to analyze for AD risk contribution. The dropdown shows only genes that have trained AD risk predictors available.

Recommended genes for Alzheimer's disease analysis:

  • APOE (Apolipoprotein E): Strongest genetic risk factor for late-onset AD
  • APP (Amyloid Precursor Protein): Mutations cause early-onset familial AD
  • PSEN1 (Presenilin 1): Familial AD gene, affects amyloid processing
  • PSEN2 (Presenilin 2): Another familial AD gene
  • MAPT (Microtubule Associated Protein Tau): Associated with tauopathies
  • TREM2 (Triggering Receptor on Myeloid Cells 2): Immune gene linked to AD

Why gene-specific predictors?

Each AD risk predictor is trained for a specific gene-tissue combination, learning how that gene's regulatory patterns (captured in the embedding) relate to AD pathology in that particular tissue context.

import pandas as pd
from processors import ad_risk  # AD risk prediction engine

# Initialize VariantFormer AD risk predictor
adrisk = ad_risk.ADriskFromVCF()

# VCF path (GRCh38 reference genome required)
vcf_path = 'path/to/your_sample.vcf.gz'

# Get available genes with AD predictors
genes_df = adrisk.genes_map.reset_index()
available_ad_genes = adrisk.ad_preds.get_unique('gene_id')
genes_with_ad = genes_df[genes_df['gene_id'].isin(available_ad_genes)]

# Select gene for analysis
selected_gene_id = 'ENSG00000130203'  # APOE example

print(f"{len(genes_with_ad)} genes with AD predictors available")

Select Tissues for Analysis

Choose which tissues to analyze for AD risk. By default, all 45 tissues with trained AD risk predictors are selected for comprehensive analysis.

Tissue Coverage:

  • 45 out of 63 GTEx tissues have AD risk models trained
  • Includes major organ systems: nervous, cardiovascular, digestive, respiratory, etc.
  • 13 brain regions available for CNS-focused analysis

Analysis Strategies:

  • Comprehensive (default): All 45 tissues to see complete risk landscape
  • Brain-focused: Select only CNS tissues for neurological analysis
  • Comparative: Choose a few key tissues for targeted comparison
  • System-specific: Focus on one organ system (e.g., cardiovascular)

# Get tissues with AD predictors
available_tissue_ids = adrisk.ad_preds.get_unique('tissue_id')
tissue_ids = list(available_tissue_ids)  # All 45 tissues

print(f"{len(tissue_ids)} tissues with AD predictors")
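
For the brain-focused strategy described above, the tissue list can be narrowed by name before running predictions. This is a minimal sketch on a hypothetical tissue table: the IDs are placeholders, and the assumption that brain tissue names begin with "Brain" follows the GTEx naming convention rather than anything confirmed by the VariantFormer metadata.

```python
import pandas as pd

# Hypothetical tissue table; real IDs and names come from the AD predictor metadata.
tissues = pd.DataFrame(
    {
        "tissue_id": [0, 1, 2, 3],
        "tissue_name": [
            "Brain - Cortex",
            "Brain - Hippocampus",
            "Heart - Left Ventricle",
            "Liver",
        ],
    }
)

# Keep only tissues whose name mentions "brain" (case-insensitive).
is_brain = tissues["tissue_name"].str.contains("brain", case=False)
brain_tissue_ids = tissues.loc[is_brain, "tissue_id"].tolist()

print(brain_tissue_ids)
```

The resulting list could then stand in for `tissue_ids` in the prediction call.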

Run AD Risk Predictions

The prediction pipeline executes: VCF parsing → VariantFormer inference → AD risk computation from embeddings.

# Run AD risk prediction pipeline
# Steps:
#  1. Load VCF variants for the selected gene region
#  2. Predict gene expression across tissues using VariantFormer
#  3. Generate 1536-dim embeddings (regulatory state representations)
#  4. Download AD predictors from S3 (one model per tissue)
#  5. Compute AD risk scores for each gene-tissue combination

predictions_df = adrisk(vcf_path, [selected_gene_id] * len(tissue_ids), tissue_ids)

# Results DataFrame with columns:
# - gene_name, gene_id, tissue_name, tissue_id
# - predicted_expression (intermediate output)
# - ad_risk (primary output, 0-1 scale)

print(f"Predictions: {predictions_df.shape}")
print(f"Mean AD risk: {predictions_df['ad_risk'].mean():.4f}")
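
The call above passes a gene list and a tissue list of equal length, one entry per gene-tissue pair. A minimal sketch of building such parallel lists with `itertools.product` follows. The tissue IDs are placeholders, and whether `adrisk` accepts more than one distinct gene in a single call is an assumption, so the call itself is left commented out.

```python
from itertools import product

gene_ids = ["ENSG00000130203", "ENSG00000142192"]  # APOE, APP
tissue_ids = [7, 8, 9]  # placeholder tissue IDs, for illustration only

# One (gene, tissue) pair per prediction, flattened into two parallel lists.
pairs = list(product(gene_ids, tissue_ids))
paired_genes = [gene for gene, _ in pairs]
paired_tissues = [tissue for _, tissue in pairs]

print(len(pairs), "gene-tissue pairs")

# Assumed usage, mirroring the single-gene call in this quickstart:
# predictions_df = adrisk(vcf_path, paired_genes, paired_tissues)
```

With a single gene this reduces to the `[selected_gene_id] * len(tissue_ids)` pattern used in the quickstart.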

Understanding the Results

Each row represents predictions for one tissue. The table shows both intermediate and final outputs from the pipeline.

How to interpret:

  • AD Risk Score (0-1):

    • Values near 0.0: Low predicted risk
    • Values near 1.0: High predicted risk
  • Predicted Expression (context):

    • Shows whether variants increase or decrease gene activity
    • Helps explain why risk might be high (e.g., overexpression of risk gene)
    • Intermediate output from VariantFormer model
  • Tissue Specificity:

    • Same gene can have different risk scores across tissues
    • Reflects tissue-specific biology and disease mechanisms
    • Brain tissues might show distinct patterns for neurological disease genes

Visualize Risk Distribution

This bar chart displays AD risk scores for all analyzed tissues, sorted and color-coded by risk level.

import plotly.express as px

fig = px.bar(
    predictions_df.sort_values('ad_risk', ascending=False),  # sort bars by risk, highest first
    x='tissue_name',
    y='ad_risk',
    title=f'AD Risk: {predictions_df.iloc[0]["gene_name"]} across Tissues',
    color='ad_risk',
    color_continuous_scale='viridis',
    labels={'ad_risk': 'AD Risk Score', 'tissue_name': 'Tissue'}
)
fig.update_xaxes(tickangle=45)
fig.update_layout(height=500)
fig.show()

What to look for:

  • High-risk tissues: Brighter colors (yellow), taller bars
  • Tissue patterns: Do certain organ systems cluster together in risk?
  • Outliers: Tissues with unusually high or low risk compared to others
  • Brain regions: For AD genes, often show elevated risk in CNS tissues

Anatomical Risk Mapping

The anatomogram displays AD risk scores spatially mapped onto human body diagrams, providing intuitive visualization of tissue-specific disease risk patterns.

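# Convert to anatomogram format
# Note: EnhancedVCFRiskConverter and AnatomagramMultiViewWidget must already be imported;
# their import path is not shown in this quickstart (see the setup guide).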
enhanced_converter = EnhancedVCFRiskConverter(aggregation_strategy='mean')
anatomagram_data, enhanced_metadata = \
    enhanced_converter.convert_predictions_to_anatomagram(predictions_df)

# Create multi-view widget
multi_widget = AnatomagramMultiViewWidget(
    visualization_data=anatomagram_data,
    selected_item="AD_RISK",
    available_views=["male", "female", "brain"],
    color_palette="viridis",
    scale_type="linear",
    uberon_names=enhanced_metadata['uberon_names'],
    enhanced_tooltips=enhanced_metadata['enhanced_tooltips']
)

# Display widget (works in Jupyter, Marimo, or Streamlit)
multi_widget

Features:

  • Three anatomical views: Male, female, and brain-focused anatomies
  • Color-coded risk levels: Viridis palette (purple = low risk, yellow = high risk)
  • Interactive tooltips: Hover over colored regions for detailed information
  • Hierarchical mapping: Related tissues are aggregated (by mean in this example) into anatomical structures
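
The mean aggregation of tissue-level scores into anatomical structures can be illustrated with a pandas group-by. This is a sketch only: the tissue-to-structure mapping and the scores are hypothetical, and the actual converter maps tissues to UBERON terms internally, so its logic may differ.

```python
import pandas as pd

# Toy predictions; the values are made up for illustration.
predictions = pd.DataFrame(
    {
        "tissue_name": [
            "Brain - Cortex",
            "Brain - Hippocampus",
            "Heart - Left Ventricle",
            "Heart - Atrial Appendage",
        ],
        "ad_risk": [0.75, 0.25, 0.5, 0.0],
    }
)

# Hypothetical mapping from tissue to the anatomical structure it is drawn on.
tissue_to_structure = {
    "Brain - Cortex": "brain",
    "Brain - Hippocampus": "brain",
    "Heart - Left Ventricle": "heart",
    "Heart - Atrial Appendage": "heart",
}

predictions["structure"] = predictions["tissue_name"].map(tissue_to_structure)

# One value per structure: the mean of its member tissues.
structure_risk = predictions.groupby("structure")["ad_risk"].mean()

print(structure_risk)
```

Averaging hides within-structure variation, so the per-tissue table remains the place to check individual regions.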

How to use:

  • Switch between tabs to see different anatomical perspectives
  • Hover over tissues to see exact risk values and tissue names
  • Compare patterns across different body systems visually
  • Identify risk hotspots where disease contribution is concentrated

Summary Statistics

View the tissues with highest and lowest predicted AD risk for your selected gene.
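
A minimal sketch of this summary with pandas, assuming a results table with the `tissue_name` and `ad_risk` columns described earlier. The values below are made up for illustration.

```python
import pandas as pd

# Toy stand-in for predictions_df; real values come from the pipeline.
predictions = pd.DataFrame(
    {
        "tissue_name": [
            "Brain - Cortex",
            "Brain - Hippocampus",
            "Liver",
            "Lung",
            "Whole Blood",
        ],
        "ad_risk": [0.82, 0.77, 0.21, 0.35, 0.15],
    }
)

# Highest- and lowest-risk tissues, each ordered from most to least extreme.
top_tissues = predictions.nlargest(2, "ad_risk")
bottom_tissues = predictions.nsmallest(2, "ad_risk")

print(top_tissues)
print(bottom_tissues)
```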

Interpreting Your Results

What do AD risk scores mean?

The risk scores (0-1) represent the predicted probability that this gene's regulatory state contributes to Alzheimer's disease in each tissue, based on:

  • Expression patterns learned from AD case-control cohorts
  • Regulatory signatures captured in gene-tissue embeddings
  • Variant effects on gene expression in your VCF file

Clinical Context

These are research predictions, not clinical diagnoses. They indicate:

  • Genes and tissues where variants may influence AD biology
  • Tissue-specific mechanisms of genetic risk
  • Hypotheses for follow-up experimental validation

Limitations

  • Predictions based on population-level training data
  • Individual AD risk depends on many factors beyond single genes
  • Some tissues may lack sufficient AD training data
  • Scores reflect correlation, not necessarily causation
  • Model does not account for environmental factors, epigenetics, or post-transcriptional regulation

Next Steps

Analyze Your Own Data

To run VCF2Risk on your own genetic data:

  1. Prepare VCF file: Ensure it uses GRCh38 reference genome
  2. Update VCF path: Edit the vcf_path variable in example notebooks
  3. Select gene: Choose gene(s) of interest from available genes with AD predictors
  4. Select tissues: Choose relevant tissues for your research question
  5. Export results: Save predictions with predictions_df.to_csv('my_results.csv')
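
The export step can be sketched as follows, assuming a results table shaped like `predictions_df`. The data is a toy stand-in, and the file is written to a temporary directory only so that the example cleans up after itself.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for predictions_df; values are made up for illustration.
predictions = pd.DataFrame(
    {
        "gene_id": ["ENSG00000130203"] * 3,
        "tissue_name": ["Liver", "Brain - Cortex", "Lung"],
        "ad_risk": [0.21, 0.82, 0.35],
    }
)

# Sort by risk so the exported file reads from highest to lowest.
ranked = predictions.sort_values("ad_risk", ascending=False)

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "my_results.csv")
    ranked.to_csv(out_path, index=False)  # index=False omits the row index column
    reloaded = pd.read_csv(out_path)

print(reloaded)
```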

Further Exploration

Comparative analysis:

  • Run analysis multiple times with different AD-associated genes (APOE, APP, PSEN1, etc.)
  • Compare risk patterns across genes to identify common vs. gene-specific tissue effects
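
One way to compare genes side by side is to stack the per-gene results and pivot them into a tissue-by-gene matrix. The sketch below uses made-up scores and assumes each run yields a table with `gene_name`, `tissue_name`, and `ad_risk` columns, as described in the results section.

```python
import pandas as pd

# Toy results from two separate runs; values are illustrative only.
apoe = pd.DataFrame(
    {
        "gene_name": "APOE",
        "tissue_name": ["Brain - Cortex", "Liver", "Lung"],
        "ad_risk": [0.75, 0.5, 0.25],
    }
)
app = pd.DataFrame(
    {
        "gene_name": "APP",
        "tissue_name": ["Brain - Cortex", "Liver", "Lung"],
        "ad_risk": [0.5, 0.5, 0.125],
    }
)

combined = pd.concat([apoe, app], ignore_index=True)

# Rows are tissues, columns are genes, cells are AD risk scores.
risk_matrix = combined.pivot(index="tissue_name", columns="gene_name", values="ad_risk")

# Per-tissue spread across genes: small values suggest a shared tissue effect,
# large values suggest a gene-specific one.
spread = risk_matrix.max(axis=1) - risk_matrix.min(axis=1)
most_divergent = spread.idxmax()

print(risk_matrix)
print("Most gene-specific tissue:", most_divergent)
```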

Focused analysis:

  • Select only brain tissues for CNS-specific AD mechanisms
  • Focus on peripheral tissues to explore systemic disease contributions

References

Anatomogram Visualizations

The anatomical diagrams used in this quickstart are derived from the Expression Atlas project and are licensed under Creative Commons Attribution 4.0 International License.

Citation:

Moreno P, Fexova S, George N, et al. Expression Atlas update: gene and protein expression in multiple species. Nucleic Acids Research. 2022;50(D1):D129-D140. doi:10.1093/nar/gkab1030

Source: Expression Atlas, EMBL-EBI

License: Creative Commons Attribution 4.0 International (CC BY 4.0)

VariantFormer Model

Citation:

Ghosal, S., et al. (2025). VariantFormer: A hierarchical transformer integrating DNA sequences with genetic variation and regulatory landscapes for personalized gene expression prediction. bioRxiv 2025.10.31.685862. DOI: 10.1101/2025.10.31.685862

Training Data

  • GTEx v8: Tissue-specific gene expression reference data
  • AD cohort datasets: Case-control data for risk predictor training

Additional Resources

Responsible Use

This tool is for research purposes only.

Research Tool Disclaimer

  • VCF2Risk is a research model, not a clinical diagnostic tool
  • Predictions should not be used for medical decision-making without appropriate validation
  • This tool does not provide medical advice, diagnosis, or treatment recommendations
  • Consult qualified healthcare professionals for any health-related questions

Scientific Limitations

  • Predictions are based on GTEx and AD cohort training data - may not generalize to all populations
  • Rare or novel variants may have uncertain predicted effects
  • Model does not account for environmental factors, epigenetic variation, or post-transcriptional regulation
  • AD risk scores are probabilistic and should be validated experimentally when possible

Data Privacy

  • VCF data is processed locally and not uploaded to external servers
  • Users are responsible for ensuring compliance with relevant data governance policies
  • Handle genetic data according to institutional IRB protocols and privacy regulations

Acceptable Use

Follow the Acceptable Use Policy.

This tool is intended for:

  • Academic research and genomics education
  • Exploratory analysis of variant-to-disease relationships
  • Hypothesis generation for experimental validation
  • Understanding tissue-specific AD mechanisms

Not intended for:

  • Clinical diagnosis or treatment decisions
  • Direct-to-consumer genetic interpretation
  • Medical advice or health recommendations