
VariantFormer

Version v0.1 released 02 Nov 2025

License

MIT

Developed By

  • VariantFormer Team
  • Chan Zuckerberg Initiative

A biology-guided, 1.2-billion-parameter transformer that predicts gene-level RNA abundance across diverse tissues and cell lines from personalized DNA sequences. It was trained jointly on the largest curated collection of paired whole-genome sequencing and bulk RNA-seq samples to date, drawn from the GTEx, MAGE, ADNI, and ENCODE datasets.


Model Details

Model Architecture

VariantFormer employs a biology-guided, two-stage hierarchical transformer architecture specifically designed for personal genome-aware gene expression prediction. The model integrates long-range cis-regulatory elements (cCREs) with gene-specific transcription windows to capture both distal regulatory effects and local genomic features.

Input Representation:

The model processes two complementary genomic contexts for each gene:

  • CRE (Cis-Regulatory Elements) Window: ±1 Mb around the gene body, from which the model draws the overlapping candidate cis-regulatory elements (cCREs) among the approximately 1.06 million catalogued in the ENCODE Registry. These include promoter-like sequences (PLS), proximal and distal enhancer-like sequences (pELS, dELS), CTCF-bound elements, and chromatin-accessible regions.
  • Transcription Window: Extends from 1 kb upstream of the transcription start site (TSS) to 300 kb downstream or the gene end, whichever comes first. This window captures promoters, 5' and 3' untranslated regions (UTRs), exons, and introns. Both windows are made concrete in the sketch below.
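
For concreteness, the two windows for a gene can be derived from its annotation roughly as follows (a minimal sketch; 0-based half-open coordinates and chromosome-bound clipping are assumptions, not the exact pipeline):

```python
def gene_windows(tss, gene_start, gene_end, strand, chrom_len):
    """Illustrative window coordinates for one gene (0-based, half-open).

    CRE window: +/-1 Mb around the gene body, clipped to the chromosome.
    Transcription window: 1 kb upstream of the TSS to 300 kb downstream
    or the gene end, whichever comes first (strand-aware).
    """
    CRE_FLANK, UPSTREAM, MAX_DOWN = 1_000_000, 1_000, 300_000

    cre = (max(0, gene_start - CRE_FLANK), min(chrom_len, gene_end + CRE_FLANK))

    if strand == "+":
        tx = (max(0, tss - UPSTREAM), min(gene_end, tss + MAX_DOWN))
    else:  # "upstream"/"downstream" flip for negative-strand genes
        tx = (max(gene_start, tss - MAX_DOWN), min(chrom_len, tss + UPSTREAM))
    return cre, tx
```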

For each donor, personalized DNA sequences are generated by embedding individual-specific variants using the following rules (sketched in code after this list):

  • Heterozygous variants: IUPAC ambiguity codes (e.g., A/G → R)
  • Homozygous alternate alleles: Direct ALT base substitution
  • Strand-aware processing: Negative-strand genes are reverse-complemented
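
A minimal sketch of these personalization rules (the IUPAC table is the standard one; SNV-only handling and window-relative coordinates are simplifying assumptions):

```python
# IUPAC ambiguity codes for heterozygous base pairs (standard table).
IUPAC = {frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
         frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M"}
COMPLEMENT = str.maketrans("ACGTRYSWKMN", "TGCAYRSWMKN")

def personalize(ref_seq, variants, strand="+"):
    """Embed donor SNVs into a reference sequence window.

    `variants` maps window-relative positions to (genotype, alt_base);
    indels are omitted here for brevity.
    """
    seq = list(ref_seq.upper())
    for pos, (genotype, alt) in variants.items():
        if genotype == "hom_alt":        # homozygous ALT: direct substitution
            seq[pos] = alt
        elif genotype == "het":          # heterozygous: IUPAC ambiguity code
            seq[pos] = IUPAC[frozenset((seq[pos], alt))]
    seq = "".join(seq)
    if strand == "-":                    # negative-strand genes: reverse-complement
        seq = seq.translate(COMPLEMENT)[::-1]
    return seq
```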

Sequences are tokenized using Byte-Pair Encoding (BPE) with a 500-token vocabulary trained on ENCODE cCRE sequences, enabling the model to learn biologically meaningful DNA motifs.
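
For illustration, a comparable 500-token BPE vocabulary could be trained with the Hugging Face tokenizers library; in practice the released tokenizer checkpoint should be used (the special tokens and input handling here are assumptions):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

def train_dna_bpe(ccre_sequences, vocab_size=500):
    """Train a BPE tokenizer on plain DNA strings (A/C/G/T plus IUPAC codes)."""
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    trainer = BpeTrainer(vocab_size=vocab_size,
                         special_tokens=["[UNK]", "[PAD]"])
    tokenizer.train_from_iterator(ccre_sequences, trainer=trainer)
    return tokenizer

# tok = train_dna_bpe(list_of_ccre_strings)   # hypothetical input list
# token_ids = tok.encode("ACGTRYACGT").ids
```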

Stage 1: Mutation-Aware Sequence Encoders

Pre-trained transformer encoders generate embeddings for donor-specific genomic sequences:

  • Architecture: 12-layer transformer with Flash Attention and ALiBi positional encoding
  • Embedding dimension: 512
  • Pretraining task: Tissue-specific chromatin accessibility classification using ENTEx data (4 donors, 16 tissues)
  • Training objective: Combined contrastive loss + binary cross-entropy to ensure mutation sensitivity
  • Transfer learning strategy:
    • CRE encoders: Frozen during downstream training to preserve learned regulatory grammar
    • Gene encoders: Fine-tuned end-to-end to adapt to expression prediction
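
The ALiBi positional encoding referenced above adds a fixed, per-head linear penalty on token distance to the attention logits instead of learned position embeddings. A minimal sketch of the standard (symmetric, encoder-style) formulation, not VariantFormer's exact implementation:

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Standard ALiBi bias (Press et al., 2022): head h subtracts
    slope_h * |i - j| from the attention logit between positions i and j.
    Slopes form the geometric sequence 2^(-8/n), 2^(-16/n), ..., 2^(-8)."""
    slopes = 2.0 ** (-8.0 * torch.arange(1, n_heads + 1) / n_heads)
    pos = torch.arange(seq_len)
    dist = (pos[None, :] - pos[:, None]).abs()           # (L, L) token distances
    return -slopes[:, None, None] * dist[None, :, :]     # (H, L, L), add to logits
```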

Stage 2: Hierarchical Modulator Architecture

Two parallel transformer stacks process regulatory and genic contexts:

Epigenetics Modulator (CRE Processing):

  • 25 transformer encoder layers refine regulatory element representations
  • Self-attention enables bidirectional communication between CREs to model combinatorial regulatory logic
  • Functional annotation cross-attention: Integrates learned embeddings for cCRE types (PLS, dELS, pELS, CTCF-bound, etc.)
  • Output: Multi-scale CRE representations capturing progressively refined regulatory contexts

Gene Modulator (Transcription Window Processing):

  • 25 transformer layers with cross-attention to CRE representations
  • Input: Transcription window partitioned into 200 non-overlapping chunks of 200 tokens each
  • Cross-attention mechanism: Each layer attends to the corresponding CRE representation from the Epigenetics Modulator, modeling enhancer-promoter interactions and distal regulatory effects
  • Hierarchical integration: Mirrors biological regulatory cascades where primary CRE signals are progressively integrated with higher-order interactions
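
A minimal sketch of one such layer, self-attention over transcription-window chunks followed by cross-attention into the CRE stream (pre-norm layout, head count, and module names are assumptions, not the released code):

```python
import torch.nn as nn

class GeneModulatorLayer(nn.Module):
    """Illustrative layer: self-attention among gene chunks, then
    cross-attention to the matching CRE representation."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, gene, cre):
        # gene: (B, 200 chunks [+ registry tokens], dim); cre: (B, n_cres, dim)
        x = self.norm1(gene + self.self_attn(gene, gene, gene)[0])
        x = self.norm2(x + self.cross_attn(x, cre, cre)[0])  # enhancer-promoter coupling
        return x
```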

Tissue-Specific Conditioning:

  • Learnable registry tokens for 62 tissues and cell types
  • Prepended to gene representations to condition all attention layers
  • Final registry token representation captures tissue-specific regulatory state
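
A minimal sketch of this conditioning step (module and variable names are illustrative):

```python
import torch
import torch.nn as nn

class TissueConditioning(nn.Module):
    """Illustrative tissue conditioning: one learnable registry token per
    tissue/cell type, prepended to the gene representation."""
    def __init__(self, n_tissues=62, dim=512):
        super().__init__()
        self.registry = nn.Embedding(n_tissues, dim)

    def forward(self, gene_repr, tissue_ids):
        # gene_repr: (B, n_chunks, dim); tissue_ids: (B,) integer tissue indices
        token = self.registry(tissue_ids).unsqueeze(1)   # (B, 1, dim)
        return torch.cat([token, gene_repr], dim=1)      # registry token first
```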

Expression Prediction:

  • 2-layer MLP with GeGLU activation
  • Softplus output ensures non-negative predictions
  • Loss function: Poisson negative log-likelihood, appropriate for RNA-seq count data
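
A minimal sketch of the prediction head as described (the hidden width is an assumption):

```python
import torch.nn as nn
import torch.nn.functional as F

class ExpressionHead(nn.Module):
    """Illustrative 2-layer MLP with GeGLU activation and Softplus output."""
    def __init__(self, dim=512, hidden=1024):
        super().__init__()
        self.proj = nn.Linear(dim, 2 * hidden)   # value and gate for GeGLU
        self.out = nn.Linear(hidden, 1)

    def forward(self, registry_token):
        value, gate = self.proj(registry_token).chunk(2, dim=-1)
        h = value * F.gelu(gate)                  # GeGLU: value gated by GELU(gate)
        return F.softplus(self.out(h))            # non-negative expression estimate
```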

Effective Context: >2 Mb regulatory window per gene through dual-window design

Parameters

The model contains 1.2 billion parameters trained on the largest curated collection of paired whole-genome sequencing and bulk RNA-seq samples available.

Citation

Ghosal, S., et al. (2025). VariantFormer: A hierarchical transformer integrating DNA sequences with genetic variation and regulatory landscapes for personalized gene expression prediction. bioRxiv 2025.10.31.685862. DOI: 10.1101/2025.10.31.685862

Model Card Authors

VariantFormer Team, CZI AI

Primary Contact Email

virtualcellmodels@chanzuckerberg.com

System Requirements

VariantFormer requires significant computational resources for inference and depends on specific reference data:

Hardware Requirements:

  • GPU Memory: 16+ GB VRAM recommended for full precision inference
  • System Memory: 32+ GB RAM for data preprocessing and sequence generation
  • Storage: 10+ GB for model checkpoints and reference data
  • Training (for reference): 376 H100 GPUs used for distributed training

Software Requirements:

  • Python: 3.12+
  • PyTorch: 2.0+
  • CUDA: 11.8+ with a compatible GPU
  • Reference Genome: GRCh38 (hg38) assembly
  • Gene Annotations: GENCODE v24 basic annotations
  • cCRE Registry: ENCODE Registry of candidate cis-regulatory elements

Data Processing Tools:

  • bcftools for variant processing
  • samtools for FASTA indexing
  • BPE tokenizer with 500-token vocabulary (provided with model)

Model Variants

| Model | Description | S3 Path |
| --- | --- | --- |
| Mutation-aware encoders | Transformer-based sequence encoders trained on chromatin activity data | https://czi-variantformer.s3.us-west-2.amazonaws.com/model/v4_ag/tokenizer_checkpoint.pth |
| VF-PCG | Protein-coding gene expression prediction (18,439 genes across 62 tissues and cell lines). Trained for 12 epochs on protein-coding genes. | https://czi-variantformer.s3.us-west-2.amazonaws.com/model/v4_pcg/checkpoint.pth |
| VF-AG | All annotated gene expression prediction (50,956 genes, including 32,517 non-coding genes, across 62 tissues and cell lines). Extended training on the full gene set. | https://czi-variantformer.s3.us-west-2.amazonaws.com/model/v4_ag/checkpoint.pth |

Notebooks

| Notebook | Description | Path |
| --- | --- | --- |
| vcf2exp.ipynb | Tutorial on predicting tissue-specific gene expression from a VCF containing SNPs and indels; also shows how to embed a reference genome directly. | https://github.com/czi-ai/variantformer/blob/main/notebooks/vcf2exp.ipynb |
| variant2exp.ipynb | Tutorial on predicting population-specific effects of mutations and capturing the effects of in-silico mutations in the context of a sample genome. | https://github.com/czi-ai/variantformer/blob/main/notebooks/variant2exp.ipynb |
| vcf2risk.ipynb | Predicting Alzheimer's risk conditioned on gene and tissue from a sample genome. | https://github.com/czi-ai/variantformer/blob/main/notebooks/vcf2risk.ipynb |
| variant2risk.ipynb | Predicting the effect of in-silico edits to a sample genome on Alzheimer's risk. | https://github.com/czi-ai/variantformer/blob/main/notebooks/variant2risk.ipynb |

Intended Use

Primary Use Cases:

  • Personal genome-aware gene expression prediction
  • Variant effect prediction on gene regulation
  • Tissue-specific expression modeling
  • Disease risk assessment from genetic variants
  • Population genomics analysis

Out-of-Scope or Unauthorized Use Cases:

Do not use the model for the following purposes:

  • Clinical diagnosis or treatment recommendations
  • Direct patient care decisions without experimental validation
  • Population-level discrimination or bias reinforcement
  • Any use that violates applicable laws, regulations (including trade compliance laws), or third-party rights such as privacy or intellectual property rights
  • Any use that is prohibited by the Acceptable Use Policy

Training Details

Training Date

Pretraining: Early 2025 (29 epochs on ENTEx data)
Downstream Training: Mid-2025 (Stage 1: 12 epochs, Stage 2: 10 epochs)

Training Data

The model was trained on the largest curated collection of paired whole-genome sequencing (WGS) and bulk RNA-seq data to date, combining:

Primary Datasets:

  • GTEx v10: 19,616 tissue samples across 54 anatomical sites from 948 donors
  • MAGE (1000 Genomes): 731 lymphoblastoid cell lines representing 26 global populations across five super-populations (AFR, AMR, EAS, EUR, SAS)
  • ADNI: 808 participants with WGS data; 650 samples with quality-filtered gene expression from Affymetrix microarray (RIN > 6)
  • ENCODE: 6 cancer cell lines with paired RNA-seq and WGS: A549 (lung), HepG2 (liver), K562 (leukemia), NCI-H460 (lung), Panc1 (pancreas), GM23248 (lymphoblastoid)

Totals:

  • 21,004 RNA-seq samples from 2,330 unique donors
  • 50,956 genes: 18,439 protein-coding + 32,517 non-coding genes
  • Tissues/Cell Types: 62 (54 GTEx tissues + 6 ENCODE cell lines + ADNI blood + MAGE LCL)

Pretraining Data:

  • ENTEx: 4 donors, 16 tissues, tissue-specific chromatin accessibility (DNase-seq, H3K4me3) for mutation-aware encoder pretraining

Population Diversity:

  • Inferred Ancestry (GTEx + ADNI): 64.7% EUR, 13.6% AFR, 8.6% AMR, 6.9% EAS, 6.2% SAS
  • MAGE Reference Populations: Balanced representation across 5 super-populations

Quality Control:

  • RNA Integrity Number (RIN) > 6 for all samples
  • Genes with fewer than 20 non-zero expression counts were filtered out
  • Tissue-specific 10th percentile filtering for non-coding genes

Training Procedure

Training proceeded in two distinct phases: encoder pretraining and downstream gene expression prediction.

Phase 1: Encoder Pretraining (ENTEx Chromatin Activity)

Mutation-aware sequence encoders were pretrained on tissue-specific chromatin accessibility classification:

  • Objective: Binary classification of cCRE activity (active vs. inactive) + contrastive learning across donors
  • Data: ENTEx paired WGS and chromatin data (4 donors, 16 tissues)
  • Architecture: 12-layer transformer encoders (512-dim)
  • Training Configuration:
    • 29 epochs to convergence
    • 8 GPUs, per-device batch size 32, gradient accumulation 12 steps (effective batch 3,072)
    • Adam optimizer: lr=1e-4, weight decay=0.01
    • ReduceLROnPlateau scheduling (patience=2, factor=0.1)
    • Mixed precision training (bfloat16)
  • Holdout: Chromosome 21 for encoder validation

Phase 2: Downstream Gene Expression Prediction

Two-stage training progressively expanded gene coverage:

Stage 1 - Protein-Coding Genes (12 epochs):

  • Genes: 18,439 protein-coding genes
  • Training Strategy:
    • CRE encoders frozen (preserve pretrained regulatory representations)
    • Gene encoders fine-tuned end-to-end
  • Distributed Training:
    • 376 H100 GPUs
    • Per-device batch size: 2 gene-donor pairs
    • Gradient accumulation: 11 steps
    • Tissue sampling: 2 tissues per gene-donor pair
    • Effective batch size: 16,544 samples
  • Optimization:
    • AdamW: lr=1e-4, weight decay=0.01, gradient clipping 1.0
    • Warmup-cosine schedule (1% warmup, min lr=1e-5)
    • Mixed precision (bfloat16)
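
For concreteness, the stated warmup-cosine schedule could be computed per optimizer step as below (step accounting and granularity are assumptions):

```python
import math

def warmup_cosine_lr(step, total_steps, base_lr=1e-4, min_lr=1e-5, warmup_frac=0.01):
    """Linear warmup over the first 1% of steps, then cosine decay to min_lr."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```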

Stage 2 - All Annotated Genes (10 epochs):

  • Genes: 50,956 total (added 32,517 non-coding genes)
  • Initialization: From Stage 1 checkpoint
  • Learning Rate: Reduced to 4e-5 (min lr=1e-6) for fine-grained adaptation
  • Same distributed configuration as Stage 1

Loss Function:

  • Poisson negative log-likelihood for count-based RNA-seq data
  • Trained on log1p-transformed TPM values
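
Illustrative usage of this objective with PyTorch's built-in loss (reduction and exact target handling are assumptions):

```python
import torch

# Poisson NLL between non-negative predicted rates and log1p(TPM) targets,
# per the description above.
loss_fn = torch.nn.PoissonNLLLoss(log_input=False)  # inputs are rates, not log-rates

pred_rate = torch.nn.functional.softplus(torch.randn(8))  # mimics the Softplus head
target = torch.log1p(torch.rand(8) * 100.0)               # log1p-transformed TPM
loss = loss_fn(pred_rate, target)
```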

Holdout Strategy:

  • GTEx/MAGE/ADNI: 10% donor-level holdout (stratified by ancestry for MAGE)
  • ENCODE: Chromosome 19 genes held out (somatic variant generalization test)

Training Code

https://github.com/czi-ai/variantformer


Performance Metrics

Evaluation Framework

VariantFormer is evaluated using two complementary metrics that capture different aspects of prediction quality:

  • Gene Correlation: Measures how accurately the model predicts expression variability across donors and tissues for individual genes (Spearman ρ per gene, averaged across genes)
  • Subject Correlation: Measures how accurately the model predicts expression variability across genes within individual samples (Spearman ρ per sample, averaged across samples)
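
As a concrete illustration of the two metrics, a sketch computing both from paired prediction and observation matrices (function and variable names are illustrative):

```python
import numpy as np
from scipy.stats import spearmanr

def gene_and_subject_correlation(pred: np.ndarray, obs: np.ndarray):
    """pred, obs: (n_samples, n_genes) matrices of predicted/observed expression.

    Gene correlation: Spearman rho per gene across samples, averaged over genes.
    Subject correlation: Spearman rho per sample across genes, averaged over samples.
    """
    gene_rho = np.mean([spearmanr(pred[:, g], obs[:, g]).correlation
                        for g in range(pred.shape[1])])
    subj_rho = np.mean([spearmanr(pred[s, :], obs[s, :]).correlation
                        for s in range(pred.shape[0])])
    return gene_rho, subj_rho
```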

All evaluations performed on held-out test sets with no donor overlap with training data.

Gene Expression Prediction Performance

Table 1: Gene Correlation (Cross-Donor/Tissue Variability)

This metric evaluates the challenging task of predicting individual-level and tissue-specific expression variation:

| Model | Protein-Coding (n=18,439) | Non-Protein-Coding (n=32,517) |
| --- | --- | --- |
| VariantFormer-AG | 0.804 (95% CI: 0.802-0.806) | 0.544 (95% CI: 0.542-0.547) |
| VariantFormer-PCG | 0.803 (95% CI: 0.801-0.805) | n/a |
| TWAS Random Forest | 0.787 (95% CI: 0.785-0.789) | 0.469 (95% CI: 0.466-0.472) |
| Enformer | 0.774 (95% CI: 0.772-0.777) | 0.507 (95% CI: 0.504-0.510) |
| Borzoi | 0.769 (95% CI: 0.767-0.771) | 0.476 (95% CI: 0.473-0.479) |

Key Findings:

  • VariantFormer achieves 2.2% improvement over TWAS and 3.9% improvement over Enformer for protein-coding genes
  • For non-coding genes, VariantFormer shows 16.0% improvement over TWAS and 7.3% improvement over Enformer
  • Gene correlation is more discriminative than subject correlation, revealing meaningful model differences

Table 2: Subject Correlation (Cross-Gene Variability)

This metric primarily captures differences in mean expression between genes (saturated across models):

| Model | Protein-Coding | Non-Protein-Coding |
| --- | --- | --- |
| VariantFormer-AG | 0.97 | 0.87 |
| TWAS Random Forest | 0.97 | 0.87 |
| Enformer | 0.96 | 0.86 |
| Borzoi | 0.96 | 0.86 |

Somatic Variant Generalization (ENCODE Cell Lines)

Performance on held-out chromosome 19 genes with high somatic mutation burden:

| Cell Line | VariantFormer-AG | Enformer | Borzoi |
| --- | --- | --- | --- |
| GM23248 (lymphoblastoid) | 0.848 | 0.752 | 0.655 |
| Panc1 (pancreatic cancer) | 0.840 | 0.713 | 0.627 |
| HepG2 (liver cancer) | 0.834 | 0.706 | 0.613 |
| A549 (lung cancer) | 0.805 | 0.678 | 0.589 |
| NCI-H460 (lung cancer) | 0.800 | 0.668 | 0.579 |
| K562 (leukemia) | 0.763 | 0.613 | 0.549 |

Note: TWAS models cannot generalize to unseen genes or novel variant combinations, and are excluded from this evaluation.

Variant Effect Prediction (eQTL Validation)

Table 3: eQTL Replication Across Independent Studies

Spearman correlation between predicted variant effects and empirical eQTL slopes:

| Dataset | VariantFormer-AG (Ensemble) | Borzoi | AlphaGenome |
| --- | --- | --- | --- |
| All variants (6 tissues combined) | 0.60 | ~0.0 | ~0.0 |
| Rare variants (MAF < 5%) | 0.20 | 0.04 | 0.06 |

Tissues evaluated: TwinsUK (adipose, blood, skin) and brain tissues (substantia nigra, frontal cortex BA9, putamen)
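
The variant effects correlated here are in-silico scores: the model is run on otherwise identical sequences carrying the reference and alternate alleles, and the difference in predicted expression is taken as the effect. A minimal sketch (the `predict` callable and its signature are hypothetical stand-ins, not the released API):

```python
def variant_effect(predict, ref_window, alt_window, tissue):
    """Illustrative in-silico variant effect score: the change in predicted
    expression when the alternate allele replaces the reference allele."""
    return predict(alt_window, tissue) - predict(ref_window, tissue)

# Replication is then the Spearman correlation between these predicted
# effects and the empirical eQTL slopes for the same variant-gene pairs.
```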

Population-Specific eQTL Performance (BrainSeq frontal cortex):

| Variant Set | VF-EUR | VF-AFR | Allele-Freq Weighted |
| --- | --- | --- | --- |
| EUR-enriched (AF_EUR > 10%, AF_AFR < 5%) | 0.27 (AG) / 0.28 (PCG) | 0.04 (AG) / 0.19 (PCG) | 0.33 (AG) / 0.38 (PCG) |
| AFR-enriched (AF_AFR > 10%, AF_EUR < 5%) | 0.04 (AG) / 0.35 (PCG) | 0.27 (AG) / 0.23 (PCG) | 0.20 (AG) / 0.25 (PCG) |

Disease Risk Prediction: Alzheimer's Disease

Supervised Classification (ADNI Cohort, n=370)

Using tissue-specific gene embeddings with random forest classifiers on top-10 gene-tissue pairs:

  • VariantFormer-PCG: Best AUPRC on held-out test set (n=40 donors)
  • Cross-validation: Strong performance across 330 training donors
  • Top predictive genes identified: APOE, TOMM40, and other AD-associated loci

Zero-Shot MAGMA Enrichment:

  • Multiple brain tissues show significant enrichment (p < 0.05): anterior cingulate cortex, cerebellar hemisphere, frontal cortex, cortex
  • Tissue-specific signal not achievable with tissue-agnostic models

APOE In-Silico Editing Results:

| Variant | Effect Direction | Log Odds Ratio |
| --- | --- | --- |
| APOE-ε4 (rs429358) | Risk-increasing | +1.06 (95% CI: 0.95 to 1.18) |
| APOE-ε2 (rs7412) | Protective | -0.29 (95% CI: -0.41 to -0.17) |

Predictions recapitulate known APOE allele risk architecture through in-silico mutation of patient genomes.

Evaluation Datasets

  • Independent eQTL studies: eQTL Catalogue (6 tissue-specific datasets: TwinsUK, Braineac2, BrainSeq)
  • Population stratification: 1000 Genomes ancestry-matched validation
  • Rare variant validation: MAF < 5% subset with 10,000 variants per tissue
  • Disease cohorts: ADNI (Alzheimer's), ENCODE cancer cell lines
  • Cross-ancestry testing: EUR, AFR, EAS, SAS, AMR populations

Biases, Risks, and Limitations

Potential Biases

Population Bias:

  • Training data ancestry distribution: 64.7% EUR, 13.6% AFR, 8.6% AMR, 6.9% EAS, 6.2% SAS
  • European ancestry overrepresentation may affect prediction accuracy in underrepresented populations
  • Cross-ancestry validation demonstrates that the model maintains reasonable performance across populations, but optimal accuracy is achieved with ancestry-matched predictions
  • Mitigation: Allele-frequency weighted ensemble predictions aggregate across 5 super-populations to reduce ancestry-specific bias

Tissue Bias:

  • GTEx tissue representation varies: some tissues have hundreds of samples (e.g., muscle: 818, whole blood: 803), others have fewer (e.g., bladder: 77, fallopian tube: 29)
  • Performance varies across tissue types based on training representation and biological signal-to-noise ratio
  • Brain tissues particularly well-represented due to GTEx

Technical Bias:

  • Model performance depends on sequencing quality (30x high-coverage WGS required)
  • RNA-seq quality filters (RIN > 6) may bias toward well-preserved samples
  • Training on bulk RNA-seq limits resolution to tissue-level averages (no cell-type specificity)

Gene Type Bias:

  • Lower performance on non-coding genes (ρ=0.54) vs protein-coding (ρ=0.80) reflects biological complexity
  • Non-coding RNAs have lower expression, higher noise, and more context-dependent regulation

Risks

Prediction Uncertainty:

  • Model outputs represent statistical predictions, not biological certainties
  • Predictions should not be interpreted as deterministic causal effects
  • Expression predictions are probabilistic estimates with associated confidence intervals

Population Transferability:

  • While cross-ancestry validation shows generalization, performance may degrade for populations underrepresented in training data
  • Rare population-specific variants may not be accurately predicted
  • Linkage disequilibrium patterns differ across ancestries, affecting compound variant effects

Temporal Limitations:

  • Training data reflects genomic knowledge as of 2025
  • Gene annotations (GENCODE v24), cCRE registry (ENCODE 2020), and reference genome (GRCh38) may become outdated
  • Future discoveries may reveal additional regulatory mechanisms not captured by current architecture

Disease Application Risks:

  • Disease risk predictions (e.g., Alzheimer's) are exploratory and require experimental validation
  • Small cohort sizes (ADNI n=370) limit statistical power compared to large GWAS (n>100,000)
  • In-silico editing predictions are counterfactual simulations, not experimental observations

Limitations

Architectural Constraints:

  • Context Window: Regulatory context limited to ±1 Mb around the gene body; distal interactions beyond 1 Mb are not captured
  • Transcription Window: Maximum 300 kb downstream from the TSS; very long genes may be truncated
  • Variant Types: Only SNPs and small indels supported; structural variants, copy number variations, and complex rearrangements excluded
  • Phasing: Model uses IUPAC encoding for heterozygous variants but does not explicitly model haplotype phase

Computational Limitations:

  • GPU Memory: 16+ GB VRAM required for inference, which limits accessibility
  • Processing Time: Per-gene predictions require processing megabase-scale sequences
  • Scalability: Genome-wide predictions for all genes computationally expensive

Biological Scope:

  • Bulk RNA-seq: Cannot predict cell-type-specific expression or cell-state variation
  • Steady-state expression: Trained on static tissue samples; cannot model dynamic or stimulus-responsive expression
  • Post-transcriptional regulation: Model predicts mRNA abundance; does not capture protein-level regulation, splicing isoforms, or RNA stability

Generalization Boundaries:

  • Unseen tissues: Performance on tissues not in training data unknown
  • Pathological states: Most training data from normal/healthy tissues; disease-state expression may differ
  • Environmental factors: Model cannot account for diet, medications, environmental exposures, or lifestyle factors affecting expression

Caveats and Recommendations

Experimental Validation Required:

  • All predictions should be validated experimentally before drawing biological conclusions
  • In-silico editing and variant effect predictions are computational hypotheses, not experimental evidence
  • Disease risk scores are exploratory and not validated for clinical use

Population-Appropriate Application:

  • Consider ancestry matching for optimal prediction accuracy
  • Use allele-frequency weighted ensemble predictions for diverse populations
  • Interpret predictions cautiously for underrepresented ancestries

Responsible Use:

  • We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when using the model
  • The model is intended to be used for research purposes only and was not designed for clinical, diagnostic, or therapeutic purposes
  • Do not use predictions to discriminate against individuals or populations

Computational Considerations:

  • GPU resources are required and may not be accessible to all researchers
  • Consider computational cost for large-scale variant screening applications

For security or privacy concerns, contact security@chanzuckerberg.com or privacy@chanzuckerberg.com

Acknowledgements

The VariantFormer team acknowledges the contributions of the GTEx Consortium, ENCODE Project, ADNI initiative, and 1000 Genomes Project for providing the foundational datasets that made this work possible.