VariantFormer
Version v0.1 released 02 Nov 2025
License
MIT
Repository
https://github.com/czi-ai/variantformer
Developed By
- VariantFormer Team
- Chan Zuckerberg Initiative
A biology-guided, 1.2-billion-parameter transformer that predicts gene-level RNA abundance across diverse tissues and cell lines from personalized DNA sequences. Trained jointly on the largest curated collection of paired whole-genome sequencing and bulk RNA-seq samples to date from GTEx, MAGE, ADNI, and ENCODE datasets.
Model Details
Model Architecture
VariantFormer employs a biology-guided, two-stage hierarchical transformer architecture specifically designed for personal genome-aware gene expression prediction. The model integrates long-range cis-regulatory elements (cCREs) with gene-specific transcription windows to capture both distal regulatory effects and local genomic features.
Input Representation:
The model processes two complementary genomic contexts for each gene:
- CRE (Cis-Regulatory Elements) Window: ±1 Mb around the gene body, capturing the candidate cis-regulatory elements (cCREs) that fall within it. cCREs are drawn from the ENCODE Registry of approximately 1.06 million elements and include promoter-like sequences (PLS), proximal and distal enhancer-like sequences (pELS, dELS), CTCF-bound elements, and chromatin-accessible regions.
- Transcription Window: Extends from 1kb upstream of the transcription start site to the lesser of 300kb downstream or the gene end. This window captures promoters, 5' and 3' untranslated regions (UTRs), exons, and introns.
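As a rough illustration of the two windows described above, the sketch below computes their coordinates for a single gene. The function names, the 0-based half-open coordinate convention, and the treatment of negative-strand genes (TSS at the highest coordinate, so "upstream" points to larger positions) are assumptions made for this example and are not taken from the VariantFormer codebase. Clipping at the chromosome end is omitted.

```python
from typing import NamedTuple

MB = 1_000_000
KB = 1_000


class Window(NamedTuple):
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def cre_window(gene_start: int, gene_end: int, flank: int = MB) -> Window:
    """+/- 1 Mb around the gene body, clipped at the chromosome start."""
    return Window(max(0, gene_start - flank), gene_end + flank)


def transcription_window(gene_start: int, gene_end: int, strand: str,
                         upstream: int = KB,
                         max_downstream: int = 300 * KB) -> Window:
    """1 kb upstream of the TSS to the lesser of 300 kb downstream or the gene end."""
    if strand == "+":
        tss = gene_start
        return Window(max(0, tss - upstream), min(gene_end, tss + max_downstream))
    if strand == "-":
        # Assumption: on the negative strand the TSS is the highest coordinate,
        # so "upstream" extends to the right and "downstream" to the left.
        tss = gene_end
        return Window(max(gene_start, tss - max_downstream), tss + upstream)
    raise ValueError(f"strand must be '+' or '-', got {strand!r}")


# A 500 kb gene on the positive strand is truncated 300 kb downstream of its TSS.
print(transcription_window(5_000_000, 5_500_000, "+"))
print(cre_window(5_000_000, 5_500_000))
```

Under these assumptions the CRE window of a gene is always at least 2 Mb wide, which matches the ">2 Mb" effective context stated for the dual-window design.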
For each donor, personalized DNA sequences are generated by embedding individual-specific variants using:
- Heterozygous variants: IUPAC ambiguity codes (e.g., A/G → R)
- Homozygous alternate alleles: Direct ALT base substitution
- Strand-aware processing: Negative-strand genes are reverse-complemented
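The three rules above can be sketched for single-nucleotide variants as follows. This is a minimal illustration, not the project's implementation: the `personalize` function, the `(offset, ref, alt, genotype)` tuple format, and the genotype labels are hypothetical, and indels are left out because they shift downstream offsets.

```python
# IUPAC ambiguity codes for unordered pairs of distinct bases.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}
# Complement table that also covers the two-base ambiguity codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMN", "TGCAYRSWMKN")


def personalize(ref_seq, variants, strand="+"):
    """Apply SNVs to a reference sequence.

    variants: iterable of (offset, ref, alt, genotype) with 0-based offsets
    into ref_seq and genotype "het" or "hom_alt" (hypothetical format).
    """
    seq = list(ref_seq.upper())
    for offset, ref, alt, genotype in variants:
        if seq[offset] != ref:
            raise ValueError(f"reference mismatch at offset {offset}")
        if genotype == "het":
            # Heterozygous: encode both alleles with one ambiguity code.
            seq[offset] = IUPAC[frozenset((ref, alt))]
        elif genotype == "hom_alt":
            # Homozygous alternate: substitute the ALT base directly.
            seq[offset] = alt
        else:
            raise ValueError(f"unknown genotype {genotype!r}")
    out = "".join(seq)
    if strand == "-":
        # Negative-strand genes are reverse-complemented.
        out = out.translate(COMPLEMENT)[::-1]
    return out


variants = [(0, "A", "G", "het"), (3, "T", "C", "hom_alt")]
print(personalize("ACGTA", variants))        # RCGCA
print(personalize("ACGTA", variants, "-"))   # TGCGY
```

Because an ambiguity code records only which two alleles are present, this encoding carries no haplotype phase.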
Sequences are tokenized using Byte-Pair Encoding (BPE) with a 500-token vocabulary trained on ENCODE cCRE sequences, enabling the model to learn biologically meaningful DNA motifs.
Stage 1: Mutation-Aware Sequence Encoders
Pre-trained transformer encoders generate embeddings for donor-specific genomic sequences:
- Architecture: 12-layer transformer with Flash Attention and ALiBi positional encoding
- Embedding dimension: 512
- Pretraining task: Tissue-specific chromatin accessibility classification using ENTEx data (4 donors, 16 tissues)
- Training objective: Combined contrastive loss + binary cross-entropy to ensure mutation sensitivity
- Transfer learning strategy:
- CRE encoders: Frozen during downstream training to preserve learned regulatory grammar
- Gene encoders: Fine-tuned end-to-end to adapt to expression prediction
Stage 2: Hierarchical Modulator Architecture
Two parallel transformer stacks process regulatory and genic contexts:
Epigenetics Modulator (CRE Processing):
- 25 transformer encoder layers refine regulatory element representations
- Self-attention enables bidirectional communication between CREs to model combinatorial regulatory logic
- Functional annotation cross-attention: Integrates learned embeddings for cCRE types (PLS, dELS, pELS, CTCF-bound, etc.)
- Output: Multi-scale CRE representations capturing progressively refined regulatory contexts
Gene Modulator (Transcription Window Processing):
- 25 transformer layers with cross-attention to CRE representations
- Input: Transcription window partitioned into 200 non-overlapping chunks of 200 tokens each
- Cross-attention mechanism: Each layer attends to the corresponding CRE representation from the Epigenetics Modulator, modeling enhancer-promoter interactions and distal regulatory effects
- Hierarchical integration: Mirrors biological regulatory cascades where primary CRE signals are progressively integrated with higher-order interactions
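The chunked input described above implies a fixed capacity of 200 × 200 = 40,000 tokens per transcription window. The sketch below shows one way to produce that layout from a tokenized sequence. The truncation of overflow, the right-padding, and the `pad_id` value are assumptions for illustration, since the source does not say how shorter or longer sequences are handled.

```python
def chunk_tokens(token_ids, n_chunks=200, chunk_len=200, pad_id=0):
    """Split a token sequence into n_chunks non-overlapping chunks of chunk_len."""
    capacity = n_chunks * chunk_len              # 40,000 tokens with the defaults
    ids = list(token_ids)[:capacity]             # assumption: truncate overflow
    ids += [pad_id] * (capacity - len(ids))      # assumption: right-pad with pad_id
    return [ids[i:i + chunk_len] for i in range(0, capacity, chunk_len)]


chunks = chunk_tokens(range(1, 451))             # a toy 450-token sequence
print(len(chunks), len(chunks[0]))               # 200 200
```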
Tissue-Specific Conditioning:
- Learnable registry tokens for 62 tissues and cell types
- Prepended to gene representations to condition all attention layers
- Final registry token representation captures tissue-specific regulatory state
Expression Prediction:
- 2-layer MLP with GeGLU activation
- Softplus output ensures non-negative predictions
- Loss function: Poisson negative log-likelihood, appropriate for RNA-seq count data
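To make the output head concrete, the following sketch shows a softplus activation and a Poisson negative log-likelihood in plain Python. It assumes the common form of the loss that drops the term depending only on the target, and the `eps` stabilizer is an assumption of this example. The model's actual loss implementation may differ.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x)); the result is always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def poisson_nll(rate: float, target: float, eps: float = 1e-8) -> float:
    """Poisson negative log-likelihood, up to a constant independent of rate."""
    return rate - target * math.log(rate + eps)


raw_output = -2.5                      # unconstrained value from the MLP
rate = softplus(raw_output)            # mapped to a positive predicted rate
print(rate > 0, poisson_nll(rate, target=3.0))
```

For a fixed target, this loss is minimized when the predicted rate equals the target, which is why a positive-valued output such as softplus pairs naturally with it.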
Effective Context: >2 Mb regulatory window per gene through dual-window design
Parameters
The model contains 1.2 billion parameters trained on the largest curated collection of paired whole-genome sequencing and bulk RNA-seq samples available.
Citation
Ghosal, S., et al. (2025). VariantFormer: A hierarchical transformer integrating DNA sequences with genetic variation and regulatory landscapes for personalized gene expression prediction. bioRxiv 2025.10.31.685862. DOI: 10.1101/2025.10.31.685862
Model Card Authors
VariantFormer Team, CZI AI
Primary Contact Email
virtualcellmodels@chanzuckerberg.com
System Requirements
VariantFormer requires significant computational resources for inference and depends on specific reference data:
Hardware Requirements:
- GPU Memory: 16+ GB VRAM recommended for full precision inference
- System Memory: 32+ GB RAM for data preprocessing and sequence generation
- Storage: 10+ GB for model checkpoints and reference data
- Training (for reference): 376 H100 GPUs used for distributed training
Software Requirements:
- Python: 3.12+
- PyTorch: 2.0+
- CUDA: 11.8+ compatible GPU
- Reference Genome: GRCh38 (hg38) assembly
- Gene Annotations: GENCODE v24 basic annotations
- cCRE Registry: ENCODE Registry of candidate cis-regulatory elements
Data Processing Tools:
- bcftools for variant processing
- samtools for FASTA indexing
- BPE tokenizer with 500-token vocabulary (provided with model)
Model Variants
| Model | Description | S3 Path |
|---|---|---|
| Mutation-aware encoders | Transformer-based sequence encoders trained on chromatin activity data | https://czi-variantformer.s3.us-west-2.amazonaws.com/model/v4_ag/tokenizer_checkpoint.pth |
| VF-PCG | Protein-coding gene expression prediction (18,439 genes across 62 tissues and cell lines). Trained for 12 epochs on protein-coding genes. | https://czi-variantformer.s3.us-west-2.amazonaws.com/model/v4_pcg/checkpoint.pth |
| VF-AG | All annotated gene expression prediction (50,956 genes, including 32,517 non-coding genes across 62 tissues and cell lines). Extended training on the full gene set. | https://czi-variantformer.s3.us-west-2.amazonaws.com/model/v4_ag/checkpoint.pth |
Notebooks
| Notebook | Description | Path |
|---|---|---|
| vcf2exp.ipynb | Tutorial on predicting tissue-specific gene expression from a VCF containing SNPs and indels. Also shows how to embed a reference genome directly. | https://github.com/czi-ai/variantformer/blob/main/notebooks/vcf2exp.ipynb |
| variant2exp.ipynb | Tutorial on predicting population-specific effects of mutations and the effects of in silico mutations in the context of a sample genome. | https://github.com/czi-ai/variantformer/blob/main/notebooks/variant2exp.ipynb |
| vcf2risk.ipynb | Predicting Alzheimer's risk from a sample genome, conditioned on gene and tissue. | https://github.com/czi-ai/variantformer/blob/main/notebooks/vcf2risk.ipynb |
| variant2risk.ipynb | Predicting the effect of in silico edits to a sample genome on Alzheimer's risk. | https://github.com/czi-ai/variantformer/blob/main/notebooks/variant2risk.ipynb |
Intended Use
Primary Use Cases:
- Personal genome-aware gene expression prediction
- Variant effect prediction on gene regulation
- Tissue-specific expression modeling
- Disease risk assessment from genetic variants
- Population genomics analysis
Out-of-Scope or Unauthorized Use Cases:
Do not use the model for the following purposes:
- Clinical diagnosis or treatment recommendations
- Direct patient care decisions without experimental validation
- Population-level discrimination or bias reinforcement
- Any use that violates applicable laws, regulations (including trade compliance laws), or third-party rights such as privacy or intellectual property rights
- Any use that is prohibited by the Acceptable Use Policy
Training Details
Training Date
Pretraining: Early 2025 (29 epochs on ENTEx data)
Downstream Training: Mid-2025 (Stage 1: 12 epochs, Stage 2: 10 epochs)
Training Data
The model was trained on the largest curated collection of paired whole-genome sequencing (WGS) and bulk RNA-seq data to date, combining:
Primary Datasets:
- GTEx v10: 19,616 tissue samples across 54 anatomical sites from 948 donors
- MAGE (1000 Genomes): 731 lymphoblastoid cell lines representing 26 global populations (AFR, AMR, EAS, EUR, SAS)
- ADNI: 808 participants with WGS data; 650 samples with quality-filtered gene expression from Affymetrix microarray (RIN > 6)
- ENCODE: 6 cell lines with paired RNA-seq and WGS, five of them cancer-derived: A549 (lung), HepG2 (liver), K562 (leukemia), NCI-H460 (lung), Panc1 (pancreas), GM23248 (lymphoblastoid)
Totals:
- 21,004 RNA-seq samples from 2,330 unique donors
- 50,956 genes: 18,439 protein-coding + 32,517 non-coding genes
- Tissues/Cell Types: 62 (54 GTEx tissues + 6 ENCODE cell lines + ADNI blood + MAGE LCL)
Pretraining Data:
- ENTEx: 4 donors, 16 tissues, tissue-specific chromatin accessibility (DNase-seq, H3K4me3) for mutation-aware encoder pretraining
Population Diversity:
- Inferred Ancestry (GTEx + ADNI): 64.7% EUR, 13.6% AFR, 8.6% AMR, 6.9% EAS, 6.2% SAS
- MAGE Reference Populations: Balanced representation across 5 super-populations
Quality Control:
- RNA Integrity Number (RIN) > 6 for all samples
- Genes with <20 non-zero expression counts filtered
- Tissue-specific 10th percentile filtering for non-coding genes
Training Procedure
Training proceeded in two distinct phases: encoder pretraining and downstream gene expression prediction.
Phase 1: Encoder Pretraining (ENTEx Chromatin Activity)
Mutation-aware sequence encoders were pretrained on tissue-specific chromatin accessibility classification:
- Objective: Binary classification of cCRE activity (active vs. inactive) + contrastive learning across donors
- Data: ENTEx paired WGS and chromatin data (4 donors, 16 tissues)
- Architecture: 12-layer transformer encoders (512-dim)
- Training Configuration:
- 29 epochs to convergence
- 8 GPUs, per-device batch size 32, gradient accumulation 12 steps (effective batch 3,072)
- Adam optimizer: lr=1e-4, weight decay=0.01
- ReduceLROnPlateau scheduling (patience=2, factor=0.1)
- Mixed precision training (bfloat16)
- Holdout: Chromosome 21 for encoder validation
Phase 2: Downstream Gene Expression Prediction
Two-stage training progressively expanded gene coverage:
Stage 1 - Protein-Coding Genes (12 epochs):
- Genes: 18,439 protein-coding genes
- Training Strategy:
- CRE encoders frozen (preserve pretrained regulatory representations)
- Gene encoders fine-tuned end-to-end
- Distributed Training:
- 376 H100 GPUs
- Per-device batch size: 2 gene-donor pairs
- Gradient accumulation: 11 steps
- Tissue sampling: 2 tissues per gene-donor pair
- Effective batch size: 16,544 samples
- Optimization:
- AdamW: lr=1e-4, weight decay=0.01, gradient clipping 1.0
- Warmup-cosine schedule (1% warmup, min lr=1e-5)
- Mixed precision (bfloat16)
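The effective batch sizes quoted for both training phases follow from multiplying the distributed-training settings. The short check below reproduces them. The helper name and its `samples_per_item` parameter are illustrative only.

```python
def effective_batch_size(n_gpus, per_device_batch, grad_accum_steps,
                         samples_per_item=1):
    """Samples contributing to one optimizer step across all devices."""
    return n_gpus * per_device_batch * grad_accum_steps * samples_per_item


# Stage 1: 376 GPUs x 2 gene-donor pairs x 11 accumulation steps x 2 tissues
stage1 = effective_batch_size(376, 2, 11, samples_per_item=2)
# Encoder pretraining: 8 GPUs x batch size 32 x 12 accumulation steps
pretrain = effective_batch_size(8, 32, 12)
print(stage1, pretrain)  # 16544 3072
```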
Stage 2 - All Annotated Genes (10 epochs):
- Genes: 50,956 total (added 32,517 non-coding genes)
- Initialization: From Stage 1 checkpoint
- Learning Rate: Reduced to 4e-5 (min lr=1e-6) for fine-grained adaptation
- Same distributed configuration as Stage 1
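A warmup-cosine schedule of the kind used in both stages can be sketched as below. The linear shape of the warmup and the step-based formulation are assumptions of this example. The source specifies only the 1% warmup, the peak learning rates, and the minimum learning rates, and it is assumed here that Stage 2 keeps the same schedule shape.

```python
import math


def warmup_cosine_lr(step, total_steps, peak_lr, min_lr, warmup_frac=0.01):
    """Learning rate at a given optimizer step (0-based)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Assumption: linear ramp from near zero up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


total = 10_000  # hypothetical number of optimizer steps
stage1_lrs = [warmup_cosine_lr(s, total, peak_lr=1e-4, min_lr=1e-5) for s in (0, 99, 5_000, total)]
stage2_lrs = [warmup_cosine_lr(s, total, peak_lr=4e-5, min_lr=1e-6) for s in (0, 99, 5_000, total)]
print(stage1_lrs)
print(stage2_lrs)
```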
Loss Function:
- Poisson negative log-likelihood for count-based RNA-seq data
- Trained on log1p-transformed TPM values
Holdout Strategy:
- GTEx/MAGE/ADNI: 10% donor-level holdout (stratified by ancestry for MAGE)
- ENCODE: Chromosome 19 genes held out (somatic variant generalization test)
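The key property of a donor-level holdout is that every sample from a given donor lands on the same side of the split. The sketch below illustrates that property with a deterministic hash-based assignment. The hashing approach, the `salt` string, and the function names are choices made for this example rather than the method used for VariantFormer, and ancestry stratification is not shown.

```python
import hashlib


def is_holdout(donor_id: str, holdout_frac: float = 0.10,
               salt: str = "vf-split") -> bool:
    """Deterministically assign a donor to the holdout set."""
    digest = hashlib.sha256(f"{salt}:{donor_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x1_0000_0000   # approximately uniform in [0, 1)
    return bucket < holdout_frac


def split_samples(samples):
    """samples: iterable of (donor_id, tissue) pairs."""
    train, test = [], []
    for donor_id, tissue in samples:
        (test if is_holdout(donor_id) else train).append((donor_id, tissue))
    return train, test


samples = [(f"donor{i}", t) for i in range(200) for t in ("liver", "lung", "blood")]
train, test = split_samples(samples)
train_donors = {d for d, _ in train}
test_donors = {d for d, _ in test}
print(len(train_donors & test_donors))  # 0: no donor appears on both sides
```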
Training Code
https://github.com/czi-ai/variantformer
Data Sources
Performance Metrics
Evaluation Framework
VariantFormer is evaluated using two complementary metrics that capture different aspects of prediction quality:
- Gene Correlation: Measures how accurately the model predicts expression variability across donors and tissues for individual genes (Spearman ρ per gene, averaged across genes)
- Subject Correlation: Measures how accurately the model predicts expression variability across genes within individual samples (Spearman ρ per sample, averaged across samples)
All evaluations are performed on held-out test sets with no donor overlap with the training data.
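The difference between the two metrics is the axis along which Spearman correlation is computed. The following sketch implements both in plain Python on a samples × genes matrix. The matrix layout and function names are assumptions for illustration, and columns or rows with zero variance are not handled.

```python
from statistics import mean


def rank(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rho as the Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))


def gene_correlation(pred, obs):
    """Spearman per gene (column, across samples), averaged across genes."""
    n_genes = len(pred[0])
    return mean(spearman([row[g] for row in pred], [row[g] for row in obs])
                for g in range(n_genes))


def subject_correlation(pred, obs):
    """Spearman per sample (row, across genes), averaged across samples."""
    return mean(spearman(p, o) for p, o in zip(pred, obs))


# Toy data: rows are samples, columns are genes with very different mean levels.
obs = [[1, 10, 100], [2, 11, 101], [3, 12, 102]]
pred = [[3, 12, 102], [2, 11, 101], [1, 10, 100]]   # sample order reversed
print(subject_correlation(pred, obs), gene_correlation(pred, obs))
```

In this toy case the predictions rank genes correctly within every sample while ranking samples incorrectly for every gene, so subject correlation is perfect and gene correlation is as poor as it can be. This shows how a model can score well on subject correlation by capturing only mean expression differences between genes.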
Gene Expression Prediction Performance
Table 1: Gene Correlation (Cross-Donor/Tissue Variability)
This metric evaluates the challenging task of predicting individual-level and tissue-specific expression variation:
| Model | Protein-Coding (n=18,439) | Non-Protein-Coding (n=32,517) |
|---|---|---|
| VariantFormer-AG | 0.804 (95% CI: 0.802-0.806) | 0.544 (95% CI: 0.542-0.547) |
| VariantFormer-PCG | 0.803 (95% CI: 0.801-0.805) | — |
| TWAS Random Forest | 0.787 (95% CI: 0.785-0.789) | 0.469 (95% CI: 0.466-0.472) |
| Enformer | 0.774 (95% CI: 0.772-0.777) | 0.507 (95% CI: 0.504-0.510) |
| Borzoi | 0.769 (95% CI: 0.767-0.771) | 0.476 (95% CI: 0.473-0.479) |
Key Findings:
- VariantFormer-AG achieves a 2.2% relative improvement in gene correlation over TWAS and a 3.9% relative improvement over Enformer for protein-coding genes
- For non-coding genes, VariantFormer-AG shows a 16.0% relative improvement over TWAS and a 7.3% relative improvement over Enformer
- Gene correlation is more discriminative than subject correlation, revealing meaningful model differences
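The improvement percentages in the key findings are relative differences between the Table 1 correlations, which can be reproduced directly from the table values. The helper name below is illustrative.

```python
def relative_improvement(new, baseline):
    """Percentage improvement of `new` over `baseline`."""
    return 100.0 * (new - baseline) / baseline


# Gene correlations from Table 1: (protein-coding, non-protein-coding)
vf_ag = (0.804, 0.544)
baselines = {"TWAS Random Forest": (0.787, 0.469), "Enformer": (0.774, 0.507)}

for name, (pc, nc) in baselines.items():
    print(name,
          round(relative_improvement(vf_ag[0], pc), 1),
          round(relative_improvement(vf_ag[1], nc), 1))
```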
Table 2: Subject Correlation (Cross-Gene Variability)
This metric primarily captures differences in mean expression between genes (saturated across models):
| Model | Protein-Coding | Non-Protein-Coding |
|---|---|---|
| VariantFormer-AG | 0.97 | 0.87 |
| TWAS Random Forest | 0.97 | 0.87 |
| Enformer | 0.96 | 0.86 |
| Borzoi | 0.96 | 0.86 |
Somatic Variant Generalization (ENCODE Cell Lines)
Performance on held-out chromosome 19 genes with high somatic mutation burden:
| Cell Line | VariantFormer-AG | Enformer | Borzoi |
|---|---|---|---|
| GM23248 (lymphoblastoid) | 0.848 | 0.752 | 0.655 |
| Panc1 (pancreatic cancer) | 0.840 | 0.713 | 0.627 |
| HepG2 (liver cancer) | 0.834 | 0.706 | 0.613 |
| A549 (lung cancer) | 0.805 | 0.678 | 0.589 |
| NCI-H460 (lung cancer) | 0.800 | 0.668 | 0.579 |
| K562 (leukemia) | 0.763 | 0.613 | 0.549 |
Note: TWAS models cannot generalize to unseen genes or novel variant combinations, and are excluded from this evaluation.
Variant Effect Prediction (eQTL Validation)
Table 3: eQTL Replication Across Independent Studies
Spearman correlation between predicted variant effects and empirical eQTL slopes:
| Dataset | VariantFormer-AG (Ensemble) | Borzoi | AlphaGenome |
|---|---|---|---|
| All variants (6 tissues combined) | 0.60 | ~0.0 | ~0.0 |
| Rare variants (MAF < 5%) | 0.20 | 0.04 | 0.06 |
Tissues evaluated: TwinsUK (adipose, blood, skin) and brain tissues (substantia nigra, frontal cortex BA9, putamen)
Population-Specific eQTL Performance (BrainSeq frontal cortex):
| Variant Set | VF-EUR | VF-AFR | Allele-Freq Weighted |
|---|---|---|---|
| EUR-enriched (AF_EUR > 10%, AF_AFR < 5%) | 0.27 (AG) / 0.28 (PCG) | 0.04 (AG) / 0.19 (PCG) | 0.33 (AG) / 0.38 (PCG) |
| AFR-enriched (AF_AFR > 10%, AF_EUR < 5%) | 0.04 (AG) / 0.35 (PCG) | 0.27 (AG) / 0.23 (PCG) | 0.20 (AG) / 0.25 (PCG) |
Disease Risk Prediction: Alzheimer's Disease
Supervised Classification (ADNI Cohort, n=370)
Using tissue-specific gene embeddings with random forest classifiers on top-10 gene-tissue pairs:
- VariantFormer-PCG: Best AUPRC on held-out test set (n=40 donors)
- Cross-validation: Strong performance across 330 training donors
- Top predictive genes: APOE, TOMM40, and other AD-associated loci identified
Zero-Shot MAGMA Enrichment:
- Multiple brain tissues show significant enrichment (p < 0.05): anterior cingulate cortex, cerebellar hemisphere, frontal cortex, cortex
- Tissue-specific signal not achievable with tissue-agnostic models
APOE In-Silico Editing Results:
| Variant | Effect Direction | Log Odds Ratio |
|---|---|---|
| APOE-ε4 (rs429358) | Risk-increasing | +1.06 (95% CI: 0.95 to 1.18) |
| APOE-ε2 (rs7412) | Protective | -0.29 (95% CI: -0.41 to -0.17) |
Predictions recapitulate known APOE allele risk architecture through in-silico mutation of patient genomes.
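Log odds ratios are easier to interpret after exponentiation. The snippet below converts the table's point estimates and confidence bounds to odds ratios. It is a plain arithmetic transformation of the reported values, and the helper name is illustrative.

```python
import math


def to_odds_ratio(log_or, ci_low, ci_high):
    """Exponentiate a log odds ratio and its confidence bounds."""
    return tuple(round(math.exp(v), 2) for v in (log_or, ci_low, ci_high))


apoe_e4 = to_odds_ratio(1.06, 0.95, 1.18)      # rs429358
apoe_e2 = to_odds_ratio(-0.29, -0.41, -0.17)   # rs7412
print(apoe_e4, apoe_e2)
```

On the odds scale, the predicted ε4 effect corresponds to roughly 2.9-fold higher odds and the predicted ε2 effect to roughly 0.75-fold odds.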
Evaluation Datasets
- Independent eQTL studies: eQTL Catalogue (6 tissue-specific datasets: TwinsUK, Braineac2, BrainSeq)
- Population stratification: 1000 Genomes ancestry-matched validation
- Rare variant validation: MAF < 5% subset with 10,000 variants per tissue
- Disease cohorts: ADNI (Alzheimer's), ENCODE cancer cell lines
- Cross-ancestry testing: EUR, AFR, EAS, SAS, AMR populations
Biases, Risks, and Limitations
Potential Biases
Population Bias:
- Training data ancestry distribution: 64.7% EUR, 13.6% AFR, 8.6% AMR, 6.9% EAS, 6.2% SAS
- European ancestry overrepresentation may affect prediction accuracy in underrepresented populations
- Cross-ancestry validation shows that the model maintains reasonable performance across populations, but the best accuracy is achieved with ancestry-matched predictions
- Mitigation: Allele-frequency weighted ensemble predictions aggregate across 5 super-populations to reduce ancestry-specific bias
Tissue Bias:
- GTEx tissue representation varies: some tissues have hundreds of samples (e.g., muscle: 818, whole blood: 803), others have fewer (e.g., bladder: 77, fallopian tube: 29)
- Performance varies across tissue types based on training representation and biological signal-to-noise ratio
- Brain tissues are particularly well represented because GTEx samples many distinct brain regions
Technical Bias:
- Model performance dependent on sequencing quality (high-coverage WGS: 30x required)
- RNA-seq quality filters (RIN > 6) may bias toward well-preserved samples
- Training on bulk RNA-seq limits resolution to tissue-level averages (no cell-type specificity)
Gene Type Bias:
- Lower performance on non-coding genes (ρ=0.54) vs protein-coding (ρ=0.80) reflects biological complexity
- Non-coding RNAs have lower expression, higher noise, and more context-dependent regulation
Risks
Prediction Uncertainty:
- Model outputs represent statistical predictions, not biological certainties
- Predictions should not be interpreted as deterministic causal effects
- Expression predictions are probabilistic estimates with associated confidence intervals
Population Transferability:
- While cross-ancestry validation shows generalization, performance may degrade for populations underrepresented in training data
- Rare population-specific variants may not be accurately predicted
- Linkage disequilibrium patterns differ across ancestries, affecting compound variant effects
Temporal Limitations:
- Training data reflects genomic knowledge as of 2025
- Gene annotations (GENCODE v24), cCRE registry (ENCODE 2020), and reference genome (GRCh38) may become outdated
- Future discoveries may reveal additional regulatory mechanisms not captured by current architecture
Disease Application Risks:
- Disease risk predictions (e.g., Alzheimer's) are exploratory and require experimental validation
- Small cohort sizes (ADNI n=370) limit statistical power compared to large GWAS (n>100,000)
- In-silico editing predictions are counterfactual simulations, not experimental observations
Limitations
Architectural Constraints:
- Context Window: Regulatory context limited to ±1Mb around gene body; distal interactions beyond 1Mb not captured
- Transcription Window: Maximum 300kb downstream from TSS; very long genes may be truncated
- Variant Types: Only SNPs and small indels supported; structural variants, copy number variations, and complex rearrangements excluded
- Phasing: Model uses IUPAC encoding for heterozygous variants but does not explicitly model haplotype phase
Computational Limitations:
- GPU Memory: 16GB+ VRAM required for inference limits accessibility
- Processing Time: Per-gene predictions require processing megabase-scale sequences
- Scalability: Genome-wide predictions for all genes computationally expensive
Biological Scope:
- Bulk RNA-seq: Cannot predict cell-type-specific expression or cell-state variation
- Steady-state expression: Trained on static tissue samples; cannot model dynamic or stimulus-responsive expression
- Post-transcriptional regulation: Model predicts mRNA abundance; does not capture protein-level regulation, splicing isoforms, or RNA stability
Generalization Boundaries:
- Unseen tissues: Performance on tissues not in training data unknown
- Pathological states: Most training data from normal/healthy tissues; disease-state expression may differ
- Environmental factors: Model cannot account for diet, medications, environmental exposures, or lifestyle factors affecting expression
Caveats and Recommendations
Experimental Validation Required:
- All predictions should be validated experimentally before drawing biological conclusions
- In-silico editing and variant effect predictions are computational hypotheses, not experimental evidence
- Disease risk scores are exploratory and not validated for clinical use
Population-Appropriate Application:
- Consider ancestry matching for optimal prediction accuracy
- Use allele-frequency weighted ensemble predictions for diverse populations
- Interpret predictions cautiously for underrepresented ancestries
Responsible Use:
- We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when using the model
- The model is intended to be used for research purposes only and was not designed for clinical, diagnostic, or therapeutic purposes
- Do not use predictions to discriminate against individuals or populations
Computational Considerations:
- GPU resources required; not accessible for all researchers
- Consider computational cost for large-scale variant screening applications
For security or privacy concerns, contact security@chanzuckerberg.com or privacy@chanzuckerberg.com
Acknowledgements
The VariantFormer team acknowledges the contributions of the GTEx Consortium, ENCODE Project, ADNI initiative, and 1000 Genomes Project for providing the foundational datasets that made this work possible.