VariantFormer

Version v0.1 released 02 Nov 2025

License

MIT

Repository

https://github.com/czi-ai/variantformer

Developed By

VariantFormer Team
Chan Zuckerberg Initiative

A biology-guided, 1.2-billion-parameter transformer that predicts gene-level RNA abundance across diverse tissues and cell lines from personalized DNA sequences. It was trained jointly on the largest curated collection of paired whole-genome sequencing and bulk RNA-seq samples to date, drawn from the GTEx, MAGE, ADNI, and ENCODE datasets.

Try Model with Demo Dataset

Associated Resources

Model Details

Model Architecture

VariantFormer employs a biology-guided, two-stage hierarchical transformer architecture designed for personal genome-aware gene expression prediction. The model integrates long-range candidate cis-regulatory elements (cCREs) with gene-specific transcription windows to capture both distal regulatory effects and local genomic features.

Input Representation:

The model processes two complementary genomic contexts for each gene:

CRE (Cis-Regulatory Element) Window: ±1 Mb around the gene body, capturing the candidate cis-regulatory elements (cCREs) within that span. The cCREs are drawn from the ENCODE Registry of approximately 1.06 million elements and include promoter-like sequences (PLS), proximal and distal enhancer-like sequences (pELS, dELS), CTCF-bound elements, and chromatin-accessible regions.
Transcription Window: Extends from 1 kb upstream of the transcription start site to the lesser of 300 kb downstream or the gene end. This window captures promoters, 5' and 3' untranslated regions (UTRs), exons, and introns.
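
The two windows can be sketched as simple coordinate arithmetic. The following is a minimal illustration, not the repository's implementation: the `Gene` record, the 1-based inclusive coordinates, and the minus-strand handling are assumptions made for the example, and clipping at the chromosome end is omitted.

```python
from dataclasses import dataclass

CRE_FLANK_BP = 1_000_000          # ±1 Mb around the gene body
TX_UPSTREAM_BP = 1_000            # 1 kb upstream of the TSS
TX_MAX_DOWNSTREAM_BP = 300_000    # at most 300 kb downstream of the TSS


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record with 1-based inclusive coordinates."""
    chrom: str
    start: int   # leftmost coordinate of the gene body
    end: int     # rightmost coordinate of the gene body
    strand: str  # "+" or "-"


def cre_window(gene: Gene) -> tuple[int, int]:
    """±1 Mb around the gene body, clipped at the chromosome start."""
    return max(1, gene.start - CRE_FLANK_BP), gene.end + CRE_FLANK_BP


def transcription_window(gene: Gene) -> tuple[int, int]:
    """1 kb upstream of the TSS to min(TSS + 300 kb, gene end), strand-aware."""
    if gene.strand == "+":
        tss = gene.start
        left = max(1, tss - TX_UPSTREAM_BP)
        right = min(tss + TX_MAX_DOWNSTREAM_BP, gene.end)
    else:
        # On the minus strand the TSS is the rightmost coordinate and
        # "downstream" runs toward smaller coordinates.
        tss = gene.end
        right = tss + TX_UPSTREAM_BP
        left = max(gene.start, tss - TX_MAX_DOWNSTREAM_BP)
    return left, right
```

For a 500 kb plus-strand gene starting at position 2,000,000, this yields a transcription window of 1,999,000 to 2,300,000, so the last 200 kb of the gene body falls outside the window.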

For each donor, personalized DNA sequences are generated by embedding individual-specific variants using:

Heterozygous variants: IUPAC ambiguity codes (e.g., A/G → R)
Homozygous alternate alleles: Direct ALT base substitution
Strand-aware processing: Negative-strand genes are reverse-complemented
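
As an illustration of the three rules above, here is a minimal sketch for single-nucleotide variants only. The tuple layout, the `"het"`/`"hom_alt"` genotype labels, and the function names are hypothetical; how the released pipeline handles indels and multi-allelic sites is not described here, so neither is covered.

```python
# Two-base IUPAC ambiguity codes, keyed by the unordered allele pair.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}
# Complement table covering the plain bases and the codes above.
COMPLEMENT = str.maketrans("ACGTRYSWKMN", "TGCAYRSWMKN")


def personalize(ref_seq, snvs):
    """Apply SNVs to a reference sequence.

    snvs: iterable of (0-based offset, ref base, alt base, genotype),
    where genotype is "het" or "hom_alt" (a hypothetical encoding).
    """
    seq = list(ref_seq)
    for pos, ref, alt, genotype in snvs:
        if seq[pos] != ref:
            raise ValueError(f"reference mismatch at offset {pos}")
        if genotype == "hom_alt":
            seq[pos] = alt                            # direct ALT substitution
        elif genotype == "het":
            seq[pos] = IUPAC[frozenset((ref, alt))]   # e.g. A/G -> R
        else:
            raise ValueError(f"unknown genotype {genotype!r}")
    return "".join(seq)


def reverse_complement(seq):
    """Reverse-complement, mapping ambiguity codes to their complements."""
    return seq.translate(COMPLEMENT)[::-1]
```

Note that ambiguity codes need their own complements for negative-strand genes: R (A/G) becomes Y (T/C), and K (G/T) becomes M (C/A), while S and W map to themselves.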

Sequences are tokenized using Byte-Pair Encoding (BPE) with a 500-token vocabulary trained on ENCODE cCRE sequences, enabling the model to learn biologically meaningful DNA motifs.

Stage 1: Mutation-Aware Sequence Encoders

Pre-trained transformer encoders generate embeddings for donor-specific genomic sequences:

Architecture: 12-layer transformer with Flash Attention and ALiBi positional encoding
Embedding dimension: 512
Pretraining task: Tissue-specific chromatin accessibility classification using ENTEx data (4 donors, 16 tissues)
Training objective: Combined contrastive loss + binary cross-entropy to ensure mutation sensitivity
Transfer learning strategy:
- CRE encoders: Frozen during downstream training to preserve learned regulatory grammar
- Gene encoders: Fine-tuned end-to-end to adapt to expression prediction

Stage 2: Hierarchical Modulator Architecture

Two parallel transformer stacks process regulatory and genic contexts:

Epigenetics Modulator (CRE Processing):

25 transformer encoder layers refine regulatory element representations
Self-attention enables bidirectional communication between CREs to model combinatorial regulatory logic
Functional annotation cross-attention: Integrates learned embeddings for cCRE types (PLS, dELS, pELS, CTCF-bound, etc.)
Output: Multi-scale CRE representations capturing progressively refined regulatory contexts

Gene Modulator (Transcription Window Processing):

25 transformer layers with cross-attention to CRE representations
Input: Transcription window partitioned into 200 non-overlapping chunks of 200 tokens each
Cross-attention mechanism: Each layer attends to the corresponding CRE representation from the Epigenetics Modulator, modeling enhancer-promoter interactions and distal regulatory effects
Hierarchical integration: Mirrors biological regulatory cascades where primary CRE signals are progressively integrated with higher-order interactions
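
The chunked input described above (200 chunks of 200 tokens, or 40,000 tokens per transcription window) can be sketched as follows. Truncating long sequences and right-padding short ones with a `pad_id` of 0 are assumptions for illustration; the model card does not state how length mismatches are handled.

```python
def chunk_tokens(token_ids, n_chunks=200, chunk_len=200, pad_id=0):
    """Split a token sequence into fixed-size, non-overlapping chunks.

    Sequences longer than n_chunks * chunk_len are truncated; shorter
    ones are right-padded with pad_id (both choices are assumptions).
    """
    capacity = n_chunks * chunk_len
    ids = list(token_ids)[:capacity]
    ids += [pad_id] * (capacity - len(ids))
    return [ids[i:i + chunk_len] for i in range(0, capacity, chunk_len)]
```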

Tissue-Specific Conditioning:

Learnable registry tokens for 62 tissues and cell types
Prepended to gene representations to condition all attention layers
Final registry token representation captures tissue-specific regulatory state

Expression Prediction:

2-layer MLP with GeGLU activation
Softplus output ensures non-negative predictions
Loss function: Poisson negative log-likelihood, appropriate for RNA-seq count data
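
To make the output head concrete, the sketch below shows scalar versions of the softplus activation and the Poisson negative log-likelihood in plain Python. It assumes the MLP emits an unconstrained real value that softplus maps to a non-negative rate; the `eps` guard and the use of `lgamma` for the factorial term are illustrative choices, not details taken from the released code.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x)); the result is never negative."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def poisson_nll(rate, target, eps=1e-8):
    """Poisson negative log-likelihood of `target` under mean `rate`.

    The log(target!) term is written with lgamma so that non-integer
    targets are also accepted; eps guards against log(0).
    """
    return rate - target * math.log(rate + eps) + math.lgamma(target + 1.0)
```

For a fixed target, this loss is smallest when the predicted rate equals the target, which is what makes it a natural fit for count-like expression values.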

Effective Context: >2 Mb regulatory window per gene through dual-window design

Parameters

The model contains 1.2 billion parameters trained on the largest curated collection of paired whole-genome sequencing and bulk RNA-seq samples available.

Citation

Ghosal, S., et al. (2025). VariantFormer: A hierarchical transformer integrating DNA sequences with genetic variation and regulatory landscapes for personalized gene expression prediction. bioRxiv 2025.10.31.685862. DOI: 10.1101/2025.10.31.685862

Model Card Authors

VariantFormer Team, CZI AI

Primary Contact Email

virtualcellmodels@chanzuckerberg.com

System Requirements

VariantFormer requires significant computational resources for inference and depends on specific reference data:

Hardware Requirements:

GPU Memory: 16+ GB VRAM recommended for full precision inference
System Memory: 32+ GB RAM for data preprocessing and sequence generation
Storage: 10+ GB for model checkpoints and reference data
Training (for reference): 376x H100 GPUs used for distributed training

Software Requirements:

Python: 3.12+
PyTorch: 2.0+
CUDA: 11.8+ compatible GPU
Reference Genome: GRCh38 (hg38) assembly
Gene Annotations: GENCODE v24 basic annotations
cCRE Registry: ENCODE Registry of candidate cis-regulatory elements

Data Processing Tools:

bcftools for variant processing
samtools for FASTA indexing
BPE tokenizer with 500-token vocabulary (provided with model)

Model Variants

Model	Description	S3 Path
Mutation-aware encoders	Transformer-based sequence encoders trained on chromatin activity data	https://czi-variantformer.s3.us-west-2.amazonaws.com/model/v4_ag/tokenizer_checkpoint.pth
VF-PCG	Protein-coding gene expression prediction (18,439 genes across 62 tissues and cell lines). Trained for 12 epochs on protein-coding genes.	https://czi-variantformer.s3.us-west-2.amazonaws.com/model/v4_pcg/checkpoint.pth
VF-AG	All annotated gene expression prediction (50,956 genes, including 32,517 non-coding genes across 62 tissues and cell lines). Extended training on the full gene set.	https://czi-variantformer.s3.us-west-2.amazonaws.com/model/v4_ag/checkpoint.pth

Notebooks

Notebook	Description	Path
vcf2exp.ipynb	Tutorial explaining the tissue-specific gene expression prediction from a given VCF with SNPs and indels. Additionally, it shows how to embed a reference genome directly.	https://github.com/czi-ai/variantformer/blob/main/notebooks/vcf2exp.ipynb
variant2exp.ipynb	A tutorial for predicting population-specific effects of mutations and the effects of in silico mutations in the context of a sample genome.	https://github.com/czi-ai/variantformer/blob/main/notebooks/variant2exp.ipynb
vcf2risk.ipynb	Predicting Alzheimer's risk from a sample genome, conditioned on gene and tissue.	https://github.com/czi-ai/variantformer/blob/main/notebooks/vcf2risk.ipynb
variant2risk.ipynb	Predicting the effect of in silico edits to a sample genome on Alzheimer's risk.	https://github.com/czi-ai/variantformer/blob/main/notebooks/variant2risk.ipynb

Intended Use

Primary Use Cases:

Personal genome-aware gene expression prediction
Variant effect prediction on gene regulation
Tissue-specific expression modeling
Disease risk assessment from genetic variants
Population genomics analysis

Out-of-Scope or Unauthorized Use Cases:

Do not use the model for the following purposes:

Clinical diagnosis or treatment recommendations
Direct patient care decisions without experimental validation
Population-level discrimination or bias reinforcement
Any use that violates applicable laws, regulations (including trade compliance laws), or third-party rights such as privacy or intellectual property rights
Any use that is prohibited by the Acceptable Use Policy

Training Details

Training Date

Pretraining: Early 2025 (29 epochs on ENTEx data)
Downstream Training: Mid-2025 (Stage 1: 12 epochs, Stage 2: 10 epochs)

Training Data

The model was trained on the largest curated collection of paired whole-genome sequencing (WGS) and bulk RNA-seq data to date, combining:

Primary Datasets:

GTEx v10: 19,616 tissue samples across 54 anatomical sites from 948 donors
MAGE (1000 Genomes): 731 lymphoblastoid cell lines representing 26 global populations across 5 continental super-populations (AFR, AMR, EAS, EUR, SAS)
ADNI: 808 participants with WGS data; 650 samples with quality-filtered gene expression from Affymetrix microarray (RIN > 6)
ENCODE: 6 cancer cell lines with paired RNA-seq and WGS: A549 (lung), HepG2 (liver), K562 (leukemia), NCI-H460 (lung), Panc1 (pancreas), GM23248 (lymphoblastoid)

Totals:

21,004 RNA-seq samples from 2,330 unique donors
50,956 genes: 18,439 protein-coding + 32,517 non-coding genes
Tissues/Cell Types: 62 (54 GTEx tissues + 6 ENCODE cell lines + ADNI blood + MAGE LCL)

Pretraining Data:

ENTEx: 4 donors, 16 tissues, tissue-specific chromatin activity (DNase-seq accessibility and the H3K4me3 histone mark) for mutation-aware encoder pretraining

Population Diversity:

Inferred Ancestry (GTEx + ADNI): 64.7% EUR, 13.6% AFR, 8.6% AMR, 6.9% EAS, 6.2% SAS
MAGE Reference Populations: Balanced representation across 5 super-populations

Quality Control:

RNA Integrity Number (RIN) > 6 for all samples
Genes with <20 non-zero expression counts filtered
Tissue-specific 10th percentile filtering for non-coding genes

Training Procedure

Training proceeded in two distinct phases: encoder pretraining and downstream gene expression prediction.

Phase 1: Encoder Pretraining (ENTEx Chromatin Activity)

Mutation-aware sequence encoders were pretrained on tissue-specific chromatin accessibility classification:

Objective: Binary classification of cCRE activity (active vs. inactive) + contrastive learning across donors
Data: ENTEx paired WGS and chromatin data (4 donors, 16 tissues)
Architecture: 12-layer transformer encoders (512-dim)
Training Configuration:
- 29 epochs to convergence
- 8 GPUs, per-device batch size 32, gradient accumulation 12 steps (effective batch 3,072)
- Adam optimizer: lr=1e-4, weight decay=0.01
- ReduceLROnPlateau scheduling (patience=2, factor=0.1)
- Mixed precision training (bfloat16)
Holdout: Chromosome 21 for encoder validation

Phase 2: Downstream Gene Expression Prediction

Two-stage training progressively expanded gene coverage:

Stage 1 - Protein-Coding Genes (12 epochs):

Genes: 18,439 protein-coding genes
Training Strategy:
- CRE encoders frozen (preserve pretrained regulatory representations)
- Gene encoders fine-tuned end-to-end
Distributed Training:
- 376 H100 GPUs
- Per-device batch size: 2 gene-donor pairs
- Gradient accumulation: 11 steps
- Tissue sampling: 2 tissues per gene-donor pair
- Effective batch size: 16,544 samples
Optimization:
- AdamW: lr=1e-4, weight decay=0.01, gradient clipping 1.0
- Warmup-cosine schedule (1% warmup, min lr=1e-5)
- Mixed precision (bfloat16)
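
The effective batch sizes quoted for both phases follow directly from the listed settings. A short arithmetic check (the helper name is ours):

```python
def effective_batch(n_gpus, per_device, grad_accum, samples_per_item=1):
    """Number of samples contributing to one optimizer step."""
    return n_gpus * per_device * grad_accum * samples_per_item


# Stage 1: 376 GPUs x 2 gene-donor pairs x 11 accumulation steps x 2 tissues
stage1 = effective_batch(376, 2, 11, samples_per_item=2)
# Encoder pretraining: 8 GPUs x 32 per device x 12 accumulation steps
pretrain = effective_batch(8, 32, 12)

print(stage1, pretrain)  # 16544 3072
```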

Stage 2 - All Annotated Genes (10 epochs):

Genes: 50,956 total (added 32,517 non-coding genes)
Initialization: From Stage 1 checkpoint
Learning Rate: Reduced to 4e-5 (min lr=1e-6) for fine-grained adaptation
Same distributed configuration as Stage 1
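
A warmup-cosine schedule of the kind described for both stages can be sketched as below. Linear warmup and a per-step update are assumptions; the model card gives only the warmup fraction and the peak and minimum learning rates, and the total step count used here is purely illustrative.

```python
import math


def warmup_cosine_lr(step, total_steps, peak_lr, min_lr, warmup_frac=0.01):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 10_000  # hypothetical; the real step count is not stated

# Stage 1 settings: peak 1e-4, floor 1e-5
stage1_mid = warmup_cosine_lr(TOTAL_STEPS // 2, TOTAL_STEPS, 1e-4, 1e-5)
# Stage 2 settings: peak 4e-5, floor 1e-6
stage2_mid = warmup_cosine_lr(TOTAL_STEPS // 2, TOTAL_STEPS, 4e-5, 1e-6)
```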

Loss Function:

Poisson negative log-likelihood for count-based RNA-seq data
Trained on log1p-transformed TPM values

Holdout Strategy:

GTEx/MAGE/ADNI: 10% donor-level holdout (stratified by ancestry for MAGE)
ENCODE: Chromosome 19 genes held out (somatic variant generalization test)
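
The two holdout schemes differ in what is withheld: donors in one case, genes in the other. The sketch below illustrates both. The stratified sampling procedure, the seed, and the rule of holding out at least one donor per group are assumptions for illustration and may not match the split used for the released model.

```python
import random
from collections import defaultdict


def donor_holdout(donor_to_group, frac=0.10, seed=0):
    """Hold out roughly `frac` of donors within each group (e.g. ancestry)."""
    by_group = defaultdict(list)
    for donor, group in donor_to_group.items():
        by_group[group].append(donor)
    rng = random.Random(seed)
    held_out = set()
    for group in sorted(by_group):
        donors = sorted(by_group[group])  # sort first so the split is reproducible
        rng.shuffle(donors)
        held_out.update(donors[:max(1, round(len(donors) * frac))])
    return held_out


def is_heldout_gene(chrom, heldout_chroms=("chr19",)):
    """Chromosome-level holdout, as described for the ENCODE cell lines."""
    return chrom in heldout_chroms
```

Splitting at the donor level means every tissue sample from a held-out donor is excluded from training, which prevents the same genome from appearing on both sides of the split.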

Training Code

https://github.com/czi-ai/variantformer

Data Sources

Performance Metrics

Evaluation Framework

VariantFormer is evaluated using two complementary metrics that capture different aspects of prediction quality:

Gene Correlation: Measures how accurately the model predicts expression variability across donors and tissues for individual genes (Spearman ρ per gene, averaged across genes)
Subject Correlation: Measures how accurately the model predicts expression variability across genes within individual samples (Spearman ρ per sample, averaged across samples)
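
The two metrics are the same statistic computed along different axes of a genes-by-samples matrix. A dependency-free sketch follows, assuming rows are genes and columns are samples; the function names are ours, and constant rows or columns (which make Spearman ρ undefined) are not handled.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rho: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def gene_correlation(pred, obs):
    """Mean per-gene rho; rows are genes, columns are samples."""
    return sum(spearman(p, o) for p, o in zip(pred, obs)) / len(pred)


def subject_correlation(pred, obs):
    """Mean per-sample rho, computed on the transposed matrices."""
    return gene_correlation(list(zip(*pred)), list(zip(*obs)))
```

Because subject correlation ranks genes within a sample, a model that only learns each gene's average expression level can already score highly on it; gene correlation removes that shortcut by ranking samples within a gene.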

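All evaluations were performed on held-out test sets: donor-level holdouts with no donor overlap with the training data for GTEx, MAGE, and ADNI, and held-out chromosome 19 genes for ENCODE.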

Gene Expression Prediction Performance

Table 1: Gene Correlation (Cross-Donor/Tissue Variability)

This metric evaluates the challenging task of predicting individual-level and tissue-specific expression variation:

Model	Protein-Coding (n=18,439)	Non-Protein-Coding (n=32,517)
VariantFormer-AG	0.804 (95% CI: 0.802-0.806)	0.544 (95% CI: 0.542-0.547)
VariantFormer-PCG	0.803 (95% CI: 0.801-0.805)	—
TWAS Random Forest	0.787 (95% CI: 0.785-0.789)	0.469 (95% CI: 0.466-0.472)
Enformer	0.774 (95% CI: 0.772-0.777)	0.507 (95% CI: 0.504-0.510)
Borzoi	0.769 (95% CI: 0.767-0.771)	0.476 (95% CI: 0.473-0.479)

Key Findings:

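VariantFormer-AG achieves a 2.2% relative improvement in mean gene correlation over TWAS (0.804 vs. 0.787) and a 3.9% relative improvement over Enformer (0.804 vs. 0.774) for protein-coding genes
For non-coding genes, VariantFormer-AG shows a 16.0% relative improvement over TWAS (0.544 vs. 0.469) and a 7.3% relative improvement over Enformer (0.544 vs. 0.507)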
Gene correlation is more discriminative than subject correlation, revealing meaningful model differences
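
The percentages in the key findings are relative gains in mean Spearman ρ, and can be reproduced from the Table 1 point estimates:

```python
def relative_gain_pct(model, baseline):
    """Relative improvement of `model` over `baseline`, in percent."""
    return 100.0 * (model - baseline) / baseline


# Mean gene correlations from Table 1 (VariantFormer-AG vs. baselines).
gains = {
    "protein-coding vs TWAS": relative_gain_pct(0.804, 0.787),
    "protein-coding vs Enformer": relative_gain_pct(0.804, 0.774),
    "non-coding vs TWAS": relative_gain_pct(0.544, 0.469),
    "non-coding vs Enformer": relative_gain_pct(0.544, 0.507),
}
for name, gain in gains.items():
    print(f"{name}: {gain:.1f}%")
```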

Table 2: Subject Correlation (Cross-Gene Variability)

This metric primarily captures differences in mean expression between genes and is near saturation across models:

Model	Protein-Coding	Non-Protein-Coding
VariantFormer-AG	0.97	0.87
TWAS Random Forest	0.97	0.87
Enformer	0.96	0.86
Borzoi	0.96	0.86

Somatic Variant Generalization (ENCODE Cell Lines)

Performance on held-out chromosome 19 genes with high somatic mutation burden:

Cell Line	VariantFormer-AG	Enformer	Borzoi
GM23248 (lymphoblastoid)	0.848	0.752	0.655
Panc1 (pancreatic cancer)	0.840	0.713	0.627
HepG2 (liver cancer)	0.834	0.706	0.613
A549 (lung cancer)	0.805	0.678	0.589
NCI-H460 (lung cancer)	0.800	0.668	0.579
K562 (leukemia)	0.763	0.613	0.549

Note: TWAS models cannot generalize to unseen genes or novel variant combinations, and are excluded from this evaluation.

Variant Effect Prediction (eQTL Validation)

Table 3: eQTL Replication Across Independent Studies

Spearman correlation between predicted variant effects and empirical eQTL slopes:

Dataset	VariantFormer-AG (Ensemble)	Borzoi	AlphaGenome
All variants (6 tissues combined)	0.60	~0.0	~0.0
Rare variants (MAF < 5%)	0.20	0.04	0.06

Tissues evaluated: TwinsUK (adipose, blood, skin) and brain tissues (substantia nigra, frontal cortex BA9, putamen)

Population-Specific eQTL Performance (BrainSeq frontal cortex):

Variant Set	VF-EUR	VF-AFR	Allele-Freq Weighted
EUR-enriched (AF_EUR > 10%, AF_AFR < 5%)	0.27 (AG) / 0.28 (PCG)	0.04 (AG) / 0.19 (PCG)	0.33 (AG) / 0.38 (PCG)
AFR-enriched (AF_AFR > 10%, AF_EUR < 5%)	0.04 (AG) / 0.35 (PCG)	0.27 (AG) / 0.23 (PCG)	0.20 (AG) / 0.25 (PCG)

Disease Risk Prediction: Alzheimer's Disease

Supervised Classification (ADNI Cohort, n=370)

Using tissue-specific gene embeddings with random forest classifiers on top-10 gene-tissue pairs:

VariantFormer-PCG: Best AUPRC on held-out test set (n=40 donors)
Cross-validation: Strong performance across 330 training donors
Top predictive genes: APOE, TOMM40, and other AD-associated loci identified

Zero-Shot MAGMA Enrichment:

Multiple brain tissues show significant enrichment (p < 0.05): anterior cingulate cortex, cerebellar hemisphere, frontal cortex, cortex
Tissue-specific signal not achievable with tissue-agnostic models

APOE In-Silico Editing Results:

Variant	Effect Direction	Log Odds Ratio
APOE-ε4 (rs429358)	Risk-increasing	+1.06 (95% CI: 0.95 to 1.18)
APOE-ε2 (rs7412)	Protective	-0.29 (95% CI: -0.41 to -0.17)

Predictions recapitulate known APOE allele risk architecture through in-silico mutation of patient genomes.
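
For readers more used to odds ratios, the log odds ratios in the table can be exponentiated. This sketch assumes the values are natural-log odds ratios, which the table does not state explicitly.

```python
import math


def odds_ratio(log_or, ci_low, ci_high):
    """Convert a natural-log odds ratio and its CI bounds to the odds-ratio scale."""
    return tuple(math.exp(v) for v in (log_or, ci_low, ci_high))


apoe_e4 = odds_ratio(1.06, 0.95, 1.18)     # rs429358, risk-increasing
apoe_e2 = odds_ratio(-0.29, -0.41, -0.17)  # rs7412, protective

print("APOE-e4 OR: %.2f (%.2f to %.2f)" % apoe_e4)
print("APOE-e2 OR: %.2f (%.2f to %.2f)" % apoe_e2)
```

Under that assumption, the ε4 estimate corresponds to an odds ratio of roughly 2.9 and the ε2 estimate to roughly 0.75.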

Evaluation Datasets

Independent eQTL studies: eQTL Catalogue (6 tissue-specific datasets: TwinsUK, Braineac2, BrainSeq)
Population stratification: 1000 Genomes ancestry-matched validation
Rare variant validation: MAF < 5% subset with 10,000 variants per tissue
Disease cohorts: ADNI (Alzheimer's), ENCODE cancer cell lines
Cross-ancestry testing: EUR, AFR, EAS, SAS, AMR populations

Biases, Risks, and Limitations

Potential Biases

Population Bias:

Training data ancestry distribution: 64.7% EUR, 13.6% AFR, 8.6% AMR, 6.9% EAS, 6.2% SAS
European ancestry overrepresentation may affect prediction accuracy in underrepresented populations
Cross-ancestry validation shows that the model maintains reasonable performance across populations, but optimal accuracy is achieved with ancestry-matched predictions
Mitigation: Allele-frequency weighted ensemble predictions aggregate across 5 super-populations to reduce ancestry-specific bias
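
One plausible reading of an allele-frequency weighted ensemble is a weighted mean of population-specific predicted effects, with each super-population weighted by the variant's allele frequency in that population. The sketch below implements that reading only; the exact weighting used by VariantFormer is not specified in this model card, and the function name, inputs, and fallback rule are hypothetical.

```python
def af_weighted_effect(effects, allele_freqs):
    """Average population-specific effects, weighted by allele frequency.

    effects: predicted variant effect per super-population (hypothetical input).
    allele_freqs: allele frequency of the variant per super-population.
    """
    total = sum(allele_freqs[pop] for pop in effects)
    if total == 0:
        # Variant absent from every population: fall back to a plain mean.
        return sum(effects.values()) / len(effects)
    return sum(effects[pop] * allele_freqs[pop] for pop in effects) / total
```

Under this scheme, a prediction made in a population where the variant is common contributes more than one made in a population where it is nearly absent.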

Tissue Bias:

GTEx tissue representation varies: some tissues have hundreds of samples (e.g., muscle: 818, whole blood: 803), others have fewer (e.g., bladder: 77, fallopian tube: 29)
Performance varies across tissue types based on training representation and biological signal-to-noise ratio
Brain tissues are particularly well represented because GTEx samples multiple brain regions

Technical Bias:

Model performance depends on sequencing quality (high-coverage WGS at 30x is required)
RNA-seq quality filters (RIN > 6) may bias toward well-preserved samples
Training on bulk RNA-seq limits resolution to tissue-level averages (no cell-type specificity)

Gene Type Bias:

Lower performance on non-coding genes (ρ=0.54) vs protein-coding (ρ=0.80) reflects biological complexity
Non-coding RNAs have lower expression, higher noise, and more context-dependent regulation

Risks

Prediction Uncertainty:

Model outputs represent statistical predictions, not biological certainties
Predictions should not be interpreted as deterministic causal effects
Expression predictions are probabilistic estimates with associated confidence intervals

Population Transferability:

While cross-ancestry validation shows generalization, performance may degrade for populations underrepresented in training data
Rare population-specific variants may not be accurately predicted
Linkage disequilibrium patterns differ across ancestries, affecting compound variant effects

Temporal Limitations:

Training data reflects genomic knowledge as of 2025
Gene annotations (GENCODE v24), cCRE registry (ENCODE 2020), and reference genome (GRCh38) may become outdated
Future discoveries may reveal additional regulatory mechanisms not captured by current architecture

Disease Application Risks:

Disease risk predictions (e.g., Alzheimer's) are exploratory and require experimental validation
Small cohort sizes (ADNI n=370) limit statistical power compared to large GWAS (n>100,000)
In-silico editing predictions are counterfactual simulations, not experimental observations

Limitations

Architectural Constraints:

Context Window: Regulatory context limited to ±1Mb around gene body; distal interactions beyond 1Mb not captured
Transcription Window: Maximum 300kb downstream from TSS; very long genes may be truncated
Variant Types: Only SNPs and small indels supported; structural variants, copy number variations, and complex rearrangements excluded
Phasing: Model uses IUPAC encoding for heterozygous variants but does not explicitly model haplotype phase

Computational Limitations:

GPU Memory: 16+ GB VRAM is recommended for full-precision inference, which limits accessibility
Processing Time: Per-gene predictions require processing megabase-scale sequences
Scalability: Genome-wide predictions for all genes computationally expensive

Biological Scope:

Bulk RNA-seq: Cannot predict cell-type-specific expression or cell-state variation
Steady-state expression: Trained on static tissue samples; cannot model dynamic or stimulus-responsive expression
Post-transcriptional regulation: Model predicts mRNA abundance; does not capture protein-level regulation, splicing isoforms, or RNA stability

Generalization Boundaries:

Unseen tissues: Performance on tissues not in training data unknown
Pathological states: Most training data from normal/healthy tissues; disease-state expression may differ
Environmental factors: Model cannot account for diet, medications, environmental exposures, or lifestyle factors affecting expression

Caveats and Recommendations

Experimental Validation Required:

All predictions should be validated experimentally before drawing biological conclusions
In-silico editing and variant effect predictions are computational hypotheses, not experimental evidence
Disease risk scores are exploratory and not validated for clinical use

Population-Appropriate Application:

Consider ancestry matching for optimal prediction accuracy
Use allele-frequency weighted ensemble predictions for diverse populations
Interpret predictions cautiously for underrepresented ancestries

Responsible Use:

We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when using the model.
The model is intended to be used for research purposes only and was not designed for clinical, diagnostic, or therapeutic purposes
Do not use predictions to discriminate against individuals or populations

Computational Considerations:

GPU resources required; not accessible for all researchers
Consider computational cost for large-scale variant screening applications

For security or privacy concerns, contact security@chanzuckerberg.com or privacy@chanzuckerberg.com

Acknowledgements

The VariantFormer team acknowledges the contributions of the GTEx Consortium, ENCODE Project, ADNI initiative, and 1000 Genomes Project for providing the foundational datasets that made this work possible.
