CodonFM
Version 1.0.0, released 28 Oct 2025
CodonFM Encodon models predict masked codons in mRNA sequences to enable variant effect interpretation and codon optimization. The model suite spans 80M to 1B parameters, with both random and codon frequency-aware masking schemes. Designed for mRNA design, expression optimization, variant interpretation, and synthetic biology, the models process full coding sequences at codon resolution for research and development.
Developed By
NVIDIA Corporation
Model Details
Model Architecture
The NVIDIA CodonFM Encodon family features Transformer-based architectures tailored for codon-level sequence modeling in mRNA. Each model applies a masked language modeling (MLM) objective to predict masked codons from surrounding context, enabling genome-scale codon optimization and synonymous variant interpretation. The models process sequences up to 2,046 codons (6,138 nucleotides) and output codon probability distributions for each position.
Parameters
| Model Name | Parameters |
|---|---|
| Encodon-80M | 7.68 × 10⁷ |
| Encodon-600M | 6.09 × 10⁸ |
| Encodon-1B | 9.11 × 10⁸ |
| Encodon-Cdwt-1B | 9.11 × 10⁸ |
Model Card Authors
Laksshman Sundaram (NVIDIA)
Primary Contact Email
Laksshman Sundaram nv-codonfm@nvidia.com
To submit feature requests or report issues with the model, please open an issue on the GitHub repository.
System Requirements
Requires NVIDIA GPU (Ampere or Hopper architecture or newer) with CUDA support. Linux OS recommended. Optimized for use with CUDA libraries.
Model Variants
| Model Variant Name | Description | Access URL |
|---|---|---|
| Encodon-80M | 80 million parameter BERT-style transformer model trained with random masking. | https://huggingface.co/nvidia/NV-CodonFM-Encodon-80M-v1 |
| Encodon-600M | 600 million parameter BERT-style transformer model trained with random masking. | https://huggingface.co/nvidia/NV-CodonFM-Encodon-600M-v1 |
| Encodon-1B | 1 billion parameter BERT-style transformer model trained with random masking. | https://huggingface.co/nvidia/NV-CodonFM-Encodon-1B-v1 |
| Encodon-Cdwt-1B | 1 billion parameter BERT-style transformer model trained with codon weighted masking. | https://huggingface.co/nvidia/NV-CodonFM-Encodon-Cdwt-1B-v1 |
Intended Use
Primary Use Cases
- mRNA Design (Cell Type-Specific): Optimize codon usage in therapeutic constructs.
- mRNA Stability Prediction: Estimate transcript degradation based on codon composition.
- Protein Yield Optimization: Maximize protein expression efficiency.
- Synonymous Variant Interpretation: Predict the functional impact of synonymous variants.
- Codon Bias Analysis: Study evolutionary patterns of codon preference.
- Synthetic Biology Applications: Tune codon usage for metabolic or synthetic pathways.
- Translation Regulation Studies: Explore codon effects on ribosome dynamics.
- Transfer Learning for Genomics: Fine-tune pretrained CodonFM models for specialized tasks.
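The variant-interpretation use cases above rest on the per-position codon probability distributions the models output. A minimal scoring sketch, using a mocked distribution in place of real model output; the log-likelihood-ratio scoring rule and all names here are illustrative assumptions, not a documented API:

```python
import math

def score_synonymous_variant(probs, pos, ref_codon, alt_codon):
    """Score a codon substitution as the log-likelihood ratio of the
    alternate vs. reference codon under the model's distribution at `pos`.
    `probs` maps position -> {codon: probability} (mocked model output)."""
    p = probs[pos]
    return math.log(p[alt_codon] / p[ref_codon])

# Mock distribution at one position: the model strongly prefers CTG in context.
mock_probs = {4: {"CTG": 0.70, "CTA": 0.05, "CTC": 0.15, "CTT": 0.10}}
llr = score_synonymous_variant(mock_probs, 4, ref_codon="CTG", alt_codon="CTA")
# A negative score means the variant codon is less likely in context
# than the reference, flagging it for further biological validation.
```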
Out-of-Scope or Unauthorized Use Cases
Do not use the model for the following purposes:
- Violating laws, regulations, or third-party rights (e.g. privacy, IP).
- Any use excluded under the NVIDIA Open Model License Agreement.
- Medical or clinical decision-making without biological validation.
- Generating synthetic sequences for therapeutic use without oversight.
- Use outside of research, academic, or preclinical contexts.
Training Details
Training Date
2024-04-15
Training Data
The models were trained using coding sequences from NCBI RefSeq (Release 2024-04).
- Size: 131,000,000 non-viral protein-coding sequences from >20,000 species, including >2,000 eukaryotes.
- Filtering: Excluded sequences with lengths not divisible by three, sequences containing ambiguous codons, and genes from human-pathogenic bacteria.
- Partitioning: Balanced by species within nine major taxonomic groups.
- Objective: Masked language modeling (MLM) with random or codon-weighted masking.
- Input Format: FASTA-derived memory maps.
- Dataset License: Public Domain.
Training Procedure
Training: The models were pretrained with a masked language modeling (MLM) objective, in which the model predicts masked codons to learn the contextual interplay between codons. The models were trained on over 130 million protein-coding sequences, processing trillions of tokens. The masking strategy can be adjusted to account for codon frequencies observed in the pre-training corpus.
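The frequency-aware (Cdwt) masking is only described at a high level here. A minimal sketch of one plausible weighting, in which rarer codons are masked more often; the inverse-frequency rule and the toy frequencies are assumptions for illustration, not the documented training recipe:

```python
import random

def weighted_mask_positions(codons, codon_freq, mask_rate=0.15, seed=0):
    """Choose positions to mask, weighting inversely by corpus codon
    frequency so rarer codons are masked more often (illustrative rule;
    the actual Cdwt weighting is not specified in this card)."""
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(codons)))
    weights = [1.0 / codon_freq.get(c, 1e-6) for c in codons]
    picks = rng.choices(range(len(codons)), weights=weights, k=n_mask)
    return sorted(set(picks))  # de-duplicate: choices samples with replacement

codons = ["ATG", "CTG", "CTA", "GGC", "CTG", "GAT", "TAA"]
freq = {"ATG": 0.022, "CTG": 0.040, "CTA": 0.007, "GGC": 0.022,
        "GAT": 0.022, "TAA": 0.001}
masked = weighted_mask_positions(codons, freq)
```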
Data Curation and Pre-processing: The training corpus was curated from the NCBI RefSeq database, targeting reference assemblies of non-viral species. This raw data underwent a rigorous filtering and cleaning process:
- Filtering: Sequences were removed if they (a) had a length not divisible by 3, or (b) contained any missing or ambiguous codons. Sequences originating from the same species were de-duplicated.
- Corpus Refinement: Known human-pathogenic bacterial species were explicitly removed from the dataset.
This procedure resulted in a final dataset of >130,000,000 coding sequences from >20,000 distinct species.
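The filtering criteria above can be sketched as a single predicate. The function name and the pathogen flag are illustrative assumptions; the actual pipeline is described only at the level above:

```python
def keep_cds(seq, from_pathogen=False):
    """Apply the filters described above: drop sequences from known
    human-pathogenic bacteria, sequences whose length is not divisible
    by 3, and sequences with ambiguous bases (anything outside A/C/G/T)."""
    if from_pathogen:
        return False
    if len(seq) % 3 != 0:
        return False
    return set(seq.upper()) <= set("ACGT")
```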
Tokenization: The models are codon-resolution language models, meaning they process sequences as trinucleotide tokens (codons) rather than individual nucleotides. The Encodon models accept a maximum input sequence length of 2,046 codon tokens.
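Codon-resolution tokenization amounts to splitting the CDS into trinucleotide tokens and capping the length, which can be sketched as follows (the function name is illustrative, not the released tokenizer API):

```python
MAX_CODONS = 2046  # Encodon maximum input length (6,138 nt)

def tokenize_cds(seq, max_codons=MAX_CODONS):
    """Split a coding sequence into trinucleotide (codon) tokens,
    truncating to the model's maximum context length."""
    if len(seq) % 3 != 0:
        raise ValueError("CDS length must be divisible by 3")
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return codons[:max_codons]

tokens = tokenize_cds("ATGGCCAAGTAA")  # → ['ATG', 'GCC', 'AAG', 'TAA']
```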
Training and Validation Split: To ensure a robust evaluation and prevent data leakage, the dataset was split using a stratified, cluster-based method:
- Grouping: All sequences were first divided into 9 high-level groups based on species (e.g., primates, bacteria, fungi, plants).
- Clustering: MMSeqs was used to cluster the coding sequences by sequence similarity.
- Splitting: The resulting clusters were divided into training and validation sets, stratified by species group so that both sets have a representative distribution of organisms.
Training Code
Available via GitHub.
Data Sources
The NCBI RefSeq Protein-Coding Sequences dataset (Release 2024-04) was used as the primary data source for training the CodonFM models.
Performance Metrics
Evaluation Datasets
| Task | Description |
|---|---|
| ClinVar variant interpretation | This task classifies genetic variants from ClinVar, a publicly available database that aggregates information about the clinical significance of human genetic variants, into pathogenic or benign categories based on their coding sequence context. |
| De novo variant classification | This task uses variants from the Deciphering Developmental Disorders (DDD) and autism spectrum disorder (ASD) cohort studies, which catalog genetic mutations linked to rare pediatric and developmental diseases, to evaluate classification of pathogenic versus benign variants based on coding sequence context. |
| mRNA translation efficiency | This task predicts ribosome profiling signal intensity along coding sequences, evaluating how well models capture translation efficiency and codon-level regulation from sequence context (see Zheng et al. 2024 for details). |
| Protein abundance | This task predicts fluorescent protein expression levels (mRFP) from coding sequences, testing how accurately models capture codon-dependent effects on translation efficiency and protein abundance (see Li et al. 2023 for details). |
Biases, Risks, and Limitations
Potential Biases
- The model may reflect biases present in the training data.
Risks
Areas of risk may include but are not limited to:
- Inaccurate outputs
- Potential misuse for incorrect biological interpretations
- Unregulated use in clinical or therapeutic contexts
- Potential misuse for generating synthetic sequences without ethical review
Limitations
- The models were pre-trained exclusively on mRNA coding sequences and are not expected to perform well on other input types.
- Input sequences longer than 2,046 codons require truncation or windowing.
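For sequences over the context limit, a simple overlapping-window scheme is one option. A minimal sketch; the window and overlap sizes below are illustrative choices, not a documented inference procedure:

```python
MAX_CODONS = 2046  # Encodon maximum input length

def codon_windows(codons, window=MAX_CODONS, overlap=256):
    """Yield (start, window) pairs of codon tokens covering a sequence
    longer than the model's 2,046-codon context. Consecutive windows
    overlap by `overlap` codons so every position retains some context."""
    if len(codons) <= window:
        yield 0, codons
        return
    step = window - overlap
    for start in range(0, len(codons) - overlap, step):
        yield start, codons[start:start + window]
```

Per-position predictions from overlapping windows would then need to be reconciled (e.g., by preferring the window where the position is most central).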
Caveats and Recommendations
- Review and validate outputs generated by the model.
- We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when using the model.
Acknowledgements
Developed by NVIDIA Corporation.