CodonFM
Version 1.0.0, released 28 Oct 2025
CodonFM Encodon models predict masked codons in mRNA sequences to enable variant effect interpretation and codon optimization. The model suite spans 80M to 1B parameters, with both random and codon frequency-aware masking schemes. Designed for mRNA design, expression optimization, variant interpretation, and synthetic biology, the models process full coding sequences at codon resolution for research and development.
Developed By
NVIDIA Corporation
Model Details
Model Architecture
The NVIDIA CodonFM Encodon family features Transformer-based architectures tailored for codon-level sequence modeling in mRNA. Each model applies a masked language modeling (MLM) objective to predict masked codons from surrounding context, enabling genome-scale codon optimization and synonymous variant interpretation. The models process sequences up to 2,046 codons (6,138 nucleotides) and output codon probability distributions for each position.
Parameters
| Model Name | Parameters |
|---|---|
| Encodon-80M | 7.68 × 10⁷ |
| Encodon-600M | 6.09 × 10⁸ |
| Encodon-1B | 9.11 × 10⁸ |
| Encodon-Cdwt-1B | 9.11 × 10⁸ |
Model Card Authors
Laksshman Sundaram (NVIDIA)
Primary Contact Email
Laksshman Sundaram nv-codonfm@nvidia.com
To submit feature requests or report issues with the model, please open an issue on the GitHub repository.
System Requirements
Requires NVIDIA GPU (Ampere or Hopper architecture or newer) with CUDA support. Linux OS recommended. Optimized for use with CUDA libraries.
Model Variants
| Model Variant Name | Description | Access URL |
|---|---|---|
| Encodon-80M | 80 million parameter BERT-style transformer model trained with random masking. | https://huggingface.co/nvidia/NV-CodonFM-Encodon-80M-v1 |
| Encodon-600M | 600 million parameter BERT-style transformer model trained with random masking. | https://huggingface.co/nvidia/NV-CodonFM-Encodon-600M-v1 |
| Encodon-1B | 1 billion parameter BERT-style transformer model trained with random masking. | https://huggingface.co/nvidia/NV-CodonFM-Encodon-1B-v1 |
| Encodon-Cdwt-1B | 1 billion parameter BERT-style transformer model trained with codon weighted masking. | https://huggingface.co/nvidia/NV-CodonFM-Encodon-Cdwt-1B-v1 |
Intended Use
Primary Use Cases
- mRNA Design (Cell Type-Specific): Optimize codon usage in therapeutic constructs.
- mRNA Stability Prediction: Estimate transcript degradation based on codon composition.
- Protein Yield Optimization: Maximize protein expression efficiency.
- Synonymous Variant Interpretation: Predict the functional impact of synonymous variants.
- Codon Bias Analysis: Study evolutionary patterns of codon preference.
- Synthetic Biology Applications: Tune codon usage for metabolic or synthetic pathways.
- Translation Regulation Studies: Explore codon effects on ribosome dynamics.
- Transfer Learning for Genomics: Fine-tune pretrained CodonFM models for specialized tasks.
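The variant-interpretation use cases above rest on the per-position codon probability distributions the models output. A minimal scoring sketch, using a mocked distribution in place of real model output; the log-likelihood-ratio scoring rule and all names here are illustrative assumptions, not a documented API:

```python
import math

def score_synonymous_variant(probs, pos, ref_codon, alt_codon):
    """Score a codon substitution as the log-likelihood ratio of the
    alternate vs. reference codon under the model's distribution at `pos`.
    `probs` maps position -> {codon: probability} (mocked model output)."""
    p = probs[pos]
    return math.log(p[alt_codon] / p[ref_codon])

# Mock distribution at one position: the model strongly prefers CTG in context.
mock_probs = {4: {"CTG": 0.70, "CTA": 0.05, "CTC": 0.15, "CTT": 0.10}}
llr = score_synonymous_variant(mock_probs, 4, ref_codon="CTG", alt_codon="CTA")
# A negative score means the variant codon is less likely in context
# than the reference, flagging it for further biological validation.
```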
Out-of-Scope or Unauthorized Use Cases
Do not use the model for the following purposes:
- Violating laws, regulations, or third-party rights (e.g. privacy, IP).
- Any use excluded under the NVIDIA Open Model License Agreement.
- Medical or clinical decision-making without biological validation.
- Generating synthetic sequences for therapeutic use without oversight.
- Use outside of research, academic, or preclinical contexts.
Training Details
Training Date
2024-04-15
Training Data
The models were trained using coding sequences from NCBI RefSeq (Release 2024-04).
- Size: 131,000,000 non-viral protein-coding sequences from >20,000 species, including >2,000 eukaryotes.
- Filtering: Excluded sequences with lengths not divisible by three, sequences containing ambiguous codons, and genes from human-pathogenic bacteria.
- Partitioning: Balanced by species within nine major taxonomic groups.
- Objective: Masked language modeling (MLM) with random or codon-weighted masking.
- Input Format: FASTA-derived memory maps.
- Dataset License: Public Domain.
Training Procedure
Training: The models were pretrained with a masked language modeling (MLM) objective, in which the model predicts masked codons to learn the contextual interplay between codons. The models were trained on over 130 million protein-coding sequences, processing trillions of tokens. The masking strategy can be adjusted to account for codon frequencies observed in the pre-training corpus.
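The frequency-aware (Cdwt) masking is only described at a high level here. A minimal sketch of one plausible weighting, in which rarer codons are masked more often; the inverse-frequency rule and the toy frequencies are assumptions for illustration, not the documented training recipe:

```python
import random

def weighted_mask_positions(codons, codon_freq, mask_rate=0.15, seed=0):
    """Choose positions to mask, weighting inversely by corpus codon
    frequency so rarer codons are masked more often (illustrative rule;
    the actual Cdwt weighting is not specified in this card)."""
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(codons)))
    weights = [1.0 / codon_freq.get(c, 1e-6) for c in codons]
    picks = rng.choices(range(len(codons)), weights=weights, k=n_mask)
    return sorted(set(picks))  # de-duplicate: choices samples with replacement

codons = ["ATG", "CTG", "CTA", "GGC", "CTG", "GAT", "TAA"]
freq = {"ATG": 0.022, "CTG": 0.040, "CTA": 0.007, "GGC": 0.022,
        "GAT": 0.022, "TAA": 0.001}
masked = weighted_mask_positions(codons, freq)
```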
Data Curation and Pre-processing: The training corpus was curated from the NCBI RefSeq database, targeting reference assemblies of non-viral species. This raw data underwent a rigorous filtering and cleaning process:
- Filtering: Sequences were removed if they (a) had a length not divisible by 3, or (b) contained any missing or ambiguous codons. Sequences originating from the same species were de-duplicated.
- Corpus Refinement: Known human-pathogenic bacterial species were explicitly removed from the dataset.
This procedure resulted in a final dataset of >130,000,000 coding sequences from >20,000 distinct species.
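The filtering criteria above can be sketched as a single predicate. The function name and the pathogen flag are illustrative assumptions; the actual pipeline is described only at the level above:

```python
def keep_cds(seq, from_pathogen=False):
    """Apply the filters described above: drop sequences from known
    human-pathogenic bacteria, sequences whose length is not divisible
    by 3, and sequences with ambiguous bases (anything outside A/C/G/T)."""
    if from_pathogen:
        return False
    if len(seq) % 3 != 0:
        return False
    return set(seq.upper()) <= set("ACGT")
```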
Tokenization: The models are codon-resolution language models, meaning they process sequences as trinucleotide tokens (codons) rather than individual nucleotides. The Encodon models accept a maximum input sequence length of 2,046 codon tokens.
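Codon-resolution tokenization amounts to splitting the CDS into trinucleotide tokens and capping the length, which can be sketched as follows (the function name is illustrative, not the released tokenizer API):

```python
MAX_CODONS = 2046  # Encodon maximum input length (6,138 nt)

def tokenize_cds(seq, max_codons=MAX_CODONS):
    """Split a coding sequence into trinucleotide (codon) tokens,
    truncating to the model's maximum context length."""
    if len(seq) % 3 != 0:
        raise ValueError("CDS length must be divisible by 3")
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return codons[:max_codons]

tokens = tokenize_cds("ATGGCCAAGTAA")  # → ['ATG', 'GCC', 'AAG', 'TAA']
```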
Training and Validation Split: To ensure a robust evaluation and prevent data leakage, the dataset was split using a stratified, cluster-based method:
- Grouping: All sequences were first divided into 9 high-level groups based on species (e.g., primates, bacteria, fungi, plants).
- Clustering: MMSeqs was used to cluster the coding sequences by sequence similarity.
- Splitting: The resulting clusters were divided into training and validation sets, stratified by species group so that both sets have a representative distribution of organisms.
Training Code
Available via GitHub.
Data Sources
The NCBI RefSeq Protein-Coding Sequences dataset (Release 2024-04) was used as the primary data source for training the CodonFM models.
Performance Metrics
Evaluation Datasets
| Task | Description |
|---|---|
| ClinVar variant interpretation | This task classifies genetic variants from ClinVar, a publicly available database that aggregates information about the clinical significance of human genetic variants, into pathogenic or benign categories based on their coding sequence context. |
| De novo variant classification | This task uses variants from the Deciphering Developmental Disorders (DDD) and autism spectrum disorder (ASD) cohort studies, which catalog genetic mutations linked to rare pediatric and developmental diseases, to evaluate classification of pathogenic versus benign variants based on coding sequence context. |
| mRNA translation efficiency | This task predicts ribosome profiling signal intensity along coding sequences, evaluating how well models capture translation efficiency and codon-level regulation from sequence context (see Zheng et al. 2024 for details). |
| Protein abundance | This task predicts fluorescent protein expression levels (mRFP) from coding sequences, testing how accurately models capture codon-dependent effects on translation efficiency and protein abundance (see Li et al. 2023 for details). |
Biases, Risks, and Limitations
Potential Biases
- The model may reflect biases present in the training data.
Risks
Areas of risk may include but are not limited to:
- Inaccurate outputs
- Potential misuse for incorrect biological interpretations
- Unregulated use in clinical or therapeutic contexts
- Potential misuse for generating synthetic sequences without ethical review
Limitations
- The models were pre-trained exclusively on mRNA coding sequences and are not expected to perform well on other input types.
- Input sequences longer than 2,046 codons require truncation or windowing.
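For sequences over the context limit, a simple overlapping-window scheme is one option. A minimal sketch; the window and overlap sizes below are illustrative choices, not a documented inference procedure:

```python
MAX_CODONS = 2046  # Encodon maximum input length

def codon_windows(codons, window=MAX_CODONS, overlap=256):
    """Yield (start, window) pairs of codon tokens covering a sequence
    longer than the model's 2,046-codon context. Consecutive windows
    overlap by `overlap` codons so every position retains some context."""
    if len(codons) <= window:
        yield 0, codons
        return
    step = window - overlap
    for start in range(0, len(codons) - overlap, step):
        yield start, codons[start:start + window]
```

Per-position predictions from overlapping windows would then need to be reconciled (e.g., by preferring the window where the position is most central).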
Caveats and Recommendations
- Review and validate outputs generated by the model.
- We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when using the model.
Acknowledgements
Developed by NVIDIA Corporation.