ESM3
Version v1.0.0 released 25 Jun 2024
License
ESM Cambrian
Repository
https://github.com/evolutionaryscale/esm

The ESM3 model family is a set of generative masked language models for proteins that can simultaneously reason across sequence, structure, and function.
Developed By
EvolutionaryScale
Model Details
Model Architecture
ESM3 is a multimodal masked generative language model that represents protein sequence, structure, and function as discrete token tracks that are embedded and processed jointly by a bidirectional transformer trunk. Users can prompt the model with partial sequence, structure, and function information, then iteratively sample masked positions until all positions are unmasked.
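The iterative unmasking procedure described above can be sketched in pure Python. This is a toy illustration, not the ESM3 API: `toy_predict` is a hypothetical stand-in for the model's forward pass, and the confidence-ordered unmasking is one common decoding strategy for masked generative models.

```python
import random

MASK = "<mask>"
VOCAB = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def toy_predict(tokens):
    """Stand-in for a model forward pass: propose a (token, confidence)
    pair for every masked position. A real model would return logits."""
    return {i: (random.choice(VOCAB), random.random())
            for i, t in enumerate(tokens) if t == MASK}

def iterative_decode(tokens, positions_per_step=2):
    """Repeatedly fill in the highest-confidence masked positions until
    no masked positions remain; fixed prompt tokens are never touched."""
    tokens = list(tokens)
    while MASK in tokens:
        proposals = toy_predict(tokens)
        ranked = sorted(proposals.items(), key=lambda kv: -kv[1][1])
        for pos, (tok, _conf) in ranked[:positions_per_step]:
            tokens[pos] = tok
    return "".join(tokens)

prompt = ["M", MASK, MASK, "K", MASK, "L"]  # a partial sequence prompt
print(iterative_decode(prompt))
```

The same loop applies to any token track: prompt positions stay fixed while masked positions are resolved over several sampling steps.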
Transformer trunk:
We use a transformer architecture with the following improvements:
- We use Pre-LN (pre-layer normalization), rotary position embeddings, and SwiGLU non-linearities.
- The first transformer block includes an SE(3)-invariant Geometric Attention layer which conditions on backbone atomic coordinates and encodes geometric relationships.
Input tokenization:
- Sequence tokens
- Structure tokens, tokenized by a VQ-VAE
- 8-class secondary structure (SS8)
- Solvent accessible surface area (SASA), discretized into 16 bins
- Function tokens: 8 tokens per residue representing functional keywords compressed via TF-IDF then Locality Sensitive Hashing (LSH)
- Residue-level InterPro annotations represented as multi-hot feature vectors (pruned vocabulary)
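As an example of how a continuous track becomes a token track, SASA discretization into 16 bins can be sketched as below. The bin edges here are evenly spaced and purely illustrative; the model's actual boundaries are derived from data statistics.

```python
import bisect

# Illustrative SASA bin edges in Å²: 15 edges define 16 bins (0-15).
# These evenly spaced values are an assumption, not the model's edges.
SASA_EDGES = [float(b) for b in range(10, 160, 10)]

def sasa_to_token(sasa: float) -> int:
    """Map a per-residue solvent-accessible surface area value to one
    of 16 discrete bin indices via binary search over the edges."""
    return bisect.bisect_right(SASA_EDGES, sasa)

print([sasa_to_token(s) for s in (0.0, 25.0, 200.0)])  # → [0, 2, 15]
```

SS8 tokens are simpler still (a fixed 8-symbol vocabulary), and the per-residue tracks are then embedded and summed into the trunk's input.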
Structure tokenization and decoding:
A VQ-VAE encodes local structural neighborhoods (16 nearest neighbors) into structure tokens (codebook size: 4096). An all-atom structure decoder (700M parameters) reconstructs atomic coordinates from structure tokens.
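The quantization step at the heart of the VQ-VAE bottleneck can be sketched as a nearest-neighbor lookup against the codebook. This is a minimal sketch: the encoder and decoder networks are omitted, and the latent dimension of 128 is an illustrative assumption (only the codebook size of 4096 comes from the text).

```python
import numpy as np

rng = np.random.default_rng(0)
# 4096 codes as stated in the text; latent dim 128 is illustrative.
codebook = rng.normal(size=(4096, 128))

def quantize(latents: np.ndarray) -> np.ndarray:
    """Map each per-residue latent vector to the index of its nearest
    codebook entry (squared Euclidean distance); the index is the
    discrete structure token."""
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

latents = rng.normal(size=(5, 128))  # encoder outputs for 5 residues
tokens = quantize(latents)
print(tokens.shape)  # one token per residue
```

At decode time, the 700M-parameter decoder maps these token indices (via their codebook embeddings) back to all-atom coordinates.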
Parameters
ESM3 was trained at multiple scales:
| Model | Parameters | Layers | Training FLOPs |
|---|---|---|---|
| ESM3-small | 1.4B | 48 | 10^21 |
| ESM3-medium | 7B | 96 | 10^22 |
| ESM3-large | 98B | 216 | 10^24 |
Model Variants
| Model Variant | Description | URL |
|---|---|---|
| ESM3 Open | Smallest variant, publicly released. Not trained on viral sequences. | https://forge.biohub.ai/ |
| ESM3 Small (Paper) | Smaller variant available through API | https://forge.biohub.ai/ |
| ESM3 Small (Overtrained) | Smaller variant available through API. Overtrained for better performance. | https://forge.biohub.ai/ |
| ESM3 Medium (Overtrained) | Medium-sized variant available through API. Overtrained for better performance. | https://forge.biohub.ai/ |
| ESM3 Large | Largest variant available through API | https://forge.biohub.ai/ |
Model Card Authors
Chetan Mishra and Neil Thomas (Biohub)
Citation
Hayes, T. et al. (2025). Simulating 500 million years of evolution with a language model. Science. DOI: 10.1126/science.ads0018.
Primary Contact Email
esm@biohub.org

To submit feature requests or report issues with the model, please open an issue on the GitHub repository.
System Requirements
- Compute Requirements: GPU
- PyTorch environment with GPU support recommended.
Intended Use
Primary Use Cases
- Controllable protein generation and design: Prompt-driven generation of sequences and structures conditioned on motifs, partial coordinates, secondary structure constraints, or function keywords. For example:
- Motif scaffolding: Scaffolding of active sites and preservation of geometric constraints.
- Designing novel fluorescent proteins: Diversifying a protein family while maintaining chromophore formation.
- Structure prediction: Predicting 3D protein structures from sequences.
- Inverse folding: Designing a protein sequence that will fold into a given target 3D structure.
- Representation learning: Embeddings derived from the multimodal trunk are useful for downstream supervised tasks.
- Drug discovery: Identifying and optimizing protein targets for therapeutic applications.
- Synthetic biology: Engineering proteins for industrial applications, environmental remediation, and biotechnology.
- Basic research: Understanding protein evolution, folding, and biological mechanisms.
Out-of-Scope or Unauthorized Use Cases
Do not use the model for the following purposes:
- Clinical diagnosis or treatment recommendations.
- Any use that violates applicable laws, regulations (including trade compliance laws), or third-party rights such as privacy or intellectual property rights.
- Any use that is prohibited by the model license.
- Any use that is prohibited by the Acceptable Use Policy.
Training Data
ESM3 was trained on a large collection of natural sequences from public sequence and structure databases, including metagenomic sequences. This dataset was further augmented with synthetic data derived from predicted structures and from sequences produced by inverse folding. Cluster-based sampling was incorporated to limit training-set redundancy.
| Dataset Name | Unique samples (millions) | Unique tokens (millions) |
|---|---|---|
| PDB | 0.2 | 55 |
| UniRef | 133 | 40,177 |
| OAS | 203 | 22,363 |
| Metagenomic sequences | | |
| MGnify | 406 | 65,780 |
| JGI | 2,039 | 265,070 |
| Synthetic data | | |
| AFDB | 68 | 20,510 |
| ESMAtlas | 168 | 38,674 |
| AFDB (inverse folded) | 111 | 33,300 |
| ESMAtlas (inverse folded) | 251 | 57,730 |
| By modality | | |
| Sequence | 3,143 | 484,441 |
| Structure | 236 | 177,710 |
| Annotation | 539 | 105,957 |
| Total unique training tokens | | 768,109 |
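The cluster-based sampling mentioned above can be illustrated as a two-stage draw: pick a cluster uniformly, then pick a member uniformly within it, so heavily redundant families are not oversampled. This is a hypothetical sketch of the idea, not the exact scheme used in training.

```python
import random

# Toy clusters: familyA is redundant (4 near-identical members),
# familyB is a singleton. Names are illustrative.
clusters = {
    "familyA": ["seqA1", "seqA2", "seqA3", "seqA4"],
    "familyB": ["seqB1"],
}

def sample_example(rng: random.Random) -> str:
    """Uniform over clusters, then uniform within the chosen cluster,
    so each cluster contributes equally regardless of its size."""
    cluster = rng.choice(sorted(clusters))
    return rng.choice(clusters[cluster])

rng = random.Random(0)
draws = [sample_example(rng) for _ in range(1000)]
frac_b = sum(d == "seqB1" for d in draws) / len(draws)
print(round(frac_b, 2))  # ≈ 0.5, though familyB holds only 1 of 5 sequences
```

Under naive uniform sampling over sequences, `seqB1` would appear in only ~20% of draws; cluster-based sampling flattens that to ~50% here, limiting redundancy in what the model sees.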
Training Procedure
ESM3 is trained as a generative masked language model across the sequence, structure-token, SS8, SASA, function-token, and residue-annotation tracks. A variable masking (noise) schedule ensures the model learns to predict tokens at many masking rates and under many conditioning modalities, enabling flexible decoding and iterative sampling. The noise schedule is chosen to balance generative capability with representation learning.
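The variable-masking objective can be sketched as follows. The uniform mask-rate schedule here is an illustrative assumption (the actual schedule is tuned, as noted above); the point is that each training example sees a different, randomly drawn mask rate.

```python
import random

MASK = "<mask>"

def apply_noise(tokens, rng):
    """Sample a mask rate from a broad schedule (here uniform on [0, 1]),
    then mask each position independently at that rate. Training across
    many mask rates is what lets the model decode from any amount of
    context. Returns the noised tokens and the positions to predict."""
    rate = rng.random()  # illustrative schedule
    noised = [MASK if rng.random() < rate else t for t in tokens]
    targets = {i: t for i, (t, n) in enumerate(zip(tokens, noised))
               if n == MASK}
    return noised, targets  # the loss is computed only on `targets`

rng = random.Random(0)
noised, targets = apply_noise(list("MKTAYIAKQR"), rng)
print(noised, targets)
```

At a high mask rate the objective resembles unconditional generation; at a low rate it resembles representation learning over nearly complete inputs.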
Training Code
Code and weights for ESM3-open are available at: https://github.com/evolutionaryscale/esm
Speeds, Sizes, Times
ESM3 was trained with 1.07×10^24 FLOPs (over 1 trillion teraflops), representing more compute than any other known biological model at the time of release. Training used NVIDIA H100 Tensor Core GPUs. The model uses approximately 25× more FLOPs and 60× more data than its predecessor, ESM2.
Performance Metrics
ESM3 was evaluated on a variety of tasks with associated metrics measuring sequence understanding, prompt responsiveness, single sequence structure prediction and alignment.
Metrics
- Per-track negative log-likelihood (NLL) averaged across mask rates.
- Structure prediction: Local distance difference test (LDDT) to ground truth structures.
- Generative quality: ESMFold pTM > 0.8 was used as a threshold to determine generated sequence quality.
- Prompt consistency: We measure the consistency between the prompt and the generation using constrained site RMSD (cRMSD), SS3 accuracy, SASA Spearman rho, and keyword recovery.
- Motif scaffolding: Pass@K measures the model's ability to produce, within K generations, a design that passes both the generative quality and prompt consistency metrics.
- Generative diversity: Pairwise sequence identity and TM-score to the training data.
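Pass@K as described above can be computed with the standard unbiased estimator from the code-generation literature, given n generated designs of which c pass both the quality and consistency filters. An illustrative implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the probability that at least one of k
    randomly drawn designs passes, given c of n total designs passed."""
    if n - c < k:
        # Fewer than k failures exist, so every size-k subset
        # must contain at least one passing design.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(128, 4, 128))        # → 1.0 (any pass implies Pass@128 = 1)
print(round(pass_at_k(100, 1, 10), 3))  # → 0.1
```

Computing the complement over failure subsets, rather than the naive `c / n` success rate raised to a power, avoids bias when k is close to n.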
Evaluation Datasets
- CAMEO test set: 902 proteins whose structures are temporally held out from the training set. These targets were released between May 1, 2020 and Aug 1, 2023; the cutoff for PDB training data was May 1, 2020.
- Tertiary motif scaffolding benchmarks: The CAMEO test set was filtered for proteins of length < 1024, and prompts were constructed from the sequence, structure, and function tracks.
Evaluation Results
- Per-track negative log-likelihood (NLL) averaged across mask rates shows that ESM3 performs well on sequence, structure, and function recovery. In addition, ESM3 is responsive to conditioning, i.e., providing a single track reduces perplexity on the other tracks. Both performance and responsiveness improve with scale.
- Structure prediction: ESM3 98B surpasses ESMFold on single-sequence structure prediction on the CAMEO test set (mean LDDT 0.880 vs 0.861).
- Prompt consistency: Across all tracks, the 7B-parameter ESM3 finds solutions that follow the prompt and have structures that are confidently predicted by ESMFold (pTM > 0.8).
- Generative quality & diversity: Unconditional generations from ESM3 98B have mean pLDDT of 0.84 and pTM of 0.52. They exhibit diverse coverage of sequence space, with low similarity to their nearest training neighbors (mean pairwise sequence identity 0.155; mean pairwise TM-score 0.48).
- Motif Scaffolding with Alignment: Preference-based alignment substantially increases Pass@128 success rates, which further improve with scale.
Experimental validation - esmGFP
ESM3 was used to generate a novel GFP (esmGFP) which had 36% identity to avGFP and 58% identity to the closest known fluorescent protein (tagRFP). esmGFP exhibited brightness after maturation, and spectral properties similar to EGFP. This demonstrates ESM3's ability to generate functional proteins far from natural sequences.
(See the ESM3 paper for full validation details.)
Biases, Risks, and Limitations
Potential Biases
- Dataset bias: Over- or under-representation of taxa, protein families, or ecological niches in public sequence and structure databases influences generalization and can bias outputs. This is partially mitigated by clustering-based, nonredundant sampling.
- Annotation bias: InterPro/GO coverage and curation quality can vary.
- Synthetic data bias: Synthetic folded structures and inverse-folding sequences augment diversity but can introduce distributional shifts relative to natural sequences.
Risks
- Biosafety: Novel sequence generation can lead to designs with hazardous properties.
- Hallucination / non-physical outputs: Model proposals may not be physically realizable; pLDDT/pTM are helpful but imperfect.
- Reliance on in-silico metrics: Computational metrics do not replace wet-lab validation.
Limitations
- Compute & reproducibility: Reproducing large-scale training (98B) is expensive and resource intensive; inference at this scale is also costly.
- Context Window: ESM3 has a context window limit of 2048 tokens.
- Generalization and overfitting: ESM3 can exhibit poor performance when modifying sequences on which it is highly confident.
Caveats and Recommendations
- Review and validate outputs generated by the model.
- We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when using the model.
- Should you have any security or privacy issues or questions related to the services, please reach out to our team at security@chanzuckerberg.com or privacy@chanzuckerberg.com, respectively.
Acknowledgements
Please refer to the ESM3 paper for acknowledgements.