Try Models

ESMC

Version v1.0.0 released 4 Dec 2024

ESM Cambrian (ESMC) is a family of next-generation language models trained on protein sequences at the scale of life on Earth. ESMC models define a new state of the art in protein representation learning.

Developed By

EvolutionaryScale

Model Details

Model Architecture

ESMC is based on the transformer architecture, with pre-layer normalization (Pre-LN), rotary position embeddings, and SwiGLU activations. No bias terms are used in the linear layers or layer norms.
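
As an illustration, the feed-forward half of a Pre-LN residual block with a SwiGLU activation and bias-free layers can be sketched in NumPy. The dimensions and random weights below are toy values, not ESMC's actual hyperparameters.

```python
import numpy as np

def silu(x):
    # SiLU (swish) activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def layer_norm(x, gamma, eps=1e-5):
    # Bias-free layer norm: scale only, no shift term.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps)

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU feed-forward with no bias terms: down-project the
    # elementwise product of a SiLU-gated branch and a linear branch.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

# Pre-LN residual block (feed-forward half only): x + FFN(LN(x)).
rng = np.random.default_rng(0)
d_model, d_hidden = 8, 16          # toy sizes, not ESMC's
x = rng.standard_normal((4, d_model))
gamma = np.ones(d_model)
w_gate = rng.standard_normal((d_model, d_hidden))
w_up = rng.standard_normal((d_model, d_hidden))
w_down = rng.standard_normal((d_hidden, d_model))
out = x + swiglu_ffn(layer_norm(x, gamma), w_gate, w_up, w_down)
print(out.shape)  # (4, 8)
```

Normalizing before the sublayer (Pre-LN) rather than after tends to stabilize training of deep transformers, which is why it appears here alongside the bias-free parameterization.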

Parameters

ESMC was trained at multiple scales:

Model       Parameters   Layers   Training FLOPs
ESMC-300M   300M         30       1x10^22
ESMC-600M   600M         36       2x10^22
ESMC-6B     6B           80       2x10^23

Model Variants

Model Variant   Description                            URL
ESMC 300M       Smallest variant, publicly released.   https://huggingface.co/EvolutionaryScale/esmc-300m-2024-12
ESMC 600M       Medium variant, publicly released.     https://huggingface.co/EvolutionaryScale/esmc-600m-2024-12
ESMC 6B         Large variant, available via API.      https://forge.biohub.ai/

Model Card Authors

Chetan Mishra and Neil Thomas (Biohub)

Citation

ESM Team. "ESM Cambrian: Revealing the mysteries of proteins with unsupervised learning." EvolutionaryScale Website, December 4, 2024. https://evolutionaryscale.ai/blog/esm-cambrian

Primary Contact Email

esm@biohub.org

To submit feature requests or report issues with the model, please open an issue on the GitHub repository.

System Requirements

  • Compute Requirements: GPU
  • PyTorch environment with GPU support recommended.

Intended Use

Primary Use Cases

  • Protein representation learning and embeddings: Creating representations that capture the full evolutionary scale and underlying biology of proteins, enabling their application in downstream machine learning tasks.
  • Transfer learning applications: Generating protein embeddings that can be fine-tuned for various downstream prediction tasks including functional annotation, mutational effect analysis, and the design of novel proteins and peptides.
  • Variant effect prediction: Predicting the functional impact of mutations and amino acid substitutions on protein function.
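
As a sketch of the variant-effect idea: mask the mutated position and compare the model's log-probability of the mutant residue against the wild type there (a "masked-marginal" score). The logits below are synthetic stand-ins for what a masked language model such as ESMC would return; the alphabet ordering and helper names are illustrative assumptions.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 canonical residues

def log_softmax(logits):
    z = logits - logits.max()
    return z - np.log(np.exp(z).sum())

def masked_marginal_score(logits_at_pos, wt, mut):
    # Score = log p(mutant) - log p(wild type) at the masked position.
    # Positive means the model assigns higher likelihood to the mutant.
    logp = log_softmax(logits_at_pos)
    return logp[AMINO_ACIDS.index(mut)] - logp[AMINO_ACIDS.index(wt)]

# Synthetic logits standing in for a masked-LM's output at one position:
# every residue equally likely except alanine, which gets a boost.
logits = np.zeros(len(AMINO_ACIDS))
logits[AMINO_ACIDS.index("A")] = 3.0
print(masked_marginal_score(logits, wt="V", mut="A"))  # → 3.0
```

In practice the logits would come from running the model on the sequence with the site of interest masked; the score is then compared across many candidate substitutions.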

Out-of-Scope or Unauthorized Use Cases

Do not use the model for the following purposes:

  • Clinical diagnosis or treatment recommendations.
  • Any use that violates applicable laws, regulations (including trade compliance laws), or third-party rights such as privacy or intellectual property rights.
  • Any use that is prohibited by the model license.
  • Any use that is prohibited by the Acceptable Use Policy.

Training Data

ESMC was trained on protein sequences from UniRef, MGnify, and the Joint Genome Institute (JGI). Sequence data was clustered at 70% sequence identity, resulting in 83M, 372M, and 2B clusters for UniRef, MGnify, and JGI, respectively.
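
The identity-threshold clustering above can be illustrated with a toy greedy scheme. Real pipelines use dedicated tools (e.g. MMseqs2-style clustering) and proper sequence alignment; the naive positional-match identity below is purely for illustration.

```python
def identity(a, b):
    # Naive percent identity: matching positions over the longer length.
    # (Real clustering uses alignment-based identity, not this.)
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def greedy_cluster(seqs, threshold=0.70):
    # Assign each sequence to the first representative it matches at
    # >= threshold identity; otherwise it founds a new cluster.
    reps, assignment = [], []
    for s in seqs:
        for i, r in enumerate(reps):
            if identity(s, r) >= threshold:
                assignment.append(i)
                break
        else:
            reps.append(s)
            assignment.append(len(reps) - 1)
    return reps, assignment

reps, labels = greedy_cluster(["ACDEF", "ACDEG", "WWWWW"])
print(reps, labels)  # → ['ACDEF', 'WWWWW'] [0, 0, 1]
```

Training on one representative per cluster (rather than every raw sequence) reduces redundancy and keeps over-sequenced families from dominating the data mix.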

Training Procedure

Training was conducted in two stages:

  • Stage 1: For the first 1 million steps, the model used a context length of 512, with metagenomic data constituting 64% of the training dataset.
  • Stage 2: In the final 500,000 steps, the context length was increased to 2048, and the proportion of metagenomic data was reduced to 37.5%.
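
The two-stage schedule can be written down as a small config sketch. The dict layout and key names are assumptions for illustration, not EvolutionaryScale's actual training configuration.

```python
# Illustrative schedule transcribed from the description above.
TRAINING_STAGES = [
    {"stage": 1, "steps": 1_000_000, "context_length": 512,
     "metagenomic_fraction": 0.64},
    {"stage": 2, "steps": 500_000, "context_length": 2048,
     "metagenomic_fraction": 0.375},
]

total_steps = sum(s["steps"] for s in TRAINING_STAGES)
print(total_steps)  # 1.5M steps in total
```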

Performance Metrics

Performance metrics are detailed on our blog announcing ESMC: https://www.evolutionaryscale.ai/blog/esm-cambrian.

Biases, Risks, and Limitations

Potential Biases

  • Dataset bias: Over- or under-representation of taxa, protein families, or ecological niches in public sequence and structure databases influences generalization and can bias outputs. This is partially mitigated by clustering-based, nonredundant sampling.

Risks

  • Biosafety: Novel sequence generation can lead to designs with hazardous properties.

Limitations

  • Context window: ESMC has a context window limit of 2048 tokens.
  • Reliance on in-silico metrics: Computational metrics do not replace wet-lab validation.
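
One common workaround for the context-window limit is to split long sequences into overlapping windows and process each separately. The window and overlap sizes below are illustrative defaults; note that any special tokens the model adds would also consume part of the 2048-token budget.

```python
def chunk_sequence(seq, max_len=2048, overlap=256):
    """Split seq into overlapping windows of at most max_len residues."""
    if len(seq) <= max_len:
        return [seq]
    step = max_len - overlap
    chunks = []
    for start in range(0, len(seq), step):
        chunks.append(seq[start:start + max_len])
        if start + max_len >= len(seq):
            break
    return chunks

windows = chunk_sequence("M" * 5000)
print([len(w) for w in windows])  # → [2048, 2048, 1416]
```

Per-residue outputs from overlapping windows can then be stitched back together, e.g. by averaging predictions in the overlap regions.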

Caveats and Recommendations

  • Review and validate outputs generated by the model.
  • We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when using the model.
  • Should you have any security or privacy issues or questions related to the services, please reach out to our team at security@chanzuckerberg.com or privacy@chanzuckerberg.com, respectively.