CELL-Diff

Version v2.0.0 released 05 Aug 2025

Developed By

  • Dihan Zheng
  • Bo Huang

CELL-Diff is a latent diffusion model designed to generate detailed protein localization images from protein sequences (“sequence-to-image”) when given cell morphology images as conditional inputs (e.g., nucleus, ER, and microtubule images). Conversely, it can output protein sequences based on microscopy images depicting protein localization (“image-to-sequence”). The model is meant to investigate subcellular protein localization and protein interactions based on fluorescent microscopy images. Potential model applications include protein localization signal prediction, virtual staining for simultaneous localization and visualization of multiple proteins, and protein localization signal generation.

Model Details

Model Architecture

CELL-Diff is a latent diffusion model that unifies continuous diffusion for microscopy image generation with discrete diffusion for protein sequence design. Images are first encoded and downsampled into a latent space using a Variational Auto-Encoder (VAE).

The unified diffusion model operates on four inputs: the protein sequence, the protein image, the cell image, and the diffusion time step. The protein and cell images are concatenated and passed through a U-Net architecture, where a series of downsampling blocks produces image embeddings. Protein sequences are embedded using a pre-trained ESM2 model with frozen parameters. The resulting protein and image embeddings are concatenated and fed into a bidirectional transformer of 8 layers with 8-head self-attention. The transformer output is then split into image and sequence feature tensors: the image features are upsampled and combined with the corresponding downsampling features to predict the noise added to the protein image, while the sequence features pass through a linear projection layer to recover masked amino acids.

The U-Net architecture includes both residual and attention blocks. To better integrate protein sequence information within the image modeling pipeline, CELL-Diff incorporates cross-attention mechanisms adapted from Stable Diffusion. To convert image feature maps into token sequences, CELL-Diff uses the "patchify" operation with a patch size of 1 (see Peebles and Xie 2023 for the adaptive layer norm zero conditioning method and the patchify operation).
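With a patch size of 1, the "patchify" operation simply turns every spatial location of a feature map into one token whose length equals the channel count. A minimal pure-Python sketch of the operation (an illustration only, not the model's actual tensor implementation):

```python
def patchify(feature_map, patch_size=1):
    """Split a C x H x W feature map (nested lists) into a sequence of
    (H/p) * (W/p) tokens, each of length C * p * p.
    With patch_size=1, every spatial location becomes one token of length C."""
    C = len(feature_map)
    H = len(feature_map[0])
    W = len(feature_map[0][0])
    assert H % patch_size == 0 and W % patch_size == 0
    tokens = []
    for i in range(0, H, patch_size):
        for j in range(0, W, patch_size):
            # Gather all channels (and sub-pixels, if patch_size > 1)
            # covered by this patch into a single token vector.
            token = [feature_map[c][i + di][j + dj]
                     for c in range(C)
                     for di in range(patch_size)
                     for dj in range(patch_size)]
            tokens.append(token)
    return tokens
```

For a 2-channel 2x2 feature map this yields 4 tokens of length 2, which is what lets the transformer treat image features and sequence embeddings as one token stream.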

Parameters

1 billion

Citation

Zheng, D. and Huang, B. (2025). Bridging Protein Sequences and Microscopy Images with Unified Diffusion Models. Forty-second International Conference on Machine Learning. Available on OpenReview.

Model Card Authors

Dihan Zheng

Primary Contact Email

Dihan Zheng (Dihan.Zheng@ucsf.edu)

To submit feature requests or report issues with the model, please open an issue on the GitHub repository.

Intended Use

Primary Use Cases

  • Generation of subcellular protein localization images from protein sequence data
  • Generation of protein sequences from microscopy images depicting protein localization
  • Subcellular protein localization analysis
  • Protein localization signal prediction
  • Virtual staining of subcellular components and proteins
  • Generation of protein localization signals

Out-of-Scope or Unauthorized Use Cases

Do not use the model for the following purposes:

  • Use that violates applicable laws or regulations (including trade compliance laws)
  • Any illegal activities
  • Any use that is prohibited by the Acceptable Use Policy or MIT License

Training Details

Training methods are described in Zheng and Huang (2025).

Training Data

CELL-Diff was pre-trained with subcellular localization images of 12,833 proteins from the HPA dataset (HPA trained model) and fine-tuned with 1,311 fluorescently-tagged protein images from the OpenCell dataset (OpenCell trained model) (see Pretrained Models). The immunofluorescence microscopy protein images from the HPA dataset included the localization of specific proteins relative to stained cell morphology markers including the nucleus, ER, and microtubules. Protein sequences for imaged proteins were obtained from UniProt. In total, the HPA training dataset included 88,483 data points, each containing a protein sequence and corresponding nucleus, ER, and microtubule images. The OpenCell dataset included 6,301 data points, each containing a protein sequence and corresponding protein and nucleus images.

Preprocessing

Image datasets were downloaded from The Human Protein Atlas and OpenCell platforms, and protein sequences for the proteins in each dataset were retrieved from UniProt. Protein sequences exceeding 2,048 amino acids in length were filtered out. The dataset was converted into LMDB format for efficient training. The preprocessed dataset in LMDB format can be downloaded from AWS (see download command in GitHub).

Training procedure

CELL-Diff takes four inputs: the protein sequence, protein image, cell morphology image, and diffusion time step. Pre-training on the HPA dataset and fine-tuning on the OpenCell dataset were each run for 100,000 iterations using the Adam optimizer. The learning rate followed a linear warm-up strategy, increasing from 0 to 0.0001 over the first 1,000 iterations, followed by a linear decay to zero. The batch size was set to 64. Images from the HPA dataset were randomly cropped to 1024 x 1024 pixels and then resized to 256 x 256 pixels; OpenCell images were randomly cropped to 256 x 256 pixels. The sequence embedding dimension from ESM2 was set to 1280.
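The warm-up-then-decay learning rate schedule described above can be written as a small helper; this is a hypothetical sketch using the stated hyperparameters, not the actual training code:

```python
def learning_rate(step, peak_lr=1e-4, warmup_steps=1_000, total_steps=100_000):
    """Linear warm-up from 0 to peak_lr over the first warmup_steps iterations,
    then linear decay back to 0 at total_steps (values from the CELL-Diff setup)."""
    if step < warmup_steps:
        # Warm-up phase: ramp linearly from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: ramp linearly from peak_lr down to 0.
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)
```

At iteration 1,000 the rate peaks at 1e-4 and then falls linearly, reaching zero at iteration 100,000.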

CELL-Diff is trained with 200 diffusion steps using a cosine noise schedule, and sampling is accelerated with denoising diffusion implicit models (DDIM) using 100 steps (see Peebles and Xie 2023 for noise schedules and Song et al., 2021 for DDIM). The weighting coefficient (λ) in the loss was set to 1, and the maximum protein sequence length was 2,048.
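As an illustration, a cosine noise schedule and the 200-to-100-step DDIM timestep subsampling could be sketched as below. This uses the cosine formulation of Nichol and Dhariwal (2021) and even-stride subsampling, which are common choices but only assumptions about CELL-Diff's exact implementation:

```python
import math

def alpha_bar(t, T=200, s=0.008):
    """Cosine noise schedule: cumulative signal fraction remaining at step t,
    normalized so that alpha_bar(0) == 1 and alpha_bar(T) ~= 0."""
    f = math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    f0 = math.cos(s / (1 + s) * math.pi / 2) ** 2
    return f / f0

def ddim_timesteps(train_steps=200, sample_steps=100):
    """Evenly strided subsequence of the training timesteps, halving the
    number of network evaluations needed at sampling time (200 -> 100)."""
    stride = train_steps // sample_steps
    return list(range(0, train_steps, stride))
```

Because DDIM sampling is deterministic given the initial noise, visiting only every second timestep roughly halves generation time with little loss in image quality.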

Training Code

Pre-training and fine-tuning scripts are available in the GitHub repository.

Data Sources

The model was trained on the following datasets:

  • The Human Protein Atlas (HPA)
  • OpenCell

The preprocessed dataset can be downloaded from AWS (see download command in GitHub).

Performance Metrics

Metrics

The CELL-Diff HPA pre-trained model was evaluated using a range of benchmarks to measure image generation performance relative to CELL-E2. Key metrics include:

  • Maximum Spatial Frequency (MSF) resolvability: This metric quantitatively compares CELL-Diff and CELL-E2 on their ability to resolve fine structural details in generated images. CELL-Diff-generated images achieved better MSF resolvability than CELL-E2 images.
  • Intersection over Union (IoU): This prediction accuracy metric measures mask similarity. For CELL-Diff, the authors applied median-value thresholding to the original protein images to generate binary masks, whereas for CELL-E2 they used the predicted thresholded images. CELL-Diff and CELL-E2 achieved comparable performance when using only the nucleus as the conditional cell image; however, accuracy improved when all available cell morphology images (nucleus, ER, and microtubules) from the HPA dataset were incorporated, which is possible only with CELL-Diff.
  • Fréchet Inception Distance (FID) score: This learning-based metric evaluates the similarity between real and predicted images. To compute the FID score, protein and nucleus images were concatenated as input; FID-T and FID-O scores were computed on thresholded and original protein images, respectively. Based on FID scores, CELL-Diff outperformed CELL-E2, and it accurately predicted protein images for protein sequences not present in the training dataset.
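The IoU computation with median-value thresholding described above can be sketched in a few lines of pure Python; this is an illustrative version, and the paper's exact evaluation preprocessing may differ:

```python
def median_threshold_mask(image):
    """Binarize a grayscale image (nested lists) at its median intensity,
    as done to turn CELL-Diff's original protein images into masks."""
    flat = sorted(v for row in image for v in row)
    median = flat[len(flat) // 2]
    return [[v > median for v in row] for row in image]

def iou(mask_a, mask_b):
    """Intersection over Union of two binary masks of equal shape."""
    inter = sum(a and b for ra, rb in zip(mask_a, mask_b)
                for a, b in zip(ra, rb))
    union = sum(a or b for ra, rb in zip(mask_a, mask_b)
                for a, b in zip(ra, rb))
    # Two empty masks are defined as perfectly overlapping.
    return inter / union if union else 1.0
```

IoU of 1.0 means the predicted and reference masks coincide exactly; the reported 0.635 indicates substantial but imperfect overlap.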

Evaluation Datasets

  • The evaluation was conducted using the HPA pre-trained model. The evaluation dataset included a subset of 100 proteins shared between the HPA and OpenCell datasets, with protein localization and cell morphology images from the HPA dataset. The command to download the preprocessed HPA training dataset from AWS, including train/test split, is available on GitHub.

Evaluation Results

  • Maximum Spatial Frequency (MSF) resolvability: 644
  • Intersection over Union (IoU): 0.635
  • Fréchet Inception Distance score on original images (FID-O): 45.6
  • Fréchet Inception Distance score on thresholded images (FID-T): 50.4

Evaluation Metrics URL

BoHuangLab/CELL-Diff

Biases, Risks, and Limitations

Potential Biases

  • The model may exhibit biases present in the training data, particularly from underrepresented tissues, cell types, or ethnicities, leading to skewed predictions.
  • Specific groups or conditions (e.g., rare diseases or minority populations) may be underrepresented in the dataset, impacting generalizability.

Risks

Areas of risk include but are not limited to:

  • Hallucinations
  • Incorrect prediction

Limitations

  • The model's performance may be limited by the size of the training set.

Caveats and Recommendations

  • We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when using our services.

Acknowledgements

This research is supported by the Chan Zuckerberg Biohub San Francisco Investigator program.