CELL-Diff
Version v1.0.0 released 18 Oct 2024
CELL-Diff is a diffusion transformer (DiT) model designed to generate detailed protein localization images from protein sequences (“sequence-to-image”) when given cell morphology images as conditional inputs (e.g., nucleus, ER, and microtubule images). Conversely, it can generate protein sequences from microscopy images depicting protein localization (“image-to-sequence”). The model is intended for investigating subcellular protein localization and protein interactions from fluorescence microscopy images. Potential applications include protein localization signal prediction, virtual staining for simultaneous localization and visualization of multiple proteins, and protein localization signal generation.
Model Details
Model Architecture
CELL-Diff is a diffusion transformer (DiT) model that integrates a continuous diffusion model for generating microscopy images and a discrete diffusion model for designing protein sequences within a unified framework. The model implements a U-Net architecture comprising three downsampling modules, an encoder-only transformer that jointly processes image and protein embeddings, and three upsampling modules. The downsampling blocks embed microscopy images into a latent sequence through residual and attention blocks, whereas protein sequences are embedded using a pre-trained ESM2 model. The transformer module consists of 24 layers with 8-head attention. Each downsampling and upsampling module contains two residual blocks and two attention blocks, with channel sizes increasing from 64 to 512. The attention blocks use a cross-attention mechanism that integrates sequence information into the image-processing component. CELL-Diff incorporates the diffusion time step via the adaptive layer norm zero (adaLN-Zero) conditioning method. To convert images into token sequences for the transformer, CELL-Diff uses the "patchify" operation with a patch size of 8 (see Peebles and Xie 2023 for the adaLN-Zero conditioning method and the patchify operation).
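For reference, a minimal PyTorch sketch of the patchify operation and adaLN-Zero conditioning described above (after Peebles and Xie 2023). Module and variable names are illustrative and not taken from the CELL-Diff codebase:

```python
# Sketch of "patchify" and adaLN-Zero conditioning (Peebles and Xie 2023).
import torch
import torch.nn as nn


def patchify(images: torch.Tensor, patch_size: int = 8) -> torch.Tensor:
    """Split (B, C, H, W) images into a sequence of flattened patches.

    Returns a (B, num_patches, C * patch_size**2) token sequence.
    """
    b, c, h, w = images.shape
    p = patch_size
    x = images.reshape(b, c, h // p, p, w // p, p)
    x = x.permute(0, 2, 4, 1, 3, 5)           # (B, H/p, W/p, C, p, p)
    return x.reshape(b, (h // p) * (w // p), c * p * p)


class AdaLNZeroBlock(nn.Module):
    """Transformer sublayer with adaLN-Zero time-step conditioning.

    The time-step embedding regresses per-block shift/scale/gate values;
    the gate is zero-initialized so each block starts as the identity.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ada = nn.Linear(dim, 3 * dim)
        nn.init.zeros_(self.ada.weight)       # the "zero" in adaLN-Zero
        nn.init.zeros_(self.ada.bias)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        shift, scale, gate = self.ada(t_emb).unsqueeze(1).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale) + shift
        attn_out, _ = self.attn(h, h, h)
        return x + gate * attn_out
```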
Parameters
486 million
Citation
Zheng, D. and Huang, B. (2024). CELL-Diff: Unified diffusion modeling for protein sequences and microscopy images. bioRxiv 2024.10.15.618585; doi:10.1101/2024.10.15.618585
Model Card Authors
Dihan Zheng
Model Card Contact
Dihan Zheng
Intended Use
Primary Use Cases
- Generation of subcellular protein localization images from protein sequence data
- Generation of protein sequences from microscopy images depicting protein localization
- Subcellular protein localization analysis
- Protein localization signal prediction
- Virtual staining of subcellular components and proteins
- Generation of protein localization signals
Out-of-Scope or Unauthorized Use Cases
Do not use the model for the following purposes:
- Use that violates applicable laws or regulations (including trade compliance laws)
- Any illegal activities
- Any use that is prohibited by the Acceptable Use Policy or MIT License
Training Details
Training methods are described in Zheng and Huang (2024).
Training Data
CELL-Diff was pre-trained with subcellular localization images of 12,833 proteins from the HPA dataset (HPA-trained model) and fine-tuned with images of 1,311 fluorescently tagged proteins from the OpenCell dataset (OpenCell-trained model) (see Pretrained Models). The immunofluorescence microscopy protein images from the HPA dataset show the localization of specific proteins relative to stained cell morphology markers (nucleus, ER, and microtubules). Protein sequences for imaged proteins were obtained from UniProt. In total, the HPA training dataset included 88,483 data points, each containing a protein sequence and corresponding nucleus, ER, and microtubule images. The OpenCell dataset included 6,301 data points, each containing a protein sequence and corresponding protein and nucleus images.
Preprocessing
Image datasets were downloaded from The Human Protein Atlas and OpenCell platforms. Protein sequences for proteins in each dataset were retrieved from UniProt. Protein sequences exceeding 2048 amino acids in length were filtered out. The dataset was converted into an LMDB format for efficient training.
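A minimal sketch of the length filter and LMDB conversion described above, assuming a simple pickled-record layout (the key scheme and record format are illustrative, not the exact CELL-Diff pipeline):

```python
# Filter over-long sequences and write records to LMDB for fast training I/O.
import lmdb
import pickle

MAX_SEQ_LEN = 2048  # sequences longer than this were filtered out


def build_lmdb(records, lmdb_path: str) -> None:
    """Write (protein sequence, image) records to an LMDB file.

    `records` is an iterable of dicts such as
    {"sequence": "MKT...", "images": {"nucleus": "...", "er": "..."}}.
    """
    env = lmdb.open(lmdb_path, map_size=1 << 40)  # up to 1 TiB address space
    with env.begin(write=True) as txn:
        idx = 0
        for rec in records:
            if len(rec["sequence"]) > MAX_SEQ_LEN:
                continue  # drop sequences exceeding 2048 amino acids
            txn.put(str(idx).encode(), pickle.dumps(rec))
            idx += 1
        txn.put(b"__len__", pickle.dumps(idx))
    env.close()
```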
Training procedure
CELL-Diff requires four inputs: the protein sequence, protein image, cell morphology image, and diffusion time step. Pre-training with the HPA dataset and fine-tuning with the OpenCell dataset were each run for 100,000 iterations using the Adam optimizer. The learning rate was initialized using a linear warm-up strategy, increasing from 0 to 0.0003 over the first 1,000 iterations, followed by a linear decay to zero. The batch size was set to 192. Images from the HPA dataset were randomly cropped to 1024 x 1024 pixels, then resized to 256 x 256 pixels. OpenCell images were randomly cropped to 256 x 256 pixels. The sequence embedding dimension from ESM2 was set to 640.
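The learning-rate schedule above is simple to reproduce; a minimal sketch (function and argument names are illustrative, not taken from the CELL-Diff codebase):

```python
# Linear warm-up from 0 to 3e-4 over the first 1,000 iterations,
# then linear decay to zero by the final iteration.
def lr_at(step: int, total_steps: int = 100_000, warmup: int = 1_000,
          peak_lr: float = 3e-4) -> float:
    if step < warmup:
        return peak_lr * step / warmup  # linear warm-up
    # linear decay from peak_lr at `warmup` to 0 at `total_steps`
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup))
```

In PyTorch, the same schedule can be plugged into `torch.optim.lr_scheduler.LambdaLR` by returning `lr_at(step) / peak_lr` as the multiplier.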
CELL-Diff is trained with 1,000 diffusion steps using a shifted cosine noise schedule; at inference, denoising diffusion implicit model (DDIM) sampling with 100 steps accelerates generation (see Hoogeboom et al., 2023 for the noise schedule and Song et al., 2021 for DDIM). The weighting coefficient (λ) for the loss was set to 100, and the maximum protein sequence length was 2,048.
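A minimal sketch of a shifted cosine log-SNR schedule and a deterministic DDIM update, following Hoogeboom et al. (2023) and Song et al. (2021); the shift factor and function names are assumptions, not CELL-Diff's exact values:

```python
# Shifted cosine noise schedule and a single deterministic DDIM step.
import math
import torch


def shifted_cosine_logsnr(t: torch.Tensor, shift: float = 1.0) -> torch.Tensor:
    """Log-SNR of the cosine schedule, shifted by 2*log(shift), for t in (0, 1].

    shift < 1 moves the schedule toward noisier states, which Hoogeboom et
    al. (2023) found useful at higher image resolutions.
    """
    return -2.0 * torch.log(torch.tan(math.pi * t / 2)) + 2.0 * math.log(shift)


def ddim_step(x_t, eps_pred, logsnr_t, logsnr_s):
    """Deterministic DDIM update from time t to an earlier time s < t."""
    alpha_t = torch.sigmoid(logsnr_t).sqrt()    # signal scale at t
    sigma_t = torch.sigmoid(-logsnr_t).sqrt()   # noise scale at t
    alpha_s = torch.sigmoid(logsnr_s).sqrt()
    sigma_s = torch.sigmoid(-logsnr_s).sqrt()
    x0_pred = (x_t - sigma_t * eps_pred) / alpha_t  # predicted clean image
    return alpha_s * x0_pred + sigma_s * eps_pred
```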
Data Sources
The model was trained on the following types of datasets:
- Human Protein Atlas Subcellular Section
- OpenCell Microscopy Images
Performance Metrics
Metrics
The CELL-Diff HPA pre-trained model was evaluated using a range of benchmarks to measure image generation performance relative to CELL-E2. Key metrics include:
- Maximum Spatial Frequency (MSF) resolvability: This metric quantitatively compared the CELL-Diff and CELL-E2 models on their ability to resolve fine structural details in generated images. Images generated by CELL-Diff achieved better MSF resolvability than CELL-E2 images.
- Intersection over Union (IoU): This prediction accuracy metric measures mask similarity (see the IoU sketch after this list). For CELL-Diff, the authors applied median-value thresholding to the original protein images to generate binary masks, whereas for CELL-E2 they used the predicted thresholded images. CELL-Diff and CELL-E2 achieved comparable performance when using only the nucleus as the conditional cell image. However, accuracy improved when incorporating all available cell morphology images (nucleus, ER, and microtubules) from the HPA dataset, an option available only for CELL-Diff.
- Fréchet Inception Distance (FID) score: This learning-based metric evaluated the similarity between real and predicted images. To compute the FID score, protein and nucleus images were concatenated as input. FID-T and FID-O scores were computed based on thresholded and original protein images, respectively. Based on FID scores, CELL-Diff outperformed CELL-E2. CELL-Diff also accurately predicted protein images for protein sequences not present in the training dataset.
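As referenced in the IoU item above, a minimal sketch of the IoU computation with median-value thresholding (the metric's standard definition; exact thresholding details in the paper may differ):

```python
# IoU between binary masks obtained by median-value thresholding.
import numpy as np


def median_threshold_mask(img: np.ndarray) -> np.ndarray:
    """Binarize an image at its median intensity."""
    return img > np.median(img)


def iou(pred: np.ndarray, target: np.ndarray) -> float:
    """Intersection over Union between two binary masks."""
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return float(inter) / float(union) if union > 0 else 1.0


# Example: compare a generated protein image against the real one.
# score = iou(median_threshold_mask(gen_img), median_threshold_mask(real_img))
```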
Evaluation Datasets
- The evaluation was conducted using the HPA pre-trained model. The evaluation dataset included a subset of 100 proteins shared between the HPA and OpenCell datasets, with protein localization and cell morphology images from the HPA dataset. See HPA testing set: test_lmdb_dataset.
Evaluation Results
- Maximum Spatial Frequency (MSF) resolvability: 642.03
- Intersection over Union (IoU): 0.6228
- Fréchet Inception Distance score of original image (FID-O): 23.35
- Fréchet Inception Distance score of thresholded image (FID-T): 33.56
Evaluation Metrics URL
BoHuangLab/CELL-Diff
Bias, Risks, and Limitations
Potential Biases
- The model may exhibit biases present in the training data, particularly from underrepresented tissues, cell types, or ethnicities, leading to skewed predictions.
- Specific groups or conditions (e.g., rare diseases or minority populations) may be underrepresented in the dataset, impacting generalizability.
Risks
Areas of risk include but are not limited to:
- Hallucinations
- Incorrect predictions
Limitations
- The model's performance may be limited by the size of the training set.
Caveats and Recommendations
- We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when using our services.
Acknowledgements
This research is supported by the Chan Zuckerberg Biohub San Francisco Investigator program.