SubCell

Version v1.0 released 18 Nov 2024

Developed By
  • Ankit Gupta (Emma Lundberg Lab, Chan Zuckerberg Initiative AI Resident)

The SubCell models are Vision Transformer (ViT) models pretrained on single-cell images from the Human Protein Atlas (HPA) dataset, which contains protein expression and spatial distribution data for more than 13,000 genes across 37 cell lines. The models generate feature embeddings that encode protein localization patterns in immunofluorescence images and can be used in downstream tasks such as protein localization classification or morphology-based profiling of cells.

Model Details

Model Architecture

The ViT backbone comprises 12 layers with 12 attention heads, a hidden feature size of 768, and a patch size of 16. The models were trained in a self-supervised manner with different pretext tasks to learn a general representation of the cells. We introduced a protein-specific pretext task that encourages the models to learn protein-specific localization patterns in the cells. We experimented with different combinations of pretext tasks, including contrastive learning and combinations with a Masked Autoencoder (MAE) reconstruction loss, and present the two best-performing models: ViT-ProtS-Pool, trained with only the protein-specific loss, and MAE-CellS-ProtS-Pool, trained with the multi-task objective of reconstruction, cell-specific, and protein-specific tasks. We also trained models on different channel combinations to accommodate broader use cases.
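For reference, the configuration above corresponds to a standard ViT-B/16 backbone. The sketch below shows a minimal instantiation with timm; the input size (448) and channel count (4) follow the preprocessing and all-channels variant described in this card, but they are assumptions here, and the official implementation in the training repository may differ.

    import timm
    import torch

    # ViT-B/16: 12 layers, 12 attention heads, hidden size 768, patch size 16.
    # img_size=448 and in_chans=4 are taken from this card's description of the
    # all-channels models; they are illustrative, not the released configuration.
    backbone = timm.create_model(
        "vit_base_patch16_224",
        pretrained=False,
        img_size=448,
        in_chans=4,
        num_classes=0,  # return the pooled embedding instead of classification logits
    )

    x = torch.randn(1, 4, 448, 448)   # (batch, channels, height, width)
    embedding = backbone(x)           # shape: (1, 768)
    print(embedding.shape)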

Accompanying each model above, we also trained ten multi-layer perceptron (MLP) classifiers on feature embeddings extracted from the frozen models, using field-of-view image-level localization annotations from HPA. These classifiers predict protein localizations from the feature embeddings and generate additional features, which were then used to benchmark the SubCell models against other existing models.
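As an illustration only, such a classifier can be a small multi-label MLP head on the 768-dimensional embeddings. The hidden width, dropout, and number of output classes below are assumptions (the class count reuses the 35 organelles/structures mentioned under Training Data) and do not describe the released classifier architecture.

    import torch
    import torch.nn as nn

    # Hypothetical multi-label MLP head on frozen SubCell embeddings.
    # 768 = ViT hidden size from this card; 35 = organelle/structure count
    # mentioned under Training Data. Hidden width and dropout are illustrative.
    classifier = nn.Sequential(
        nn.Linear(768, 512),
        nn.GELU(),
        nn.Dropout(0.1),
        nn.Linear(512, 35),
    )

    embeddings = torch.randn(8, 768)                      # batch of frozen embeddings
    targets = (torch.rand(8, 35) > 0.8).float()           # dummy multi-label targets
    logits = classifier(embeddings)                       # (8, 35)
    probabilities = torch.sigmoid(logits)                 # independent per-class probabilities
    loss = nn.BCEWithLogitsLoss()(logits, targets)        # standard multi-label loss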

Parameters

87.3M

Model Variants

Each model name is prefixed with the fluorescent channels used for training, which also indicates the channels expected when using the trained model to compute embeddings and other outputs. For example, all_channels_ models use all four fluorescent channels: protein, microtubule (MT), endoplasmic reticulum (ER), and DNA. Other channel combinations include ER-DNA-Protein_, DNA-Protein_, and MT-DNA-Protein_.

For each channel combination, the two best-performing models, ViT-ProtS-Pool and MAE-CellS-ProtS-Pool, are provided, alongside ten corresponding MLP classifiers for each model, each trained from a different random seed. A sketch of the prefix-to-channel convention follows below.
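The sketch below restates the naming convention as a lookup table. The channel names come from this card, but the exact channel ordering each checkpoint expects is an assumption and should be verified against the training repository.

    # Hypothetical mapping from model-name prefix to the fluorescent channels
    # the checkpoint expects as input. The ordering within each list is illustrative.
    PREFIX_TO_CHANNELS = {
        "all_channels_":   ["MT", "ER", "DNA", "Protein"],
        "ER-DNA-Protein_": ["ER", "DNA", "Protein"],
        "DNA-Protein_":    ["DNA", "Protein"],
        "MT-DNA-Protein_": ["MT", "DNA", "Protein"],
    }

    def expected_channels(model_filename: str) -> list[str]:
        """Return the channel list implied by a SubCell checkpoint file name."""
        for prefix, channels in PREFIX_TO_CHANNELS.items():
            if model_filename.startswith(prefix):
                return channels
        raise ValueError(f"Unknown model prefix in {model_filename!r}")

    print(expected_channels("ER-DNA-Protein_ViT-ProtS-Pool.pth"))  # ['ER', 'DNA', 'Protein']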

Models can be found in s3://czi-subcell-public/models/ with the following file structure:

all_channels_MAE-CellS-ProtS-Pool.pth
all_channels_MAE_MLP_classifier/classifier_seed_0.pth (10 files in this folder)
all_channels_ViT-ProtS-Pool.pth
all_channels_ViT_MLP_classifier/classifier_seed_0.pth (10 files in this folder)
ER-DNA-Protein_MAE-CellS-ProtS-Pool.pth 
ER-DNA-Protein_MAE_MLP_classifier/classifier_seed_0.pth (10 files in this folder)
ER-DNA-Protein_ViT-ProtS-Pool.pth
ER-DNA-Protein_ViT_MLP_classifier/classifier_seed_0.pth (10 files in this folder)
...
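A minimal download sketch using boto3 is shown below. It assumes the bucket allows anonymous (unsigned) reads and that the checkpoints sit directly under the models/ prefix, as in the listing above; neither assumption is stated explicitly in this card.

    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    # Anonymous S3 client; assumes czi-subcell-public permits unsigned access.
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

    # List the available checkpoints under models/.
    response = s3.list_objects_v2(Bucket="czi-subcell-public", Prefix="models/")
    for obj in response.get("Contents", []):
        print(obj["Key"])

    # Download one of the checkpoints named in the listing above.
    s3.download_file(
        "czi-subcell-public",
        "models/all_channels_ViT-ProtS-Pool.pth",
        "all_channels_ViT-ProtS-Pool.pth",
    )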

Citation

SubCell: Vision foundation models for microscopy capture single-cell biology. Ankit Gupta, Zoe Wefers, Konstantin Kahnert, Jan N Hansen, William D. Leineweber, Anthony Cesnik, Dan Lu, Ulrika Axelsson, Frederic Ballllosera Navarro, Theofanis Karaletsos, Emma Lundberg. bioRxiv 2024.12.06.627299; doi: https://doi.org/10.1101/2024.12.06.627299.

Model Card Author

Ankit Gupta (Emma Lundberg Lab, Chan Zuckerberg Initiative AI Resident)

Model Card Contact

virtualcellmodels@chanzuckerberg.com

Intended Use

Primary Use Cases

  • Protein localization analysis.
  • Cell morphology profiling.
  • Feature extraction for single-cell representations in fluorescence microscopy images.

Out-of-Scope or Unauthorized Use Cases

Do not use the model for the following purposes:

Intended Users

  • Researchers & Scientists.

Training Details

Training Data

The immunofluorescence data containing four channels (marker channels: red/microtubules, yellow/ER, blue/nuclei; protein channel: green/protein of interest) from the HPA Subcellular section data, v23 (https://v23.proteinatlas.org/humanproteome/subcellular), was used to train the models. Using the marker channels, single-cell masks were generated with the HPACellSegmentator model (https://github.com/CellProfiling/HPA-Cell-Segmentation). The dataset contained 1.1 million cell images with annotations for 35 organelles and subcellular structures.

Preprocessing

Single-cell crops of size 896x896 were extracted and downsampled to 448x448 for training. The intensity of each crop was normalized between 0 and 1, considering all four channels.
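A minimal sketch of this preprocessing is shown below. It assumes per-crop min-max normalization computed jointly over all channels and bilinear downsampling; both are interpretations of the description above rather than the exact training code.

    import torch
    import torch.nn.functional as F

    def preprocess_crop(crop: torch.Tensor) -> torch.Tensor:
        """Scale a single-cell crop to [0, 1] and downsample it.

        crop: float tensor of shape (channels, 896, 896).
        Returns a tensor of shape (channels, 448, 448).
        """
        # Min-max normalization over all channels jointly (assumption).
        crop = (crop - crop.min()) / (crop.max() - crop.min() + 1e-8)
        # Bilinear downsampling from 896x896 to 448x448 (interpolation mode assumed).
        crop = F.interpolate(
            crop.unsqueeze(0), size=(448, 448), mode="bilinear", align_corners=False
        )
        return crop.squeeze(0)

    example = torch.rand(4, 896, 896)       # four-channel single-cell crop
    print(preprocess_crop(example).shape)   # torch.Size([4, 448, 448])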

Training Procedure

The models were trained on these images in a self-supervised manner, using pretext tasks designed to improve the learning of protein localization patterns. Model training scripts are available at https://github.com/CellProfiling/subcell-embed.

Training Hyperparameters

fp32

Data Sources

Human Protein Atlas Subcellular Section v23

Performance Metrics

Metrics

The models were evaluated on various datasets to measure the quality of protein localization and cellular morphology.

  • Protein localization task
    • Protein localization performance was evaluated using micro and macro average precision on the HPA v23 and Kaggle challenge test sets. We also used label ranking average precision and coverage error to measure performance on multi-localizing proteins.
    • We also performed unsupervised clustering on the single-cell embeddings from the OpenCell dataset and evaluated the clusters against the original OpenCell annotations using the adjusted Rand index, V-measure score, and Fowlkes-Mallows score.
  • Cell morphology profiling
    • For the HPA datasets, we assessed the cell morphology performance of the models on cell line prediction using macro and micro average precision.
    • We also evaluated morphological profiling on the JUMP1 dataset using replicate retrieval and mechanism-of-action (MoA) identification, with mean average precision and nearest-neighbor (NN) accuracy as the metrics (see the metric sketch after this list).
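Most of these metrics are available in scikit-learn. The snippet below is a minimal sketch of how they can be computed on multi-label predictions and cluster assignments; it is not the evaluation code used for the reported results.

    import numpy as np
    from sklearn.metrics import (
        average_precision_score, label_ranking_average_precision_score,
        coverage_error, adjusted_rand_score, v_measure_score, fowlkes_mallows_score,
    )

    # Multi-label localization metrics on predicted class scores (toy data).
    y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
    y_score = np.array([[0.9, 0.2, 0.7], [0.1, 0.8, 0.3], [0.6, 0.7, 0.2]])
    print(average_precision_score(y_true, y_score, average="micro"))
    print(average_precision_score(y_true, y_score, average="macro"))
    print(label_ranking_average_precision_score(y_true, y_score))
    print(coverage_error(y_true, y_score))

    # Clustering metrics against reference annotations (e.g., OpenCell labels).
    labels_true = [0, 0, 1, 1, 2]
    labels_pred = [0, 0, 1, 2, 2]
    print(adjusted_rand_score(labels_true, labels_pred))
    print(v_measure_score(labels_true, labels_pred))
    print(fowlkes_mallows_score(labels_true, labels_pred))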

Evaluation Datasets

The evaluation of the models was performed on the following datasets:

  • HPA v23 test set
  • HPA Kaggle challenge test set
  • OpenCell
  • JUMP1

Evaluation Results

Performance metrics table and graphics.

Bias, Risks, and Limitations

Caveats and Recommendations

Acknowledgements

Training and evaluation of the SubCell models were enabled by supercomputing resources provided by the Chan Zuckerberg Initiative and the National Supercomputer Centre at Linköping University. Funding for the research was provided by the Chan Zuckerberg Initiative and the Knut and Alice Wallenberg Foundation.