SubCell
Version v1.0 released 18 Nov 2024
- Ankit Gupta (Emma Lundberg Lab, Chan Zuckerberg Initiative AI Resident)
The SubCell models are Vision Transformer (ViT) models pretrained on single-cell images from the Human Protein Atlas (HPA) dataset, which captures the expression and spatial distribution of proteins encoded by more than 13,000 genes across 37 cell lines. The models generate feature embeddings that encode protein localization patterns in immunofluorescence images and can be used in downstream tasks such as protein localization classification or morphology-based cell profiling.
Model Details
Model Architecture
The ViT backbone comprises 12 layers with 12 attention heads, a hidden feature size of 768, and a patch size of 16. The models were trained in a self-supervised manner with different pretext tasks to learn a general representation of the cells. We introduced a protein-specific pretext task that encourages the models to learn protein-specific localization patterns in the cells. We experimented with different combinations of pretext tasks, including contrastive learning and combining them with a Masked Autoencoder (MAE) reconstruction loss, and present the two best-performing models: ViT-ProtS-Pool, trained with only the protein-specific loss, and MAE-CellS-ProtS-Pool, trained with the multi-task objective of reconstruction, cell-specific, and protein-specific tasks. We also trained models on different channel combinations to accommodate broader use cases.
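The backbone corresponds to a ViT-B/16 operating on 448x448 single-cell crops. As a rough illustration of how embeddings might be computed from a released checkpoint, the sketch below builds a ViT-B/16 with timm; the checkpoint layout, key names, and pooling head are assumptions here, so consult the training repository (https://github.com/CellProfiling/subcell-embed) for the authoritative model definition.

```python
# Hypothetical sketch only: build a ViT-B/16 backbone with timm and load a
# SubCell checkpoint. The checkpoint layout, key names, and pooling head are
# assumptions -- see https://github.com/CellProfiling/subcell-embed for the
# exact model definition.
import timm
import torch

backbone = timm.create_model(
    "vit_base_patch16_224",
    img_size=448,      # SubCell crops are 448x448
    in_chans=4,        # protein, microtubule, ER, and DNA channels
    num_classes=0,     # no classification head; return pooled features
)

state = torch.load("all_channels_MAE-CellS-ProtS-Pool.pth", map_location="cpu")
# The checkpoint may wrap the weights (e.g. under a "state_dict" key) and may
# include pooling-head parameters that do not match timm's naming.
backbone.load_state_dict(state, strict=False)
backbone.eval()

with torch.no_grad():
    crops = torch.rand(8, 4, 448, 448)    # batch of normalized single-cell crops
    embeddings = backbone(crops)          # (8, 768) feature embeddings
```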
Accompanying each model above, we also trained ten multi-layer perceptron (MLP) classifiers on the feature embeddings extracted from the frozen models, using the field-of-view (image-level) localization annotations in HPA. These classifiers can predict protein localizations from the feature embeddings and generate additional features, which were used to benchmark the SubCell models against other existing models.
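For orientation, such a classifier could look like the sketch below; the hidden width, dropout, and depth are illustrative assumptions rather than the released architecture, and the multi-label outputs are passed through a sigmoid at inference time.

```python
# Illustrative sketch of an MLP localization classifier on frozen 768-d SubCell
# embeddings. The hidden width, dropout, and depth are assumptions, not the
# released architecture; the output covers the multi-label localization classes.
import torch
import torch.nn as nn

class LocalizationMLP(nn.Module):
    def __init__(self, embed_dim: int = 768, num_classes: int = 35, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)   # raw logits; one score per localization class

classifier = LocalizationMLP().eval()
with torch.no_grad():
    probs = torch.sigmoid(classifier(torch.rand(8, 768)))   # multi-label probabilities
```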
Parameters
87.3M
Model Variants
The models are prefixed with the fluorescent channels used for training, which also indicates which channels are expected when using the trained model to compute embeddings and other outputs. For example, all_channels_ models used all four fluorescent channels: protein, microtubule (MT), endoplasmic reticulum (ER), and DNA. Other channel combinations include ER-DNA-Protein_, DNA-Protein_, and MT-DNA-Protein_.
For each channel combination, the two best-performing models, ViT-ProtS-Pool and MAE-CellS-ProtS-Pool, are provided, alongside their ten corresponding MLP classifiers, each starting from a different random seed.
Models can be found in s3://czi-subcell-public/models/ with the following file structure (a download sketch follows the listing):
all_channels_MAE-CellS-ProtS-Pool.pth
all_channels_MAE_MLP_classifier/classifier_seed_0.pth (10 files in this folder)
all_channels_ViT-ProtS-Pool.pth
all_channels_ViT_MLP_classifier/classifier_seed_0.pth (10 files in this folder)
ER-DNA-Protein_MAE-CellS-ProtS-Pool.pth
ER-DNA-Protein_MAE_MLP_classifier/classifier_seed_0.pth (10 files in this folder)
ER-DNA-Protein_ViT-ProtS-Pool.pth
ER-DNA-Protein_ViT_MLP_classifier/classifier_seed_0.pth (10 files in this folder)
...
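To fetch one of the listed checkpoints programmatically, something like the boto3 snippet below can be used, assuming the bucket permits anonymous (unsigned) access.

```python
# Sketch of fetching one of the listed checkpoints with boto3, assuming the
# bucket allows anonymous (unsigned) access.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
s3.download_file(
    Bucket="czi-subcell-public",
    Key="models/all_channels_ViT-ProtS-Pool.pth",
    Filename="all_channels_ViT-ProtS-Pool.pth",
)
```

The aws CLI equivalent would be `aws s3 cp s3://czi-subcell-public/models/all_channels_ViT-ProtS-Pool.pth . --no-sign-request`.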
Citation
SubCell: Vision foundation models for microscopy capture single-cell biology. Ankit Gupta, Zoe Wefers, Konstantin Kahnert, Jan N Hansen, William D. Leineweber, Anthony Cesnik, Dan Lu, Ulrika Axelsson, Frederic Ballllosera Navarro, Theofanis Karaletsos, Emma Lundberg. bioRxiv 2024.12.06.627299; doi: https://doi.org/10.1101/2024.12.06.627299.
Model Card Author
Ankit Gupta (Emma Lundberg Lab, Chan Zuckerberg Initiative AI Resident)
Model Card Contact
virtualcellmodels@chanzuckerberg.com
Intended Use
Primary Use Cases
- Protein localization analysis.
- Cell morphology profiling.
- Feature extraction for single-cell representations in fluorescence microscopy images.
Out-of-Scope or Unauthorized Use Cases
Do not use the model for the following purposes:
- Any use that is prohibited by the Acceptable Use Policy or MIT License
Intended Users
- Researchers & Scientists.
Training Details
Training Data
The models were trained on immunofluorescence data containing four channels (marker channels: red/microtubules, yellow/ER, blue/nuclei; protein channel: green/protein of interest) from the HPA Subcellular section data, v23 (https://v23.proteinatlas.org/humanproteome/subcellular). Using the marker channels, single-cell masks were generated with the HPACellSegmentator model (https://github.com/CellProfiling/HPA-Cell-Segmentation). The dataset contained 1.1 million cell images with annotations covering 35 organelles and subcellular structures.
Preprocessing
Single-cell crops of size 896x896 were extracted and downsampled to 448x448 for training. Crop intensities were normalized to the range 0-1 (considering all four channels).
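A minimal sketch of this preprocessing is shown below; the interpolation mode and the joint (all-channel) min-max scaling are assumptions based on the description above.

```python
# Minimal preprocessing sketch: downsample a 896x896 four-channel crop to
# 448x448 and scale intensities to [0, 1]. The interpolation mode and joint
# (all-channel) min-max scaling are assumptions.
import torch
import torch.nn.functional as F

def preprocess_crop(crop: torch.Tensor) -> torch.Tensor:
    """crop: float tensor of shape (4, 896, 896)."""
    crop = F.interpolate(crop.unsqueeze(0), size=(448, 448),
                         mode="bilinear", align_corners=False).squeeze(0)
    crop = crop - crop.min()                     # shift minimum to 0
    return crop / crop.max().clamp(min=1e-8)     # scale maximum to 1
```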
Training Procedure
The models were trained in a self-supervised manner with pretext tasks designed to improve the learning of protein localization patterns in the images. Model training scripts are available at https://github.com/CellProfiling/subcell-embed.
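As a conceptual illustration of a protein-specific pretext task (not the exact released objective), the sketch below shows a supervised-contrastive-style loss in which single-cell embeddings from images stained for the same protein are treated as positives; refer to the training repository for the actual loss formulation.

```python
# Conceptual illustration only (not the released objective): a protein-specific
# contrastive loss in the style of supervised contrastive learning, where cell
# embeddings from images stained for the same protein are treated as positives.
import torch
import torch.nn.functional as F

def protein_contrastive_loss(embeddings: torch.Tensor,
                             protein_ids: torch.Tensor,
                             temperature: float = 0.1) -> torch.Tensor:
    """embeddings: (N, D) features; protein_ids: (N,) integer protein labels."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / temperature                                  # (N, N) similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos = (protein_ids.unsqueeze(0) == protein_ids.unsqueeze(1)) & ~self_mask

    # log-probability of each non-self sample, then averaged over positive pairs
    log_prob = sim - torch.logsumexp(sim.masked_fill(self_mask, float("-inf")),
                                     dim=1, keepdim=True)
    loss = -(log_prob * pos).sum(dim=1) / pos.sum(dim=1).clamp(min=1)
    return loss[pos.any(dim=1)].mean()   # anchors with at least one positive
```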
Training Hyperparameters
fp32
Data Sources
Human Protein Atlas Subcellular Section v23
Performance Metrics
Metrics
The models were evaluated on various datasets to measure the quality of protein localization and cellular morphology.
- Protein Localization task
- The protein localization performance was evaluated using micro and macro average precision on the HPAv23 and Kaggle challenge test sets. We also used label ranking average precision and coverage error to measure performance on multi-localizing proteins.
- We also performed unsupervised clustering on the single-cell embeddings from the OpenCell dataset and compared the resulting clusters to the original annotations using the adjusted Rand index, V-measure score, and Fowlkes-Mallows score.
- Cell Morphology Profiling
- For the HPA datasets, we assessed the cell morphology performance of the models by evaluating them on cell line prediction using macro and micro average precision.
- We also evaluated the morphological profiling of the models on the JUMP1 dataset via replicate retrieval and mechanism-of-action (MoA) identification, using mean average precision and nearest-neighbor (NN) accuracy as the metrics (a sketch of the metric calls follows this list).
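Most of these metrics are available directly in scikit-learn. The snippet below is a hypothetical sketch with random placeholder data; y_true, y_score, clusters, and annotations are stand-ins for the actual evaluation arrays.

```python
# Sketch of the reported metrics using scikit-learn; all arrays here are
# random placeholders for the actual evaluation data.
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    label_ranking_average_precision_score,
    coverage_error,
    adjusted_rand_score,
    v_measure_score,
    fowlkes_mallows_score,
)

y_true = np.random.randint(0, 2, size=(100, 35))   # multi-label localization annotations
y_score = np.random.rand(100, 35)                  # predicted probabilities

micro_ap = average_precision_score(y_true, y_score, average="micro")
macro_ap = average_precision_score(y_true, y_score, average="macro")
lrap = label_ranking_average_precision_score(y_true, y_score)
cov_err = coverage_error(y_true, y_score)

clusters = np.random.randint(0, 30, size=500)      # unsupervised cluster assignments
annotations = np.random.randint(0, 30, size=500)   # reference annotations
ari = adjusted_rand_score(annotations, clusters)
v_measure = v_measure_score(annotations, clusters)
fm = fowlkes_mallows_score(annotations, clusters)
```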
Evaluation Datasets
The evaluation of the models was performed on the following datasets:
- HPAv23 test set.
- Kaggle test set. (https://www.kaggle.com/competitions/hpa-single-cell-image-classification)
- OpenCell dataset
- JUMP dataset (https://cellpainting-gallery.s3.amazonaws.com/index.html#cpg0000-jump-pilot/)
Evaluation Results
Bias, Risks, and Limitations
Caveats and Recommendations
- We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when engaging with our services.
- Should you have any security or privacy issues or questions related to the services, please reach out to our team at security@chanzuckerberg.com or privacy@chanzuckerberg.com respectively.
Acknowledgements
Training and evaluation of the SubCell models were enabled by the supercomputing resources provided by the Chan Zuckerberg Initiative and the National Supercomputer Centre at Linköping University. Funding for the research was provided by the Chan Zuckerberg Initiative and the Knut and Alice Wallenberg Foundation.