Evaluating SubCell and Related Imaging Models

Overview of benchmarking tasks and challenges for evaluating deep learning models for single-cell fluorescence microscopy images.

Dan Lu | March 27, 2025

We need to understand how learned representations capture biology

Cellular imaging data is a crucial component for building virtual cell models (1). Through microscopy, we observe the structure, organization, morphology, and dynamics of cells, which form the very foundation of how cells operate and live. Changes in these properties under various conditions or perturbations further tell us about the regulation of cellular processes and how they can be modified by external factors such as drugs.


Figure 1a from "How to Build the Virtual Cell with Artificial Intelligence: Priorities and Opportunities" (https://arxiv.org/abs/2409.11654): the AI Virtual Cell provides a universal representation of cell state that can be obtained across species and conditions and generated from different data modalities across scales (molecular, cellular, multicellular).


Machine learning (ML) models can facilitate large-scale data analysis for microscopy images and uncover information that is not obvious to the human eye. SubCell (2) belongs to a group of deep learning models that extract features from single-cell fluorescence microscopy images and represent them as high-dimensional embeddings. The embeddings capture the similarities and differences between single cells and are not constrained by or necessarily correlated with measurable or interpretable quantities human annotators often focus on, such as cell size or shape. It is, therefore, a critical step in model development to evaluate how well the embeddings capture useful information in the images and how well they perform in real biological use cases, both independently and in comparison to those generated by other methods.

Benchmarking is critical in evaluating the performance of a model, its applicability in biology and how it advances the field. Here, we provide a high-level overview of the benchmarking methods used by SubCell and similar models, suggest important considerations for interpreting benchmarking results, and call for community collaboration to improve benchmarking practices and utility.

Single-cell fluorescence microscopy models and datasets

SubCell (2) was benchmarked against several previously developed models, including cytoself from the Chan Zuckerberg Biohub (3) and DINO4Cells from the Broad Institute and collaborators (4). All of these models were developed to analyze single-cell fluorescence microscopy images. Cytoself was trained on the OpenCell dataset (5) and captured similarities in protein localization in the OpenCell and Allen Cell (6) datasets that matched prior knowledge of protein complexes and pathways. DINO4Cells has three model versions, each trained on a different dataset (the Human Protein Atlas subcellular section (7), the Allen Cell WTC-11 dataset (8), and a collection of Cell Painting datasets (4)) and used to uncover properties of that specific dataset. SubCell was developed to be even more generalizable across datasets: it was trained on the Human Protein Atlas (HPA) subcellular section (7) but can be applied to datasets it did not see during training, with varying experimental setups and image resolutions.

Datasets used to train and evaluate SubCell, DINO4Cells, and cytoself:

Human Protein Atlas (HPA) subcellular section (7)
  • Immunofluorescence microscopy images of single cells.
  • 4 channels: protein of interest, endoplasmic reticulum, microtubule, nucleus.
  • 13,147 proteins.
  • 37 human cell lines.
  • Training data for SubCell and DINO4Cells.

OpenCell (5)
  • Endogenous GFP tagging.
  • 2 channels: protein of interest, nucleus.
  • 1,310 proteins.
  • 1 cell line.
  • Training data for cytoself; evaluation data for SubCell.

Allen Cell WTC-11 hiPSC dataset (8)
  • Endogenous GFP tagging.
  • 3 channels: protein of interest, nucleus, cell membrane.
  • 25 proteins, each representing a cellular structure.
  • 1 parent cell line.
  • Training data for DINO4Cells; evaluation data for cytoself.

Cell Painting - LINCS, BBBC036, and a combined single-cell resource (4)
  • Immunofluorescence.
  • 5 channels: nucleus, RNA, actin/Golgi/plasma membrane, endoplasmic reticulum, mitochondria.
  • 2 cell lines.
  • Training data for DINO4Cells; evaluation data for SubCell.

Predicting metadata, clustering on known biology, and more

Benchmarking evaluates performance and demonstrates the utility of models. For models like SubCell, which aim to capture biological properties and reveal biological insights from the data, the most useful type of benchmarking is to determine whether the models can uncover biological knowledge or match annotations that were previously known but were not part of the training data. The benchmarking tasks used by SubCell, cytoself, and DINO4Cells can be grouped into three general categories:

Metadata label prediction: how effectively can we recover information about cells from the embedding space?

This benchmarking task evaluates how well the embeddings generated by a model capture metadata that indicates certain differences or similarities in the input images. Each data point in the embedding space, which often corresponds to a single cell or an aggregate of multiple cells, carries metadata with it. In the case of the HPA data (training data for SubCell and DINO4Cells), each cell comes from a particular cell line with a particular protein labeled for visualization. Each image, whether a full field-of-view image or a single-cell crop, also has protein localization annotations produced manually by human experts (7). In the case of the Allen Cell WTC-11 hiPSC dataset (training data for DINO4Cells), each cell has a protein localization label dictated by which protein structure was tagged with GFP, as well as annotations for cell cycle stage (8). A secondary, simple ML model is often trained to predict these metadata labels (the ground truth) from the embeddings, and metrics such as F1-score, accuracy, and precision can then be computed and compared across models. This converts the benchmark into a classification problem, for which performance is relatively straightforward to evaluate and summarize statistically. For examples, see

SubCell preprint Fig. 2, 4, 6.
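
To make the classification setup concrete, below is a minimal sketch of such an evaluation in Python using scikit-learn. The `embeddings` and `labels` arrays are random placeholders; in practice they would be the frozen single-cell embeddings from a model such as SubCell and the corresponding metadata annotations (e.g., protein localization classes).

```python
# Minimal sketch of a metadata-label-prediction benchmark: train a simple
# probe classifier on frozen embeddings and score it on held-out labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, precision_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 256))   # placeholder embeddings (n_cells x d)
labels = rng.integers(0, 5, size=1000)      # placeholder metadata classes

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.2, stratify=labels, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = clf.predict(X_test)

print("accuracy:       ", accuracy_score(y_test, pred))
print("macro F1:       ", f1_score(y_test, pred, average="macro"))
print("macro precision:", precision_score(y_test, pred, average="macro", zero_division=0))
```

The probe classifier is kept deliberately simple (a linear model) so that the scores reflect how much information is present in the embeddings rather than the capacity of the probe.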

Cell Clustering: to what extent does the embedding space reflect known biological relationships?

This benchmarking task evaluates how well the differences and similarities among cells captured by the embeddings correlate with biological knowledge or metadata. For example, do cells close to each other in the embedding space have similar protein localization, and do proteins in the same complex or pathway tend to sit close to each other in the embedding space? Dimensionality reduction techniques such as UMAP (9) are used in this step to facilitate visualization of the high-dimensional embeddings in 2D space. The UMAP results are often shown as a scatter plot colored by biological metadata, such as protein localization labels, protein complexes, or treatment labels, so that agreement between the embeddings and groupings based on biological knowledge or metadata (the ground truth) can be visually inspected. Additionally, statistics can be computed to measure how well clusters defined in the embedding space match the ground truth, for example whether proteins known to form a complex fall into the same cluster. This is an effective way to evaluate model outputs in a biological context and potentially generate hypotheses for new discoveries. For examples, see

SubCell preprint Fig. 4, 5, 7.
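
As a rough sketch of how such a clustering benchmark can be scored, the example below projects placeholder embeddings with UMAP (the umap-learn package) for visualization, clusters the embeddings with k-means, and compares the clusters against ground-truth labels. The arrays are synthetic stand-ins rather than real model outputs.

```python
# Minimal sketch of a clustering benchmark: UMAP for visualization,
# k-means for clustering, and agreement scores against ground-truth labels.
import numpy as np
import umap
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 256))   # placeholder embeddings
labels = rng.integers(0, 5, size=1000)      # e.g., protein localization classes

# 2D projection for the scatter plot (points would be colored by `labels`)
coords = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(embeddings)

# Cluster in the original embedding space and score agreement with the labels
clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(embeddings)
print("adjusted Rand index:", adjusted_rand_score(labels, clusters))
print("normalized mutual information:", normalized_mutual_info_score(labels, clusters))
```

Note that the agreement scores are computed on clusters from the full embedding space; UMAP is used only for the 2D scatter plot described above.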

Dataset or model-specific tasks

Some datasets and models have unique properties that allow additional evaluations of model performance. For example, the cell lines used in the HPA subcellular dataset have RNA-seq data available, so both SubCell and DINO4Cells could compare RNA-seq profiles with the embeddings learned from imaging data and show some level of similarity. Both models also use the vision transformer architecture (10), which produces attention maps revealing where in the cell the models are paying attention; some attention heads capture signals in the nuclei, while others capture signals in the cytosol. This provides confidence that the models focus on the cellular structures of interest rather than noise. Cytoself uses a discrete representation of the latent space, which can be clustered to show how it correlates with actual cellular structures. All of these analyses build further confidence in the models and connect their outputs to biologically meaningful features. For examples, see

SubCell preprint Fig. 3.
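
As a simplified illustration of the embedding-versus-RNA-seq comparison (not the exact analysis in the papers), one could correlate pairwise cell-line distances computed from averaged image embeddings with distances computed from RNA-seq profiles. The `cellline_embeddings` and `rnaseq_profiles` arrays below are hypothetical and assumed to be indexed by the same cell lines in the same order.

```python
# Illustrative sketch: compare cell-line similarity from image embeddings
# with similarity from RNA-seq profiles by correlating the two pairwise
# distance matrices (condensed form).
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_cell_lines = 30
cellline_embeddings = rng.normal(size=(n_cell_lines, 256))   # mean embedding per cell line
rnaseq_profiles = rng.normal(size=(n_cell_lines, 5000))      # expression profile per cell line

d_img = pdist(cellline_embeddings, metric="cosine")       # one value per cell-line pair
d_rna = pdist(rnaseq_profiles, metric="correlation")

rho, pval = spearmanr(d_img, d_rna)
print(f"Spearman rho between distance matrices: {rho:.3f} (p = {pval:.2g})")
```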

Considerations when interpreting benchmarking results

It is important to keep in mind certain limitations while interpreting the benchmarking results.

Unique strengths of models

Often, a model's unique strengths have no counterpart in other models and cannot easily be compared across them. For example, DINO4Cells has a general architecture that can be trained on different datasets to learn properties specific to each, while cytoself and SubCell were each trained on one dataset but can be used for inference on new datasets they have never seen before. If the purpose of benchmarking is to compare models side by side on the same tasks, what is unique about each model is difficult to evaluate through benchmarking.

A comprehensive view of a model can only be formed by taking into account all of the following:

  • the benchmarking results against other models, as detailed above;
  • the fact that each benchmark task evaluates only one aspect of a model, and the results from all benchmarks need to be considered collectively;
  • the unique strengths of the model that cannot be easily benchmarked against other models.

Compatibility of models with datasets

An ideal setup for benchmarking is to evaluate the models on the same input data and ground truth, so that the results are comparable and the comparison is fair. One challenge here is the diversity of imaging data, which can come with varying channel combinations and pixel resolutions, and as either 2D images or 3D stacks. Meanwhile, a model often expects a specific input configuration, so not all models can easily work with all datasets. For example, the DINO4Cells model trained on Allen Cell data is only used to analyze Allen Cell data, and cytoself, trained on OpenCell with protein and nucleus channels, cannot make use of the additional channels in Cell Painting data. Additionally, before feeding benchmarking data into a model, the data must be preprocessed separately and specifically for each model: running two datasets through two models requires four rounds of preprocessing, which is not a trivial amount of work given that imaging datasets are usually large.
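
The sketch below illustrates the kind of per-model preprocessing this implies, with channel selection, resizing, and normalization differing by model. The `MODEL_SPECS` entries, channel names, and input sizes are hypothetical and do not reflect the actual requirements of SubCell, DINO4Cells, or cytoself.

```python
# Hypothetical per-model preprocessing: each model expects a specific channel
# set, crop size, and normalization, so the same dataset must be prepared
# differently for each model.
import numpy as np

MODEL_SPECS = {
    "model_a": {"channels": ["protein", "nucleus"], "size": 100},
    "model_b": {"channels": ["protein", "nucleus", "er", "microtubule"], "size": 448},
}

def preprocess(image: np.ndarray, channel_names: list[str], model: str) -> np.ndarray:
    """Select channels, crudely resize, and per-channel normalize a single-cell crop."""
    spec = MODEL_SPECS[model]
    idx = [channel_names.index(c) for c in spec["channels"]]   # pick required channels
    img = image[idx]                                           # (C, H, W)
    # crude nearest-neighbor resample to the model's expected input size
    h, w = img.shape[1:]
    rows = np.linspace(0, h - 1, spec["size"]).astype(int)
    cols = np.linspace(0, w - 1, spec["size"]).astype(int)
    img = img[:, rows][:, :, cols]
    # per-channel min-max normalization to [0, 1]
    mn = img.min(axis=(1, 2), keepdims=True)
    mx = img.max(axis=(1, 2), keepdims=True)
    return (img - mn) / np.maximum(mx - mn, 1e-8)

crop = np.random.rand(4, 256, 256)  # placeholder 4-channel single-cell crop
x_a = preprocess(crop, ["protein", "nucleus", "er", "microtubule"], "model_a")
x_b = preprocess(crop, ["protein", "nucleus", "er", "microtubule"], "model_b")
print(x_a.shape, x_b.shape)  # (2, 100, 100) (4, 448, 448)
```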

Generalizability of the benchmarking tasks

SubCell belongs to a small group of ML models developed to analyze single-cell fluorescence microscopy images, and the benchmarking tasks and metrics presented were chosen to support this goal. Similar tasks and metrics also apply to single-cell transcriptomic models, where single-cell RNA-seq data are converted into embeddings that can be evaluated with metadata label prediction and cell clustering tasks. However, different types of imaging ML models require tasks and metrics specific to their goals. For example, feature extraction models are also used to analyze high-content screening datasets, but in those cases batch correction is a critical step, and the models are evaluated accordingly (11,12).

Metric selection

The computation of benchmarking metrics often involves nuanced choices that are important to keep in mind when interpreting the resulting numbers. Previous work has shown that slight changes to the metrics or ranking scheme can dramatically change the rankings of models in biomedical image analysis competitions (13). For SubCell benchmarking, multiple metrics were therefore usually reported for each task.
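
A small synthetic example illustrates how much the choice of metric matters: on an imbalanced label set, a classifier that always predicts the majority class beats a more balanced classifier on accuracy but loses badly on macro F1 and balanced accuracy. The predictions below are fabricated purely for illustration.

```python
# Why metric choice matters: two synthetic classifiers swap ranks depending
# on whether accuracy or a class-balanced metric is reported.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, balanced_accuracy_score

rng = np.random.default_rng(0)
y_true = np.array([0] * 900 + [1] * 100)      # 90% majority class

pred_majority = np.zeros_like(y_true)         # always predicts the majority class
pred_balanced = y_true.copy()
flip = rng.choice(len(y_true), size=150, replace=False)
pred_balanced[flip] = 1 - pred_balanced[flip] # ~85% correct in both classes

for name, pred in [("majority", pred_majority), ("balanced", pred_balanced)]:
    print(name,
          "accuracy:", round(accuracy_score(y_true, pred), 3),
          "macro F1:", round(f1_score(y_true, pred, average="macro", zero_division=0), 3),
          "balanced acc:", round(balanced_accuracy_score(y_true, pred), 3))
```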

Soliciting community feedback

As the ML field progresses, more and more ML models are shifting focus from proof of concept toward real biological applications, with the ultimate goals of replacing experiments and reducing the time and cost required to make new discoveries. This requires close collaboration between model developers and biologists to align on goals, define what is useful, and create benchmarking practices to support these efforts. After all, benchmarking is the primary way to build trust in the performance and applicability of models and serves as a bridge between model developers and biologists.

It is equally important to stress that both model development and benchmarking are community efforts that should be guided by community needs and improved through community input. The intention of the platform is to serve as a hub where model developers and biologists can come together to make suggestions, spark discussions, and collaboratively improve the status quo. We hope to facilitate conversations related to benchmarking tasks, such as:

  • What comes to mind after reading this article? What was missing?
  • What are some good datasets that support biologically relevant benchmarking tasks that you would like to suggest to us? What would a good dataset look like?
  • What are some benchmarking tasks and metrics you think are useful for imaging-based ML models?
  • What are the challenges in building trust and helping biologists adopt these ML models?

Stay tuned for more developments on the platform, and please feel free to reach out to us at virtualcellmodels@chanzuckerberg.com with your thoughts!

References

  1. Bunne C, Roohani Y, Rosen Y, et al. How to build the virtual cell with artificial intelligence: Priorities and opportunities. Cell. 2024;187(25):7045-7063. doi:10.1016/j.cell.2024.11.015
  2. Gupta A, Wefers Z, Kahnert K, et al. SubCell: Vision foundation models for microscopy capture single-cell biology. Preprint. bioRxiv. 2024;2024.12.06.627299. doi: 10.1101/2024.12.06.627299
  3. Kobayashi H, Cheveralls KC, Leonetti MD, Royer LA. Self-supervised deep learning encodes high-resolution features of protein subcellular localization. Nat Methods. 2022;19(8):995-1003. doi:10.1038/s41592-022-01541-z
  4. Doron M, Moutakanni T, Chen ZS, et al. Unbiased single-cell morphology with self-supervised vision transformers. Preprint. bioRxiv. 2023;2023.06.16.545359. doi:10.1101/2023.06.16.545359
  5. Cho NH, Cheveralls KC, Brunner AD, et al. OpenCell: Endogenous tagging for the cartography of human cellular organization. Science. 2022;375(6585):eabi6983. doi:10.1126/science.abi6983
  6. Gerbin KA, Grancharova T, Donovan-Maiye RM, et al. Cell states beyond transcriptomics: Integrating structural organization and gene expression in hiPSC-derived cardiomyocytes. Cell Syst. 2021;12(6):670-687.e10. doi:10.1016/j.cels.2021.05.001
  7. Thul PJ, Åkesson L, Wiking M, et al. A subcellular map of the human proteome. Science. 2017;356(6340):eaal3321. doi:10.1126/science.aal3321
  8. Viana MP, Chen J, Knijnenburg TA, et al. Integrated intracellular organization and its variations in human iPS cells. Nature. 2023;613(7943):345-354. doi:10.1038/s41586-022-05563-7
  9. McInnes L, et al. UMAP: Uniform Manifold Approximation and Projection. J Open Source Softw. 2018;3(29):861. doi:10.21105/joss.00861
  10. Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16x16 words: Transformers for image recognition at scale. Preprint. arXiv. 2020;arXiv:2010.11929. doi:10.48550/arXiv.2010.11929
  11. Moshkov N, Bornholdt M, Benoit S, et al. Learning representations for image-based profiling of perturbations. Nat Commun. 2024;15(1):1594. doi:10.1038/s41467-024-45999-1
  12. Celik S, Hütter JC, Carlos SM, et al. Building, benchmarking, and exploring perturbative maps of transcriptional and morphological data. PLoS Comput Biol. 2024;20(10):e1012463. doi:10.1371/journal.pcbi.1012463
  13. Maier-Hein L, Eisenmann M, Reinke A, et al. Why rankings of biomedical image analysis competitions should be interpreted with care. Nat Commun. 2018;9(1):5217. doi:10.1038/s41467-018-07619-7