Simplified overview of the tutorial showing one set of single cell crop images, SubCell model, an embedding, and a UMAP from the whole dataset

Examining Protein Localization Changes Following SARS-CoV-2 Infection

Estimated time to complete: 60 minutes

Learning Goals

  • Learn about image embeddings and SubCell models
  • Understand SubCell model inputs and outputs
  • Run SubCell model inference
  • Use dimensionality reduction to interpret SubCell image embeddings
  • Examine changes in protein localization following SARS-CoV-2 infection

Prerequisites

If running the tutorial locally, the packages listed in the SubCellPortable requirements.txt file (reproduced in the Overview of SubCellPortable section below) are needed.

This tutorial can be run locally with a CPU. However, GPU compute is supported and will significantly speed up computations. In Google Colab, the T4 GPU is recommended.

Overview

Introduction

Introduction to Image Embeddings

In machine learning (ML), embeddings are simplified representations of the input data that encode the key features of the data. Embeddings are a fundamental part of many ML models’ operation since they enable the model to learn complex patterns within the data and represent those patterns in a more compact way. In addition to aiding computation, embeddings themselves can be valuable model outputs. Models that take in data and output embeddings are called encoder models. Encoder models are particularly valuable for analyzing large, complex datasets, where identifying patterns directly from the raw data can be challenging. Since embeddings are rich representations of the data, subsequent ML models can be trained using embeddings from encoder models instead of using the raw data. For example, classifier models can be trained on the output of image encoder models.

Exploring the embedding space can also reveal insights about the underlying raw data. For instance, outliers in the embedding space may indicate anomalies in the input data, or embedding similarity can indicate groupings of the raw data. In either case, exploring the embedding space can be facilitated by further reducing the dimensions of the embeddings to a 2D or 3D space for visual inspection. In this tutorial, the Uniform Manifold Approximation and Projection (UMAP) method will be used to reduce the dimensionality of the embeddings.

Introduction to SubCell

SubCell is a suite of image encoder models developed by Ankit Gupta in Professor Emma Lundberg's lab. The SubCellPortable GitHub repository provides code to run the suite of models, along with classifier models trained on the SubCell embeddings to assign a given protein of interest to one of 31 localization categories, such as nucleoli, vesicles, or mitochondria.

Each image encoder model takes as input fluorescence microscopy images of cells stained for a protein of interest along with reference markers (nuclei, microtubules, and endoplasmic reticulum) and outputs image embeddings. SubCell models were trained with different combinations of reference markers, and each model therefore expects different input channels for inference. These are summarized in the table below along with the abbreviations used for each model in SubCellPortable. For each set of reference markers, 2 encoder models are available: the "ViT" version was trained with only a protein-specific loss, and the "MAE" version was trained with Masked Autoencoder (MAE) reconstruction loss together with cell-specific and protein-specific losses.

| SubCell Model | SubCellPortable Name | Reference Images Required |
| --- | --- | --- |
| DNA-protein | bg | nuclei |
| MT-DNA-protein | rbg | microtubules and nuclei |
| all-channels | rybg | microtubules, ER, and nuclei |
| ER-DNA-protein | ybg | ER and nuclei |

SubCell was trained on individual cell crops from the Human Protein Atlas (HPA) SubCellular data, which includes immunofluorescence of 13,147 proteins of interest and 37 different human cell lines. Below are example field of view images for each of the 4 channels in the 2D HPA data: endoplasmic reticulum (yellow), nucleus (blue), microtubules (red), and protein of interest (green).

Example 4 channel images from HPA

This tutorial demonstrates how to run model inference with SubCell. Model inference is the process of feeding input data into a trained machine learning model, in this case a SubCell model whose weights have already been learned and are frozen, to compute outputs such as embeddings.

SubCell model inference has specific requirements for input data:

  • The images must be 2D, so for 3D data, use the max projection along z to create 2D images.
  • The resolution of the images must be high enough to segment individual cells and resolve protein patterns.
  • Each input image must be of a single cell, so field of view images should be segmented into individual cells, and each cell should be cropped from the field of view image and saved as a separate file. Any cell segmentation method may be used for this step.
  • Each channel in the cell crop must be saved as a separate PNG file.
  • Images must be 640 x 640 pixels in size with an 80.0885 nm pixel size; rescaling of the image pixel size and resizing of the images may be necessary (see the preprocessing sketch after this list).
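
The sketch below illustrates these preprocessing steps for a single, hypothetical cell crop using numpy and scikit-image (both in SubCellPortable's requirements). The file name, image stack, and crop location are placeholders, and the 205 nm original pixel size is taken from the tutorial dataset described later; adjust these to your own data.

# A minimal preprocessing sketch, assuming a hypothetical 3D image stack with
# shape (z, y, x); file names and the crop location are placeholders.
import numpy as np
from skimage.transform import rescale, resize
from skimage.io import imsave

original_pixel_size_nm = 205.0   # pixel size of the raw data (tutorial dataset)
target_pixel_size_nm = 80.0885   # pixel size expected by SubCell

# 1. Collapse a 3D stack to 2D with a max projection along z
stack = np.random.rand(3, 512, 512).astype(np.float32)  # placeholder 3-plane stack
projection = stack.max(axis=0)

# 2. Rescale so one pixel corresponds to ~80.0885 nm (factor of ~2.56 here)
scale_factor = original_pixel_size_nm / target_pixel_size_nm
rescaled = rescale(projection, scale_factor, anti_aliasing=True)

# 3. Crop the segmented cell (placeholder bounding box) and resize to 640 x 640
cell_crop = rescaled[0:640, 0:640]
cell_crop = resize(cell_crop, (640, 640), anti_aliasing=True)

# 4. Save each channel of the crop as its own PNG file
imsave("cell_0001_protein.png", (cell_crop * 65535).astype(np.uint16))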

Setup

Google Colab and SubCellPortable must be set up to complete this tutorial.

However, with some modification the same tutorial can be run locally with the provided demo data or with your own data! If you choose to work locally, best practice is to use an environment manager. Environment managers allow the creation of multiple virtual environments, like separate sandboxes, on a computer for installing programs, like SubCellPortable, without affecting other parts of the system. Virtualenv is the recommended environment manager for SubCellPortable.

Setup Google Colab

This tutorial is a notebook that can be run within the Google Colab interface.

To start, connect to the T4 GPU runtime hosted for free by Google Colab using the dropdown menu in the upper right hand corner of this notebook. Using a GPU significantly speeds up running model inference.

Note that this tutorial will use commands written for Google Colab, and some of those commands may need to be modified to work with other computing setups.

Setup SubCellPortable

SubCellPortable is a convenient code wrapper for running the SubCell models in inference, along with the provided classifier models, in a local environment or, in this case, on Google Colab.

To run SubCellPortable in Google Colab, start by cloning the SubCellPortable repo and navigate to the newly created SubCellPortable folder using the commands below. The folder will also be present in the file management system in Google Colab which is accessible by clicking the folder icon on the left hand side bar of this notebook.

# clone the SubCellPortable repo
!git clone https://github.com/CellProfiling/SubCellPortable.git

# navigate to the SubCellPortable directory
%cd /content/SubCellPortable

Overview of SubCellPortable

SubCellPortable contains several items in its top level directory, which are described in the table below.

| File Name | Description/Purpose |
| --- | --- |
| models | Subdirectory containing information about models available for inference |
| LICENSE | Licensing information; SubCell is licensed under the MIT License (https://opensource.org/license/mit) |
| README.md | Summary of SubCell usage and requirements |
| config.yaml | Optional file for specifying inference parameters |
| inference.py | Submodule that defines functions for running inference; used in `process.py` |
| models_urls.yaml | Optional file for specifying the URLs for downloading the models |
| path_list.csv | Example of the required file that specifies data locations for model inference |
| process.py | Master module for running model inference; call `process.py` to run model inference |
| requirements.txt | List of required packages for running SubCell |
| vit_model.py | Submodule for running the vision transformer; used in `process.py` |

The packages required for model inference are listed in requirements.txt and below for convenience.

scikit-image==0.22.0
torch==2.4.1
torchvision==0.19.1
PyYAML==6.0.1
transformers==4.45.1
numpy==1.26.4
pandas==2.2.3
requests==2.32.3

Install those packages using the following. This may take a few minutes, and note that errors in the output dialog will not prevent proceeding with the tutorial as long as the code cell finishes running.

# Install all packages in requirements.txt
!pip install -r requirements.txt

Models Subdirectory

The models subdirectory contains a folder for each of the available models, i.e. ybg or rybg, and in each of those folders, 2 subfolders, mae_contrast_supcon_model and vit_supcon_model, correspond to the 2 versions of the encoder models that were trained with different loss functions. For each model, the model_config.yaml file specifies the model location on the machine and the parameters for running the model.

The model_config.yaml file can be manually updated with the absolute path to the model if the model has already been downloaded to the machine. However, it is updated automatically using the models_urls.yaml file whenever the update_model parameter in process.py is set to True. In this tutorial, a downloaded model will be used, so the model_config.yaml file will be manually updated before running model inference.

path_list.csv

In SubCellPortable, the path_list.csv file specifies the locations of the input cell crops (640 x 640 pixel PNG files) on the machine and defines output parameters. Each row in path_list.csv corresponds to a cell crop image. The first 4 columns hold the locations of the images for each channel in the following order: microtubule marker, endoplasmic reticulum (ER) marker, nuclei marker, and protein of interest marker; these are referred to as r_image, y_image, b_image, and g_image, respectively, in SubCellPortable. Depending on the selected model, some of the image location columns can be left blank. The last 2 columns are output_folder and output_prefix, which specify where to store the model output and what unique prefix to give the resulting files for a given cell crop image (see the Model Outputs section for details on these output files). Each image entry in path_list.csv is the (relative or absolute) path to the corresponding image file on the machine, or in this case, in Google Colab's file management system. An example path_list.csv file is provided in the repo; double click on the file to open a preview in Google Colab. Note that a leading # comments out a row, so the header row naming the columns begins with #.
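
For orientation, a hypothetical path_list.csv with all 4 channels populated might look like the following; the file names and prefixes are illustrative only, and image location columns not needed by the chosen model can simply be left blank:

#r_image,y_image,b_image,g_image,output_folder,output_prefix
cell_0001_microtubules.png,cell_0001_er.png,cell_0001_nuclei.png,cell_0001_protein.png,output,cell_0001
cell_0002_microtubules.png,cell_0002_er.png,cell_0002_nuclei.png,cell_0002_protein.png,output,cell_0002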

Use Case

SubCell can be used for a wide variety of applications that involve exploring protein localization patterns. In this tutorial, changes to protein localization patterns in vitro following infection with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) will be explored, building on the study 'Subcellular mapping of the protein landscape of SARS-CoV-2 infected cells for target-centric drug repurposing' by JM Kaimal et al. SARS-CoV-2 infection causes COVID-19 in humans, and understanding how infection impacts cells in vitro can elucidate mechanisms of action that reveal potential preventative and treatment options for COVID-19.

In the study referenced above, the authors used antibodies from the Human Protein Atlas for immunofluorescence imaging to analyze changes in host protein levels and subcellular localization upon SARS-CoV-2 infection. Using 602 antibodies targeting 662 genes, the team conducted immunostaining in infected and non-infected Vero E6 cells with markers for SARS-CoV-2 infection, endoplasmic reticulum, nucleus, and protein of interest. Images were acquired with 9 fields of view and 3 z planes per protein of interest. The images were analyzed using the Covid Image Annotator tool on the ImJoy platform, where a DPNUnet model segmented cells and identified infected vs. non-infected cells based on staining of the SARS-CoV-2 nucleocapsid protein. Through laborious, manual image annotation, they identified 97 proteins that exhibited either spatial redistribution or altered abundance between infected and non-infected cells. In the future, SubCell or a similar model may be able to replace this time-intensive process.

The available raw data is not in the format required by SubCell. The images were acquired with an Opera Phenix high-content microscope (PerkinElmer) in confocal mode with a 63X water objective and a binning factor of 2, resulting in an effective pixel size of 205 nm, and the data consists of full field of view images containing multiple cells per image.

To prepare this data for SubCell, images were segmented into individual cell crops, rescaled to an 80 nm pixel size, resized to 640 x 640 pixels with the cell located at the center of the cell crop image, and each channel (ER, nucleus, and protein) was saved as its own PNG file. These PNG files do not include SARS-CoV-2 virus staining. Instead, the infection status is indicated with a 1 for infected and 0 for uninfected in the third column of a metadata file, single_cell_metadata.csv, which also includes corresponding information on the antibody used in each well to label the relevant protein of interest.

Only a subset of the data, covering 4 proteins, will be examined in this tutorial. A table of the genes associated with the antibody stains, the well IDs for the images, and the observed localization changes is shown below.

| Gene | Well | Observed Change |
| --- | --- | --- |
| TMEM214 | 1_E3 | None |
| HSPA* | 2_H10 | Spatial, Intensity up |
| GANAB | 4_G11 | Spatial |
| CMPK1 | 6_F5 | Spatial |

Download Data, Metadata, and Models for the Tutorial

Since the data contains ER and nuclei reference markers, a ybg model will be used for inference. In this tutorial, the "MAE" version of the ybg model will be used, but please refer to the SubCell preprint paper for guidance on which model version is most appropriate for a given use case. To examine changes in localization, one of the provided classifier models will be used.

Image data, the metadata file single_cell_metadata.csv, a prepared path_list.csv, ybg model, and a classifier model can be downloaded as a zip file using the below code or manually from this link.

# download zip file containing data, models, and path_list.csv
!gdown --fuzzy https://drive.google.com/file/d/1-Vym0Yr2ZGnqRX4UsIorDH1vXBqzSNKF/view?usp=sharing

A file, subcell_tutorial_data_models.zip, should now appear in the file manager under the SubCellPortable directory. To unzip it and replace the example path_list.csv file in SubCellPortable with the one prepared for this tutorial, use the following. This may take a few minutes.

# unzip the file and replace the path_list.csv example file with the one in the zip file
!unzip -o subcell_tutorial_data_models.zip

Unzipping the file creates 4 items:

  • images folder with 3 subfolders, er, nucleus, and protein, containing the images corresponding to each channel.
  • subcell_encoder_model.pth (332 MB), the image encoder model.
  • subcell_classifier_model.pth (3 MB), the classifier model.
  • path_list.csv, the prepared input file listing the tutorial images.

In path_list.csv, the image locations are relative paths, the output folder is called output for all images, and the unique output prefixes were defined using information from single_cell_metadata.csv, where for each cell crop image, its well, image, and cell ID are combined with infection status ("infected" or "uninfected"). For example, the output prefix "1_E3_3_25_uninfected" indicates that cell number 25 from the image 3 collected from well 1_E3 was not infected with SARS-CoV-2. A careful choice of output prefix aids annotation of the resulting UMAPs as will be demonstrated later in the tutorial.
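
As a sketch of how such prefixes could be assembled from single_cell_metadata.csv, the snippet below uses hypothetical column names (well_id, image_id, cell_id, infected); the actual metadata file may name its columns differently.

import pandas as pd

# Hypothetical column names; adjust to match single_cell_metadata.csv
meta = pd.DataFrame({
    "well_id": ["1_E3", "1_E3"],
    "image_id": [3, 3],
    "cell_id": [25, 26],
    "infected": [0, 1],  # 1 = infected, 0 = uninfected, as in the metadata file
})

# Combine well, image, and cell IDs with the infection status into an output prefix
status = meta["infected"].map({1: "infected", 0: "uninfected"})
meta["output_prefix"] = (
    meta["well_id"] + "_"
    + meta["image_id"].astype(str) + "_"
    + meta["cell_id"].astype(str) + "_"
    + status
)
print(meta["output_prefix"].tolist())  # ['1_E3_3_25_uninfected', '1_E3_3_26_infected']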

Run Model Inference

To run inference, the models and data locations must be specified. The downloaded path_list.csv file already specifies the data locations using relative paths, and the model can be specified with the instructions below.

Specify Model

The paths to the subcell_encoder_model.pth and subcell_classifier_model.pth models must be set in the file /content/SubCellPortable/models/ybg/mae_contrast_supcon_model/model_config.yaml. Double click the link, edit the classifier_paths and encoder_path fields with the locations of the model in Google Colab, and save it (CTRL+S or CMD+S). The updated file is shown below:

classifier_paths:
  - "/content/SubCellPortable/subcell_classifier_model.pth"
encoder_path: "/content/SubCellPortable/subcell_encoder_model.pth"
model_config:
  vit_model:
    hidden_size: 768
    num_hidden_layers: 12
    num_attention_heads: 12
    intermediate_size: 3072
    hidden_act: "gelu"
    hidden_dropout_prob: 0.0
    attention_probs_dropout_prob: 0.0
    initializer_range: 0.02
    layer_norm_eps: 1.e-12
    image_size: 448
    patch_size: 16
    num_channels: 3
    qkv_bias: True
  pool_model:
    dim: 768
    int_dim: 512
    num_heads: 2
  num_classes: 31
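
If you prefer to update the file programmatically rather than by hand, a sketch using PyYAML (already listed in requirements.txt) could look like the following; it assumes the keys shown above already exist in the file.

import yaml

config_path = "/content/SubCellPortable/models/ybg/mae_contrast_supcon_model/model_config.yaml"

# Load the existing config, point it at the downloaded models, and write it back
with open(config_path) as f:
    cfg = yaml.safe_load(f)

cfg["classifier_paths"] = ["/content/SubCellPortable/subcell_classifier_model.pth"]
cfg["encoder_path"] = "/content/SubCellPortable/subcell_encoder_model.pth"

with open(config_path, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)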

Choose Method for Running Inference

There are 3 options for running SubCellPortable for model inference. This tutorial uses the Option 1 method.

Each option is described below with the same parameters, namely running SubCellPortable with the "MAE" version of the ybg model; using the model information in the model_config.yaml file instead of downloading the model using the models_urls.yaml file; creating the optional result.csv output file; and running inference with a GPU (T4 GPU in Google Colab is used in this tutorial, but use "-1" instead of "0" for CPU compute).

Option 1: Using constants in process.py

If you plan to run SubCell with the same model and parameters, simply edit the constants directly in lines 35-40 in /content/SubCellPortable/process.py and run model inference with: python process.py

Double click the above link, edit lines 36-40 with the code below, and save it before continuing.

config["model_channels"] = "ybg" # select the ybg model
config["model_type"] = "mae_contrast_supcon_model" # select the model type
config["update_model"] = False # use the model in the models folder
config["create_csv"] = True # create results.csv
config["gpu"] = 0 # use the GPU numbered 0 (use -1 for CPU compute)

The other options are described below for convenience, but skip to the next cell to continue the tutorial.

Option 2: Using flags to specify parameters

Alternatively, you can specify all the model parameters as additional flags when calling python process.py. To see a list of available parameters, call: python process.py --help

The available flags along with their arguments are described in the table below.

| Flag | Argument | Usage |
| --- | --- | --- |
| --help | N/A | Show the help message that lists optional arguments |
| -c | rybg, rbg, ybg, or bg | Model to be used |
| -t | mae_contrast_supcon_model or vit_supcon_model | Model version to be used; mae_contrast_supcon_model is the default |
| -u | True or False | update_model parameter; sets whether the model should be downloaded using the URL in the models_urls.yaml file |
| -csv | True or False | Create results.csv |
| -g | -1, 0, 1, 2, 3 | -1: use CPU; otherwise, GPU ID to use for inference |

An example call that achieves the same result as described above is:

python process.py -c ybg -t mae_contrast_supcon_model -u False -csv True -g 0

Option 3: Using the config.yaml file

The final option for running SubCellPortable is to edit the config.yaml file with the desired parameters and simply call: python process.py

An example config.yaml file is below:

model_channels: "ybg"
model_type: "mae_contrast_supcon_model"
update_model: False
create_csv: True
gpu: 0

Note that the default config.yaml file in SubCellPortable has all of the fields commented out using the # symbol at the start of each line.

Start Model Inference

Simply run the next code cell to start inference on the data. This may take a few minutes.

# Run Model Inference (Using Option 1 with parameters defined in process.py)
!python process.py
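
Once the cell finishes, you can check that the expected files were written; the sketch below assumes the outputs landed in the output folder named in the prepared path_list.csv.

from pathlib import Path

# Count the per-cell output files produced by inference
output_dir = Path("/content/SubCellPortable/output")
for pattern in ["*_embedding.npy", "*_probabilities.npy", "*_attention_map*"]:
    print(pattern, len(list(output_dir.glob(pattern))))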

Model Outputs

SubCellPortable has 3 default outputs and 2 optional ones summarized in the table below.

Output
Description
Type
*_embedding.npy
1536 dimension embedding vector
Default
*_probabilities.npy
Array of weighted probabilities of each subcellular location class
Default
*_attention_map
64x64 PNG thumbnail of the attention map
Default
log.txt
Command line interface log from running the model
Optional
results.csv
Metadata, probability arrays, and embeddings for all cells
Optional

Attention maps are grayscale images that indicate where the model has focused, with higher values corresponding to areas of greater focus. The "MAE" version of the ybg model has 2 attention heads, which, like a human observer, appear to focus on the cytosol and nuclei of the cells. An example of an attention map from running the model, containing the results for each attention head, is shown below.

Attention map
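
To view one of your own attention maps inside the notebook, a sketch like the following can be used; the exact file name is an assumption built from the example output prefix and the *_attention_map naming convention, so adjust it to a file that exists in your output folder.

import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# Assumed file name; adjust to an attention map present in your output folder
attention = mpimg.imread("/content/SubCellPortable/output/1_E3_3_25_uninfected_attention_map.png")

plt.imshow(attention, cmap="gray")
plt.axis("off")
plt.title("SubCell attention map")
plt.show()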

SubCellPortable has IDs for each localization category. These IDs are referenced in *_probabilities.npy and result.csv. The localization categories with their corresponding IDs can also be found in /content/SubCellPortable/inference.py in lines 7-39 under CLASS2NAME. Double click the link to open inference.py and review the mappings of IDs to localization category if desired.
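
Since the working directory is already /content/SubCellPortable, the mapping can also be printed programmatically; this assumes CLASS2NAME is importable as a module-level dictionary from inference.py.

# Assumes CLASS2NAME in inference.py is a module-level dict of ID -> category name
from inference import CLASS2NAME

for class_id, class_name in sorted(CLASS2NAME.items()):
    print(class_id, class_name)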

result.csv is a compilation of metadata, probability arrays, and embeddings for all cells, which is a convenient collection of data for subsequent analysis.

In result.csv, each row corresponds to one cell, and the columns of result.csv are described below:

| Column | Description |
| --- | --- |
| id | Specified output_prefix |
| top_class_name | Name of the most likely localization class |
| top_class | ID of the most likely localization class |
| top_3_classes_names | Names of the top 3 most likely localization classes |
| top_3_classes | IDs of the top 3 most likely localization classes |
| prob00 - prob30 | Probabilities array for all localization classes |
| feat0000 - feat1535 | 1536 dimension embedding vector |
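
Before moving on to the embeddings, a quick look at the classifier predictions in result.csv can serve as a sanity check; for example, the snippet below tallies the most likely localization class across all cells.

import pandas as pd

# Tally the predicted top localization class across all cells
results = pd.read_csv("/content/SubCellPortable/result.csv")
print(results["top_class_name"].value_counts())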

The rest of this tutorial will describe how to analyze result.csv to explore the embeddings.

Analysis of Model Outputs

The UMAP dimensionality reduction algorithm will be used to enable visualization of the embeddings. Google Colab has several libraries pre-installed for numerical data analysis and visualization, which will be imported in the notebook, but it does not have the library for UMAP pre-installed, so to start, install umap-learn with the following.

!pip install umap-learn

# Import libraries that come pre-installed in Google Colab
import pandas as pd # data analysis library
import matplotlib.pyplot as plt # visualization library
import seaborn as sns # visualization library

# Import umap-learn library that was installed in the above cell
import umap # dimensionality reduction library

To apply UMAP, the embedding vectors need to be read, and to interpret the resulting UMAP, the reduced embeddings need to be annotated with the labeled gene and the infection status of each cell. The result.csv file can be read and modified using the code below to append 2 additional annotation columns, one for the gene and one for the infection status of the cell.

# read result.csv into a pandas dataframe
df = pd.read_csv("/content/SubCellPortable/result.csv")

# define 'ids' as the id column from `result.csv` (output prefix for a cell)
ids = df["id"]

# add a column 'gene' that captures the well id of the cell (extract just the first 2 elements of the id column)
df["gene"] = ids.str.extract(r'^([^_]+_[^_]+)')[0]

# Convert the well ids to gene names in the 'gene' column
dataset_mapping = {
    '1_E3': 'TMEM214',
    '2_H10': 'HSPA*',
    '4_G11': 'GANAB',
    '6_F5': 'CMPK1'
}
df['gene'] = df['gene'].replace(dataset_mapping)

# add a column 'infection_status' that indicates whether the cell was infected or uninfected (extract the last element of the id)
df["infection_status"] = ids.str.extract(r'([^_]+)$')[0]

The UMAP should only be performed on the embedding vectors in result.csv, so next, extract the columns containing the embedding vectors and perform UMAP on those vectors. For more information on the UMAP library, refer to this documentation.

# extract the embedding vectors (all columns that start with 'feat')
features = df.loc[:, df.columns.str.startswith("feat")]

# apply UMAP to the embedding vectors
reducer = umap.UMAP()
reduced_features = reducer.fit_transform(features)
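
Note that UMAP is stochastic, so the exact coordinates will differ between runs. If reproducible plots are desired, the reducer can be seeded as sketched below; n_neighbors and min_dist are shown at their umap-learn default values for illustration.

# Optional: seed UMAP for reproducible coordinates; n_neighbors and min_dist
# are the umap-learn defaults, written out explicitly here
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
reduced_features = reducer.fit_transform(features)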

To understand how the data has transformed with each modification, optionally display the data in df, features, and reduced_features with the 3 code cells below.

# OPTIONAL: view the data in df = result.csv with 2 new columns ('gene', 'infection_status')
df

# OPTIONAL: view the data in features = 1536 dimension embeddings
features

# OPTIONAL: view the data in reduced_features = 2 column array with one row for each embedding
reduced_features

With the UMAP performed, the reduced embeddings can be visualized in 2D using matplotlib.

# Create a dataframe for the reduced embeddings along with their annotations
umap_df = pd.DataFrame(reduced_features, columns=["UMAP1", "UMAP2"])
umap_df["gene"] = df["gene"]
umap_df["infection_status"] = df["infection_status"]

# Use matplotlib and seaborn to visualize the results as a UMAP
plt.figure(figsize=(10, 8))
sns.scatterplot(
    data=umap_df,
    x="UMAP1",
    y="UMAP2",
    hue="gene",
    style="infection_status",
    palette="tab10",  # Set palette for unique colors per dataset
    markers=["o", "P"],  # Shapes for infected and uninfected
    alpha=0.7
)

# Add legend and title
plt.legend(title="Gene and Infection Status")
plt.title("UMAP Projection of Feature Data")
plt.show()
UMAP Projection of Feature Data

While scientific interpretation of the UMAP is not the focus of this tutorial, below is the summary table of the changes observed in the referenced study for each of the genes.

| Gene | Observed Change |
| --- | --- |
| TMEM214 | None |
| HSPA* | Spatial, Intensity up |
| GANAB | Spatial |
| CMPK1 | Spatial |

At a glance, the distributions of blue circles (infected) and crosses (uninfected) corresponding to the TMEM214 gene appear to overlap more than the distributions of the other 3 genes' reduced embeddings. This is in line with the study's observation of no change in protein localization pattern between infected and uninfected cells in the TMEM214 group, whereas changes were observed for the other 3 genes.

In the above UMAP, each datapoint corresponds to one cell in the data, and the spread of the datapoints for a given condition represents the variation present in the underlying image data. However, there are other ways of grouping data for a UMAP. To illustrate one of them, in the following section, UMAP will be performed on the averaged embeddings for each condition (e.g. uninfected cells with proteins from the GANAB gene labeled).

# Collect the embedding column names (those that start with "feat")
feat_columns = [col for col in df.columns if col.startswith("feat")]

# Group by 'gene' and 'infection_status' and calculate the mean of only the "feat" columns
averaged_feat = df.groupby(['gene', 'infection_status'])[feat_columns].mean().reset_index()

To understand how the data has transformed, optionally display the data in averaged_feat with the code cell below. There are now only 8 rows, one for each of the 8 conditions (4 genes × 2 infection statuses) in the data.

# OPTIONAL: view the data in averaged_feat = 8 row array of averaged embeddings
averaged_feat

In these next cells, UMAP will be performed on the averaged embeddings, and the resulting reduced embeddings will be visualized using the same approach as used above.

# apply UMAP to the averaged embedding vectors
reduce_avg = umap.UMAP()
avg_embedding = reduce_avg.fit_transform(averaged_feat.iloc[:, 2:]) # using iloc to select the columns with the embeddings
# Create a dataframe for the UMAP results and labels
umap_avg = pd.DataFrame(avg_embedding, columns=["UMAP1", "UMAP2"])
umap_avg["gene"] = averaged_feat["gene"]
umap_avg["infection_status"] = averaged_feat["infection_status"]
# Plotting
plt.figure(figsize=(10, 8))
sns.scatterplot(
    data=umap_avg,
    x="UMAP1",
    y="UMAP2",
    hue="gene",
    style="infection_status",
    palette="tab10",  # Set palette for unique colors per dataset
    markers=["o", "P"],  # Shapes for infected and uninfected
    alpha=0.7
)

# Add legend and title
plt.legend(title="Gene and Infection Status")
plt.title("UMAP Projection of Feature Data")
plt.show()
UMAP Projection of Feature Data

These averaged embeddings may not look exactly as expected; the visualization above is for illustrative purposes only, demonstrating one way to average embeddings and visualize the resulting UMAP. Averaging embeddings before applying UMAP can obscure the natural variation between individual datapoints by collapsing unique features into a single averaged representation, potentially masking important group-specific patterns or clusters that UMAP would otherwise capture.

UMAP approximates the underlying topological manifold of the dataset in a lower-dimensional space. Because this tutorial uses only a small subset of the data (4 genes out of a total of 662), the resulting embeddings do not represent the full variation and complexity of the manifold of the full dataset. As a result, patterns or groupings seen here may not generalize to the entire dataset and could be misleading if interpreted as comprehensive.

In addition, when simplifying the data through averaging, multiple methods should be explored (e.g. averaging across images, averaging across genes, etc.), and whenever using strategies like UMAP, the effect of parameter changes should be explored before the results are trusted for scientific evaluation. See this example of exploring the parameter space from the UMAP-learn documentation for more information.
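
As a starting point for that kind of exploration, the sketch below re-runs UMAP on the single-cell embeddings over a small grid of n_neighbors and min_dist values and plots the projections side by side. It reuses the df and features objects defined earlier and may take a few minutes to run.

import itertools

# Re-run UMAP on the single-cell embeddings over a small parameter grid
n_neighbors_values = [5, 15, 50]
min_dist_values = [0.1, 0.5]
gene_codes = pd.factorize(df["gene"])[0]  # integer code per gene for coloring

fig, axes = plt.subplots(len(min_dist_values), len(n_neighbors_values), figsize=(15, 8))
for (i, min_dist), (j, n_neighbors) in itertools.product(
    enumerate(min_dist_values), enumerate(n_neighbors_values)
):
    coords = umap.UMAP(n_neighbors=n_neighbors, min_dist=min_dist).fit_transform(features)
    axes[i, j].scatter(coords[:, 0], coords[:, 1], c=gene_codes, cmap="tab10", s=5)
    axes[i, j].set_title(f"n_neighbors={n_neighbors}, min_dist={min_dist}")
plt.tight_layout()
plt.show()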

The above steps and code can be modified to analyze the full dataset or your own data of interest for rigorous scientific inquiry.

Summary

Image embeddings are vector outputs of image encoder models that represent the essential features or patterns of images. Since embeddings capture the key features of the image data, images that are visually or categorically similar will have embeddings that are similar to each other in the embedding space. Embeddings can be used for computation or to train downstream models like classifiers, but embeddings themselves can also be valuable as model outputs. Exploring the embedding space can be facilitated by further reducing the dimensions of the embeddings to a 2D or 3D space for visual inspection using algorithms like UMAP.

SubCell is a suite of image encoder models developed by Ankit Gupta in Professor Emma Lundberg's lab. The models take in fluorescence microscopy images of cells and output image embeddings, along with predictions of protein localization from a classifier model trained on SubCell embeddings. SubCell can be used for a wide variety of applications that involve exploring protein localization patterns. In this tutorial, data from the study "Subcellular mapping of the protein landscape of SARS-CoV-2 infected cells for target-centric drug repurposing" by JM Kaimal et al. were explored using SubCellPortable to examine how embeddings can represent changes to protein localization patterns in vitro following infection with SARS-CoV-2.

Contact and Acknowledgments

For issues with this tutorial please contact virtualcellmodels@chanzuckerberg.com.

Special thanks to Ankit Gupta, William Leineweber, and Frederic Ballllosera from Professor Emma Lundberg's lab for their consultation on this tutorial.

References

Responsible Use

We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when engaging with our services.

Should you have any security or privacy issues or questions related to the services, please reach out to our team at security@chanzuckerberg.com or privacy@chanzuckerberg.com respectively.