SubCell Training Data From Human Protein Atlas

Version v1.0, processed, released 18 Nov 2024

License

Creative Commons Attribution-ShareAlike 3.0 International License

Repository

https://github.com/CellProfiling/subcell-analysis

Developed By

The Lundberg Lab (Stanford University)

This dataset contains cropped images from the Human Protein Atlas (HPA) Subcellular Section that were generated to train the SubCell models. The original HPA images (version 23) were segmented to identify cells and 1024 x 1024 pixel crops were made centering around each identified cell. Both cropped images (8-bit or 16-bit) and their corresponding segmentations (8-bit) are provided in PNG format.

Dataset Overview

Data Type

Fluorescence Microscopy Images

Citation

SubCell: Vision foundation models for microscopy capture single-cell biology. Ankit Gupta, Zoe Wefers, Konstantin Kahnert, Jan N Hansen, William D. Leineweber, Anthony Cesnik, Dan Lu, Ulrika Axelsson, Frederic Ballllosera Navarro, Theofanis Karaletsos, Emma Lundberg. bioRxiv 2024.12.06.627299; doi: https://doi.org/10.1101/2024.12.06.627299.

Dataset Card Authors

Chan Zuckerberg Initiative

Dataset Card Contact

virtualcellmodels@chanzuckerberg.com

Uses

Primary Use Cases

Train the SubCell models
Run images through a trained SubCell model to compute embeddings and attention maps
Train other image-based models
Compare or benchmark performance of machine learning models

Out-of-Scope or Unauthorized Use Cases

Do not use the dataset for the following purposes:

Any use not covered by the license.
Any use that is not in accordance with the Acceptable Use Policy.

Intended Users

ML specialists

Dataset Structure

In s3://czi-subcell-public/hpa-processed/cell_crops/, each subfolder contains all cropped images (1024 x 1024 pixels) from one 96-well plate. The beginning of each filename (e.g. "10_A1") indicates the plate number and the well position, followed by the field-of-view number and the cell number.

The "_cell_image.png" files store the four fluorescence channels in the four channels of the PNG file: microtubules in the Blue (B) channel, endoplasmic reticulum in the Green (G) channel, DNA in the Red (R) channel, and the protein of interest in the Alpha (A) channel. Most of these four-channel PNGs are 16-bit; a small fraction are 8-bit. In Python, cv2.imread(file_path, -1) is recommended for reading a PNG into a NumPy array; the channel order of the resulting array is BGRA.

The "_cell_mask.png" files contain cell masks created by HPA-Cell-Segmentation, in 8-bit PNG format. A CSV file containing metadata and annotations is provided for each field-of-view image. A combined metadata file is available in s3://czi-subcell-public/hpa-processed/.
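The reading recipe above can be sketched in Python as follows. This is a minimal illustration, not code from the SubCell repository: the helper names (parse_crop_name, split_channels, to_float01, load_cell_image) are hypothetical, and the filename pattern <plate>_<well>_<fov>_<cell>_cell_image.png is an assumption inferred from the description, so check it against the actual files before relying on it.

```python
import re
from pathlib import Path

import numpy as np

# ASSUMPTION: filenames look like "<plate>_<well>_<fov>_<cell>_cell_image.png".
# The card only confirms the "<plate>_<well>" prefix and the order of the fields.
_NAME_RE = re.compile(
    r"^(?P<plate>\d+)_(?P<well>[A-H]\d{1,2})_(?P<fov>\d+)_(?P<cell>\d+)"
    r"_cell_(?:image|mask)\.png$"
)


def parse_crop_name(filename):
    """Split a crop filename into plate, well, field of view and cell number."""
    m = _NAME_RE.match(Path(filename).name)
    if m is None:
        raise ValueError(f"unexpected filename layout: {filename}")
    return {
        "plate": int(m["plate"]),
        "well": m["well"],
        "fov": int(m["fov"]),
        "cell": int(m["cell"]),
    }


def split_channels(bgra):
    """Map a BGRA array (as returned by cv2.imread with flag -1) to named channels."""
    if bgra.ndim != 3 or bgra.shape[2] != 4:
        raise ValueError(f"expected an (H, W, 4) array, got shape {bgra.shape}")
    return {
        "microtubule": bgra[..., 0],  # Blue
        "er": bgra[..., 1],           # Green
        "dna": bgra[..., 2],          # Red
        "protein": bgra[..., 3],      # Alpha
    }


def to_float01(img):
    """Scale 8-bit or 16-bit data to float32 in [0, 1], since both depths occur."""
    if img.dtype == np.uint8:
        max_value = 255
    elif img.dtype == np.uint16:
        max_value = 65535
    else:
        raise TypeError(f"unsupported dtype: {img.dtype}")
    return img.astype(np.float32) / max_value


def load_cell_image(path):
    """Read one "_cell_image.png" file and return its named channels."""
    import cv2  # imported here so the helpers above work without OpenCV

    bgra = cv2.imread(str(path), cv2.IMREAD_UNCHANGED)  # same as flag -1
    if bgra is None:
        raise FileNotFoundError(f"could not read image: {path}")
    return split_channels(bgra)
```

Scaling by bit depth (to_float01) is one possible way to handle the mix of 8-bit and 16-bit files; the card does not state how the SubCell pipeline normalizes intensities.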

Personal and Sensitive Information

No personal or sensitive information is included.

Dataset Creation

Curation Rationale

This dataset was created to train the SubCell models and also to generate output embeddings.

Source Data

Human Protein Atlas Subcellular Section

Who are the source data producers?

The Human Protein Atlas Subcellular Profiling group

Data Collection and Processing

The TIFF images of the Human Protein Atlas (HPA) Subcellular Section v23 were used. In these images, each field of view was acquired with 4 channels: microtubule, endoplasmic reticulum, DNA and protein of interest. To prepare these images for SubCell, cells were identified using HPA-Cell-Segmentation, and a 1024 x 1024 pixel crop centered on each identified cell was saved in PNG format to generate this dataset. Crops from 16-bit TIFF images were saved as 16-bit PNGs, and crops from 8-bit TIFF images were saved as 8-bit PNGs.

Note that to train the SubCell models, images in this dataset were further cropped to 896 x 896 pixels and then binned 2x to 448 x 448 pixels. Cells on the edge were removed, and if the cells in two single-cell crops were too close to each other, only one of the two crops was used. To run these images through a trained model and obtain model outputs, a 640 x 640 pixel crop from the center was used without binning, as this size was found to work well.
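The two resizing paths described above can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the SubCell preprocessing code: the function names are hypothetical, the 896 x 896 crop is assumed to be taken from the center (the card only says "further cropped"), and binning is assumed to average each 2 x 2 block (the card does not say whether values were averaged or summed).

```python
import numpy as np


def center_crop(img, size):
    """Return the central size x size window of an (H, W) or (H, W, C) array."""
    h, w = img.shape[:2]
    if size > h or size > w:
        raise ValueError(f"crop size {size} exceeds image size {h} x {w}")
    top = (h - size) // 2
    left = (w - size) // 2
    return img[top:top + size, left:left + size]


def bin2x(img):
    """Halve height and width by averaging non-overlapping 2 x 2 pixel blocks.

    ASSUMPTION: averaging; the dataset card does not specify mean versus sum.
    """
    h, w = img.shape[:2]
    if h % 2 or w % 2:
        raise ValueError("height and width must be even for 2x binning")
    blocks = img.reshape(h // 2, 2, w // 2, 2, *img.shape[2:])
    return blocks.astype(np.float32).mean(axis=(1, 3))


def training_view(img):
    """1024 x 1024 crop -> 896 x 896 center crop -> 448 x 448 after 2x binning."""
    return bin2x(center_crop(img, 896))


def inference_view(img):
    """1024 x 1024 crop -> 640 x 640 center crop, no binning."""
    return center_crop(img, 640)
```

Both functions accept the BGRA array read from a "_cell_image.png" file; the steps that remove edge cells and near-duplicate crops are not covered by this sketch.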

Annotation Process

The "locations" annotation in the metadata CSV comes from HPA. It gives the protein localization for the field-of-view image, based on manual annotation by trained experts. For more details, see https://v23.proteinatlas.org/about/assays+annotation#if. No further annotation was done.

Who are the annotators?

The annotation was done by the team who generated and processed the data.

Bias, Risks, and Limitations

This dataset contains cropped images from field-of-view images and some overlap between cropped images is expected if they originate from the same field-of-view image.
Some biases and limitations from the original dataset may also exist in this dataset. See dataset “Human Protein Atlas Subcellular Section” for information about the source data.
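Because crops from the same field of view can overlap, one possible precaution when benchmarking is to keep all crops from a field of view on the same side of any data split. The sketch below groups crop filenames by field of view. It is only an illustration: group_by_fov is a hypothetical helper, and the filename pattern <plate>_<well>_<fov>_<cell>_cell_image.png is an assumption that should be checked against the actual files.

```python
import re
from collections import defaultdict

# ASSUMPTION: "<plate>_<well>_<fov>_<cell>_cell_image.png"; the first three
# fields together identify one field of view.
_FOV_KEY_RE = re.compile(
    r"^(\d+_[A-H]\d{1,2}_\d+)_\d+_cell_(?:image|mask)\.png$"
)


def group_by_fov(filenames):
    """Group crop filenames by their "<plate>_<well>_<fov>" prefix."""
    groups = defaultdict(list)
    for name in filenames:
        m = _FOV_KEY_RE.match(name)
        if m is None:
            raise ValueError(f"unexpected filename layout: {name}")
        groups[m.group(1)].append(name)
    return dict(groups)
```

Assigning whole groups, rather than individual crops, to training or evaluation sets keeps overlapping pixels from appearing in both.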

Acknowledgements

See Reference.