SubCell Training Data From Human Protein Atlas
Version v1.0, processed
released 18 Nov 2024
The Lundberg Lab (Stanford University)
This dataset contains cropped images from the Human Protein Atlas (HPA) Subcellular Section that were generated to train the SubCell models. The original HPA images (version 23) were segmented to identify cells, and 1024 x 1024 pixel crops centered on each identified cell were made. Both the cropped images (8-bit or 16-bit) and their corresponding segmentations (8-bit) are provided in PNG format.
Dataset Overview
Data Type
Fluorescence Microscopy Images
Citation
Publication: Available Winter 2024
Dataset Card Authors
Chan Zuckerberg Initiative
Dataset Card Contact
virtualcellmodels@chanzuckerberg.com
Uses
Primary Use Cases
- Train the SubCell models
- Run through a trained SubCell model to compute embeddings and attention maps
- Train other image-based models
- Compare or benchmark performance of machine learning models
Out-of-Scope or Unauthorized Use Cases
Do not use the dataset for the following purposes:
- Usage not covered by the license.
- Any use that is not in accordance with the Acceptable Use Policy.
Intended Users
- ML specialists
Dataset Structure
In s3://czi-subcell-public/hpa-processed/cell_crops/, each subfolder contains all cropped images (1024 x 1024 pixels) from one 96-well plate. The beginning of each filename (e.g. "10_A1") indicates the plate number and the well position, followed by the field-of-view number and the cell number.
The "_cell_image.png" files store the four fluorescence channels in the four channels of the PNG file: microtubule fluorescence in the Blue (B) channel, endoplasmic reticulum fluorescence in the Green (G) channel, DNA fluorescence in the Red (R) channel, and the protein of interest fluorescence in the Alpha (A) channel. Note that these are four-channel PNGs; most are 16-bit and a small fraction are 8-bit. If using Python, it is recommended to use cv2.imread(file_path, -1) to read the PNG into a NumPy array (the channel order in the resulting array is BGRA).
The "_cell_mask.png" files contain cell masks created by HPA-Cell-Segmentation, in 8-bit PNG format. A CSV file containing metadata and annotations is provided for each field-of-view image. A combined metadata file can also be found in s3://czi-subcell-public/hpa-processed/.
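As a minimal sketch of the layout described above, the following Python parses a crop filename and splits a loaded image into its four stains. The underscore-separated pattern after the well position, the helper names parse_crop_name and load_channels, and the short channel keys are assumptions made for illustration and are not part of the dataset's own tooling. Only the cv2.imread call with flag -1 and the BGRA channel order come from the description above.

```python
import re
from pathlib import Path

# Assumed filename layout: <plate>_<well>_<fov>_<cell>_cell_image.png
# (the underscore separators after the well position are an assumption).
CROP_NAME = re.compile(
    r"^(?P<plate>\d+)_(?P<well>[A-H]\d{1,2})_(?P<fov>\d+)_(?P<cell>\d+)"
    r"_cell_(?P<kind>image|mask)\.png$"
)

# Index of each stain in the BGRA array returned by cv2.imread(path, -1).
CHANNELS = {"microtubule": 0, "er": 1, "dna": 2, "protein": 3}


def parse_crop_name(path):
    """Split a crop filename into plate, well, field of view, cell, and kind."""
    match = CROP_NAME.match(Path(path).name)
    if match is None:
        raise ValueError(f"unexpected crop filename: {path}")
    return match.groupdict()


def load_channels(path):
    """Read a four-channel crop and return one 2-D array per stain."""
    import cv2  # imported lazily so parse_crop_name works without OpenCV

    # IMREAD_UNCHANGED is the named constant for -1: it keeps the alpha
    # channel and the original bit depth (8-bit or 16-bit).
    img = cv2.imread(str(path), cv2.IMREAD_UNCHANGED)
    if img is None:
        raise FileNotFoundError(path)
    if img.ndim != 3 or img.shape[2] != 4:
        raise ValueError(f"expected a 4-channel PNG, got shape {img.shape}")
    return {name: img[:, :, idx] for name, idx in CHANNELS.items()}
```

Because the crops may be either 8-bit or 16-bit, it is worth checking the dtype of the returned arrays before normalizing intensities.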
Personal and Sensitive Information
No personal or sensitive information is included.
Dataset Creation
Curation Rationale
This dataset was created to train the SubCell models and also to generate output embeddings.
Source Data
Human Protein Atlas Subcellular Section
Who are the source data producers?
The Human Protein Atlas Subcellular Profiling group
Data Collection and Processing
The TIFF images of the Human Protein Atlas (HPA) Subcellular Section v23 were used. In these images, each field of view was acquired with 4 channels: microtubule, endoplasmic reticulum, DNA and protein of interest. To prepare these images for SubCell, cells were identified using HPA-Cell-Segmentation, and 1024 x 1024 pixel crops centered on each identified cell were made and saved in PNG format to generate this dataset. Crops from 16-bit TIFF images were saved as 16-bit PNGs, and crops from 8-bit TIFF images were saved as 8-bit PNGs.
Note that to train the SubCell models, images in this dataset were further cropped to 896 x 896 pixels and then binned 2x to 448 x 448 pixels. Cells on the edge were removed, and if two single-cell crops contained cells that were too close to each other, only one of the two was used. To run these images through a trained model to obtain model outputs, a 640 x 640 pixel crop from the center was used without binning, because that size was found to work well.
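The crop and binning sizes above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the SubCell preprocessing code: the source does not say how the 2x binning was computed, so averaging each 2 x 2 block is an assumption, and the function names are hypothetical. The edge-cell and close-neighbor filtering is not reproduced here.

```python
import numpy as np


def center_crop(img, size):
    """Return the central size x size window of an (H, W[, C]) array."""
    h, w = img.shape[:2]
    if size > h or size > w:
        raise ValueError("crop size exceeds image size")
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]


def bin2x(img):
    """Average each non-overlapping 2 x 2 pixel block.

    Mean binning is an assumption; the dataset card only says "2x binning".
    """
    h, w = img.shape[:2]
    if h % 2 or w % 2:
        raise ValueError("height and width must be even for 2x binning")
    # Split rows and columns into (block index, position within block).
    blocks = img.reshape(h // 2, 2, w // 2, 2, *img.shape[2:])
    # Convert to float first so 16-bit inputs cannot overflow while summing.
    return blocks.astype(np.float32).mean(axis=(1, 3))


def training_view(crop):
    """1024 x 1024 crop -> 896 x 896 center crop -> 448 x 448 after binning."""
    return bin2x(center_crop(crop, 896))


def inference_view(crop):
    """1024 x 1024 crop -> 640 x 640 center crop, no binning."""
    return center_crop(crop, 640)
```

Both helpers accept the four-channel array directly, so the same call works for a full BGRA crop and for a single-channel mask.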
Annotation Process
The "locations" annotation in the metadata CSV came from HPA. It is the protein localization for the field-of-view image, based on manual annotation by trained experts. For more details, see https://v23.proteinatlas.org/about/assays+annotation#if. No further annotation was done.
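As a rough sketch of how the "locations" column might be summarized, the snippet below tallies labels across the rows of a metadata CSV using only the standard library. The function name is hypothetical, and the assumption that several labels share one cell separated by commas should be checked against the actual files. The example rows are made up for illustration only.

```python
import csv
import io
from collections import Counter


def count_locations(csv_file, column="locations", sep=","):
    """Tally localization labels across the rows of a metadata CSV.

    `sep` is the assumed separator between multiple labels in one cell.
    """
    counts = Counter()
    for row in csv.DictReader(csv_file):
        for label in (row.get(column) or "").split(sep):
            label = label.strip()
            if label:
                counts[label] += 1
    return counts


# Made-up rows for illustration only; real metadata files have more columns.
example = io.StringIO(
    "locations\n"
    '"Nucleoplasm,Cytosol"\n'
    "Nucleoplasm\n"
)
print(count_locations(example))
```

For a real file, pass an open file handle, for instance one opened with open(path, newline=""), in place of the in-memory example.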
Who are the annotators?
The annotation was done by the team who generated and processed the data.
Bias, Risks, and Limitations
- This dataset contains images cropped from field-of-view images, so some overlap between cropped images is expected when they originate from the same field-of-view image.
- Some biases and limitations of the original dataset may also be present in this dataset. See the dataset “Human Protein Atlas Subcellular Section” for information about the source data.
Acknowledgements
See Reference.