HPA LMDB dataset

Version v1.0.0 (processed), released 05 Aug 2025

License

CC BY-SA 3.0

Repository

https://github.com/BoHuangLab/CELL-Diff/tree/main/processed_datasets

Dataset Type

Fluorescence Microscopy Images

The HPA LMDB dataset includes immunofluorescence microscopy images of human proteins and their corresponding amino acid sequences. The microscopy images were sourced from the Human Protein Atlas (version 23.0), while the amino acid sequences were obtained from UniProt. Each dataset entry includes a protein sequence, a protein localization image, and associated cell morphology images (nucleus, endoplasmic reticulum, and microtubules). All data are packaged in LMDB format for efficient access.

Developed By

Huang Lab, UCSF

Dataset Overview

Data Type

Fluorescence Microscopy Images

Citation

Zheng, D. and Huang, B. (2025) Bridging Protein Sequences and Microscopy Images with Unified Diffusion Models. Forty-second International Conference on Machine Learning. OpenReview

Dataset Card Authors

Dihan Zheng

Dataset Card Contact

Dihan Zheng (dihan.zheng@ucsf.edu)

Uses

Primary Use Cases

Image generation
Sequence generation

Out-of-Scope or Unauthorized Use Cases

Do not use the dataset for the following purposes:

Any use that violates the CC BY-SA 3.0 license.
Any use that does not comply with the Acceptable Use Policy.

Dataset Structure

The HPA LMDB dataset is designed to enable multimodal learning between protein sequences and their subcellular localization. Each record pairs one protein sequence with its microscopy images: a protein localization image and cell morphology images of the nucleus, endoplasmic reticulum (ER), and microtubules. The data are stored in LMDB format for fast access.
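
As a starting point for exploring the LMDB file, the sketch below lists the first few keys and the size of each stored value. It uses the third-party `lmdb` Python package. This card does not document the key naming scheme, the value serialization, or whether the database is a directory or a single file, so the code reads values as opaque bytes, and the `subdir` flag and the path in the usage comment are assumptions to adjust.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def iter_lmdb_entries(path: str, subdir: bool = True) -> Iterator[Tuple[bytes, bytes]]:
    """Yield raw (key, value) byte pairs from an LMDB environment opened read-only."""
    import lmdb  # third-party package: pip install lmdb

    # lock=False and readahead=False are common settings for read-only access.
    # subdir=True assumes `path` is a directory; pass False for a single file.
    env = lmdb.open(path, subdir=subdir, readonly=True, lock=False, readahead=False)
    try:
        with env.begin(write=False) as txn:
            for key, value in txn.cursor():
                yield bytes(key), bytes(value)
    finally:
        env.close()


def summarize_entries(
    entries: Iterable[Tuple[bytes, bytes]], limit: int = 5
) -> List[Tuple[str, int]]:
    """Return (decoded key, value size in bytes) for the first `limit` entries."""
    return [
        (key.decode("utf-8", errors="replace"), len(value))
        for key, value in islice(entries, limit)
    ]


# Usage (hypothetical path; replace with the location of the downloaded dataset):
# for key, size in summarize_entries(iter_lmdb_entries("path/to/hpa_lmdb")):
#     print(key, size)
```

Keeping the summary step separate from the LMDB access means it can be checked on plain in-memory byte pairs. To decode the values into sequences and images, use the serialization that the loading code in the CELL-Diff repository applies.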

Personal and Sensitive Information

The dataset does not include personally identifiable information (PII) or sensitive data.

Dataset Creation

Curation Rationale

The dataset was curated to enable training and evaluation of generative models that bridge protein sequences and subcellular localization microscopy images.

Source Data

Human Protein Atlas (version 23.0)
UniProt (release 2025_03)

Data Collection and Processing

Microscopy images were downloaded from the Human Protein Atlas Subcellular Section (version 23.0). Protein sequences were retrieved from UniProt (release 2025_03). Images and sequences were matched per protein and integrated into a single LMDB dataset without further processing.
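
The matching step can be pictured as a simple join between image records and a sequence lookup table. The sketch below is illustrative only: this card does not state which identifier was used for matching, so joining on a UniProt accession is an assumption. The field names (`accession`, `image`, `sequence`) and the toy inputs are hypothetical placeholders, not values from the dataset.

```python
from typing import Dict, Iterable, List, Tuple


def match_by_accession(
    image_records: Iterable[dict],
    sequences: Dict[str, str],
) -> Tuple[List[dict], List[str]]:
    """Attach a sequence to each image record that has a matching accession.

    Returns (matched_records, unmatched_accessions).
    """
    matched: List[dict] = []
    unmatched: List[str] = []
    for record in image_records:
        accession = record["accession"]  # hypothetical field name
        sequence = sequences.get(accession)
        if sequence is None:
            unmatched.append(accession)
        else:
            # Build a new dict so the input record is not mutated.
            matched.append({**record, "sequence": sequence})
    return matched, unmatched


# Toy inputs: accessions, file names, and sequences are placeholders, not real data.
images = [
    {"accession": "ACC_1", "image": "img_1.png"},
    {"accession": "ACC_2", "image": "img_2.png"},
]
seqs = {"ACC_1": "MKT"}
matched, unmatched = match_by_accession(images, seqs)
```

Returning the unmatched accessions alongside the matched records makes it easy to see how many images were dropped for lack of a sequence.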

Annotation Process

Details about the annotation process for the HPA dataset can be found on the HPA website.

Who are the annotators?

The HPA and protein sequence datasets were annotated by the original authors.

Biases, Risks, and Limitations

Limitations

The dataset contains only immunofluorescence images, so models trained on it may not transfer to other imaging modalities.

Acknowledgements

We thank the data generators and annotators for their contributions to the Human Protein Atlas and UniProt datasets.