HPA LMDB dataset
Version v1.0.0, processed, released 05 Aug 2025
License
CC BY-SA 3.0
Dataset Type
Fluorescence Microscopy Images
The HPA LMDB dataset includes immunofluorescence microscopy images of human proteins and their corresponding amino acid sequences. The microscopy images were sourced from the Human Protein Atlas (version 23.0), while the amino acid sequences were obtained from UniProt. Each dataset entry includes a protein sequence, a protein localization image, and associated cell morphology images (nucleus, endoplasmic reticulum, and microtubules). All data are packaged in LMDB format for efficient access.
Developed By
Dataset Overview
Data Type
Fluorescence Microscopy Images
Citation
Zheng, D. and Huang, B. (2025) Bridging Protein Sequences and Microscopy Images with Unified Diffusion Models. Forty-second International Conference on Machine Learning. OpenReview
Dataset Card Authors
Dihan Zheng
Dataset Card Contact
Dihan Zheng (dihan.zheng@ucsf.edu)
Uses
Primary Use Cases
- Image generation
- Sequence generation
Out-of-Scope or Unauthorized Use Cases
Do not use the dataset for the following purposes:
- Use that violates the CC BY-SA 3.0 license.
- Any use that is not in accordance with the Acceptable Use Policy.
Dataset Structure
The HPA LMDB dataset is designed to enable multimodal learning between protein sequences and their subcellular localization. Each record contains a protein sequence and corresponding microscopy images, covering protein localization and cellular structures (nucleus, ER, microtubules). The data is organized in LMDB format for fast access.
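Because this card does not document the key scheme or how values are serialized, a reasonable first step is to inspect a few raw entries before writing a loader. The following is a minimal sketch, not the authors' loading code: it uses the third-party `lmdb` package (py-lmdb), and the function names `iter_lmdb` and `describe`, the `subdir` default, and the example path are assumptions made for illustration.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def iter_lmdb(path: str, limit: int = 5, subdir: bool = True) -> Iterator[Tuple[bytes, bytes]]:
    """Yield up to `limit` raw (key, value) pairs from an LMDB environment.

    Set subdir=False if the dataset is a single file rather than a directory
    (which layout this dataset uses is an assumption to check locally).
    """
    import lmdb  # third-party package: pip install lmdb

    # Read-only, lock-free access is a common setup for static dataset files.
    env = lmdb.open(path, subdir=subdir, readonly=True, lock=False, readahead=False)
    try:
        with env.begin() as txn:
            for i, (key, value) in enumerate(txn.cursor()):
                if i >= limit:
                    break
                yield bytes(key), bytes(value)
    finally:
        env.close()


def describe(pairs: Iterable[Tuple[bytes, bytes]]) -> List[Dict[str, object]]:
    """Summarize raw pairs without assuming how the values are serialized."""
    return [
        {
            "key": key.decode("utf-8", errors="replace"),
            "num_bytes": len(value),
            # The leading bytes can hint at the encoding, for example
            # b"\x89PNG" for a PNG image or b"\x80" for a pickle (protocol 2+).
            "head": value[:4],
        }
        for key, value in pairs
    ]


# Example usage (the path is hypothetical):
# for row in describe(iter_lmdb("path/to/hpa_lmdb")):
#     print(row)
```

Once the key format and value encoding are known from this inspection, a proper decoder for the sequence and the four image channels can be written against them.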
Personal and Sensitive Information
The dataset does not include personally identifiable information (PII) or sensitive data.
Dataset Creation
Curation Rationale
The dataset was curated to enable training and evaluation of generative models that bridge protein sequences and subcellular localization microscopy images.
Source Data
- Human Protein Atlas (version 23.0)
- UniProt (release 2025_03)
Data Collection and Processing
Microscopy images were downloaded from the Human Protein Atlas Subcellular Section (version 23.0). Protein sequences were retrieved from UniProt. The data were matched and integrated into a single LMDB dataset without further processing.
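The matching step can be pictured as a join between image records and sequences on a shared protein identifier. The sketch below is illustrative only: the card does not say which identifier was used, so the `accession` join key, the `ImageRecord` type, and the `match_records` function are all hypothetical, and the toy values are made up.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class ImageRecord:
    accession: str  # hypothetical join key (for example, a UniProt accession)
    image_id: str   # hypothetical identifier for one set of microscopy images


def match_records(
    images: Iterable[ImageRecord],
    sequences: Dict[str, str],
) -> Tuple[List[Tuple[ImageRecord, str]], List[ImageRecord]]:
    """Pair each image record with its sequence; collect records with no match."""
    matched: List[Tuple[ImageRecord, str]] = []
    unmatched: List[ImageRecord] = []
    for record in images:
        sequence = sequences.get(record.accession)
        if sequence is None:
            unmatched.append(record)
        else:
            matched.append((record, sequence))
    return matched, unmatched


# Toy data, invented for illustration.
toy_images = [ImageRecord("ACC1", "img_001"), ImageRecord("ACC2", "img_002")]
toy_sequences = {"ACC1": "MKTAYIAK"}
toy_matched, toy_unmatched = match_records(toy_images, toy_sequences)
```

Keeping the unmatched records in a separate list, rather than dropping them silently, makes it easy to report how many image sets had no corresponding sequence.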
Annotation Process
Details about the annotation process for the HPA dataset can be found on the HPA website.
Who are the annotators?
The HPA and protein sequence datasets were annotated by the original authors.
Biases, Risks, and Limitations
Limitations
- The dataset is specific to immunofluorescence imaging.
Acknowledgements
We thank the data generators and annotators for their contributions to the Human Protein Atlas and UniProt datasets.