Benchmarking PBMC Dataset from Zheng et al. (Zheng68K)

Version v1.0 (processed), released 30 Jul 2020

License

Repository

https://huggingface.co/datasets/genbio-ai/cell-downstream-tasks/tree/main/zheng

Developed By

Wenpin Hou
Zhicheng Ji
Hongkai Ji
Stephanie C. Hicks
(Johns Hopkins Bloomberg School of Public Health)

The Zheng68K dataset of human peripheral blood mononuclear cells (PBMCs) is one of the most widely used datasets for benchmarking human cell-type annotation methods. Zheng et al. originally created it by profiling ~68,000 PBMCs from a healthy donor while evaluating a high-throughput method for single-cell RNA sequencing of immune cell populations. The dataset was later processed for benchmarking studies. This data card describes a processed version of Zheng68K used to evaluate model performance on cell type classification.

Dataset Overview

Data Type

Single-cell RNA sequencing (scRNA-seq)

Citation

Publication of pre-processed dataset: Hou, W., et al. (2020) A systematic evaluation of single-cell RNA-sequencing imputation methods. Genome Biol 21: 218. DOI: 10.1186/s13059-020-02132-x
Publication of processed dataset: Ho, N., et al. (2024) Scaling Dense Representations for Single Cell with Transcriptome-Scale Context. bioRxiv 2024.11.28.625303. DOI: 10.1101/2024.11.28.625303

Dataset Card Authors

Nicholas Ho, Caleb Ellington, and Elijah Cole (GenBio AI)

Dataset Card Contact

Caleb Ellington (caleb.ellington@genbio.ai)

Uses

Primary Use Cases

Evaluating models on cell type classification
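
As an illustration of what such an evaluation might compute, the sketch below scores predicted cell type labels against reference labels using accuracy and macro-averaged F1. The choice of metrics, the function names, and the toy labels are assumptions made for this example; the card does not specify which metrics were used.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of cells whose predicted label matches the reference label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = []
    for c in sorted(set(y_true)):
        # F1 = 2TP / (2TP + FP + FN); the denominator is positive for
        # every class that occurs in y_true.
        scores.append(2 * tp[c] / (2 * tp[c] + fp[c] + fn[c]))
    return sum(scores) / len(scores)


# Hypothetical labels for six cells (not taken from the dataset).
y_true = ["B cell", "B cell", "CD14+ monocyte", "NK cell", "NK cell", "NK cell"]
y_pred = ["B cell", "NK cell", "CD14+ monocyte", "NK cell", "NK cell", "B cell"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} macro_f1={f1:.3f}")  # accuracy=0.667 macro_f1=0.722
```

Macro averaging weights every cell type equally, which can matter when some PBMC populations are much rarer than others.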

Out-of-Scope or Unauthorized Use Cases

Do not use the dataset for the following purposes:

Use that violates applicable laws, regulations (including trade compliance laws), or third party rights such as privacy or intellectual property rights.
Any discriminatory analyses.
Any use that is not in accordance with the Acceptable Use Policy.
Any use that is prohibited by the CC-BY 4.0 license.

Dataset Structure

The Zheng68K dataset from PBMCs includes scRNA-seq data obtained with the 10x Genomics GemCode platform. The processed dataset is in h5ad (AnnData) format, and samples are randomly split into train/valid/test sets without stratification.
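
A minimal sketch of a random, non-stratified split follows. The 80/10/10 fractions, the seed, and the function name `random_split` are assumptions for illustration only; the card does not state the ratios or the procedure actually used. The point is that labels play no part in the split, so rare cell types may be unevenly represented across the three sets.

```python
import random
from collections import Counter


def random_split(n_cells, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle cell indices and cut them into train/valid/test.

    Labels are never consulted, so class proportions are not preserved.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n_cells))
    random.Random(seed).shuffle(indices)
    n_train = int(fractions[0] * n_cells)
    n_valid = int(fractions[1] * n_cells)
    return {
        "train": indices[:n_train],
        "valid": indices[n_train:n_train + n_valid],
        "test": indices[n_train + n_valid:],  # remainder goes to test
    }


# Toy labels: one common and one rare population (hypothetical).
labels = ["common"] * 950 + ["rare"] * 50
splits = random_split(len(labels))

for name, idx in splits.items():
    counts = Counter(labels[i] for i in idx)
    print(name, len(idx), dict(counts))
```

Printing the per-split label counts, as above, is a quick way to check how far the rare class drifts from its overall 5% share in each set.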

Personal and Sensitive Information

The dataset was de-identified by original data generators (Zheng et al.) and does not contain personal or sensitive information.

Dataset Creation

Curation Rationale

The processed Zheng68K dataset was curated to evaluate model performance for cell type classification. The dataset was originally created to characterize immune cell heterogeneity by profiling the transcriptomes of ~68,000 PBMCs from a healthy donor using a high-throughput droplet-based single-cell RNA sequencing platform.

Source Data

The Zheng68K source dataset includes scRNA-seq data generated from human PBMCs using the 10x Genomics GemCode platform. The source dataset can be downloaded from the 68K PBMC Analysis GitHub repository. The dataset was further processed by Hou et al. to evaluate scRNA-seq imputation methods. This pre-processed dataset can be downloaded from the Imputation Benchmark GitHub repository (subject to its own license).

Who are the source data producers?

The data was generated by authors of the original study describing single-cell transcriptomes from PBMCs. Authors included researchers at 10x Genomics in collaboration with scientists at the Fred Hutchinson Cancer Research Center and the University of Washington.

Data Collection and Processing

Zheng et al. collected fresh PBMCs from a healthy donor and obtained scRNA-seq data using the 10x Genomics GemCode platform. Approximately 8,000-9,000 cells were captured in each of 8 microfluidic channels, yielding ~68,000 cells. RNA was reverse-transcribed in droplets (GEMs), and cDNA libraries were constructed using barcoded gel beads and sequenced on Illumina platforms.

Hou et al. further processed the data to evaluate imputation methods: cells with at least 500 detected genes were kept, mitochondrial genes were removed, and genes expressed in at least 1% of cells were kept. The final pre-processed dataset, which includes 65,693 PBMCs, was downloaded from the Imputation Benchmark GitHub repository.

For evaluation of AIDO.Cell and scFoundation model performance, Ho et al. aligned gene expression profiles to a reference set of 19,264 genes to further improve annotations. The final aligned, processed Zheng68K dataset used to evaluate cell type classification can be downloaded from Hugging Face.
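
The filtering and gene-alignment steps described above can be sketched on a toy count matrix as follows. This is not the code used by Hou et al. or by the card authors. The order of operations, the `MT-` prefix test for mitochondrial genes, computing the 1% threshold over the cells that survive filtering, and zero-filling reference genes absent from the data are all assumptions, and the function names are hypothetical.

```python
def filter_counts(counts, genes, min_genes=500, min_cell_frac=0.01,
                  mito_prefix="MT-"):
    """Apply the three filters to a cells x genes matrix (list of rows)."""
    # 1. Keep cells with at least `min_genes` detected (nonzero) genes.
    kept_cells = [i for i, row in enumerate(counts)
                  if sum(v > 0 for v in row) >= min_genes]
    n_kept = len(kept_cells)
    kept_genes = []
    for j, gene in enumerate(genes):
        # 2. Drop mitochondrial genes (assumed to be named "MT-...").
        if gene.upper().startswith(mito_prefix):
            continue
        # 3. Keep genes expressed in at least `min_cell_frac` of kept cells.
        n_expressing = sum(counts[i][j] > 0 for i in kept_cells)
        if n_kept and n_expressing >= min_cell_frac * n_kept:
            kept_genes.append(j)
    matrix = [[counts[i][j] for j in kept_genes] for i in kept_cells]
    return matrix, kept_cells, [genes[j] for j in kept_genes]


def align_to_reference(matrix, genes, reference_genes):
    """Reorder columns to `reference_genes`; absent genes become zeros."""
    col = {g: j for j, g in enumerate(genes)}
    return [[row[col[g]] if g in col else 0 for g in reference_genes]
            for row in matrix]


# Toy data: 3 cells x 4 genes, with a lowered cell threshold of 2 genes.
genes = ["CD3D", "MT-CO1", "CD79A", "NKG7"]
counts = [
    [5, 9, 0, 1],  # 3 detected genes -> kept
    [0, 7, 0, 0],  # 1 detected gene  -> dropped
    [2, 4, 0, 3],  # 3 detected genes -> kept
]
filtered, kept_cells, kept_genes = filter_counts(counts, genes, min_genes=2)
print(kept_cells, kept_genes, filtered)
# [0, 2] ['CD3D', 'NKG7'] [[5, 1], [2, 3]]

# A 3-gene stand-in for the 19,264-gene reference set.
reference = ["NKG7", "CD14", "CD3D"]
aligned = align_to_reference(filtered, kept_genes, reference)
print(aligned)  # [[1, 0, 5], [3, 0, 2]]
```

After alignment every cell has the same gene order and dimensionality as the reference, which is what a model with a fixed gene vocabulary would typically need.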

Annotation Process

Cell types were annotated based on unsupervised clustering (PCA, t-SNE, k-means) followed by marker gene expression analysis.
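
To illustrate the marker-gene step only, the sketch below assigns each cluster the cell type whose marker genes have the highest mean expression within that cluster. It assumes cluster assignments already exist (for example, from PCA followed by k-means). The marker lists are commonly cited canonical markers chosen for this example, not necessarily those used by Zheng et al., and `annotate_clusters` is a hypothetical helper.

```python
def annotate_clusters(expr, genes, clusters, markers):
    """Map each cluster id to the best-matching cell type.

    expr: cells x genes matrix (list of rows)
    clusters: cluster id per cell
    markers: dict of cell type -> list of marker gene names
    """
    col = {g: j for j, g in enumerate(genes)}
    members = {}
    for i, c in enumerate(clusters):
        members.setdefault(c, []).append(i)

    labels = {}
    for c, cells in members.items():
        best_type, best_score = None, float("-inf")
        for cell_type, marker_genes in markers.items():
            present = [col[g] for g in marker_genes if g in col]
            if not present:
                continue  # none of this type's markers were measured
            # Mean expression of the markers across the cluster's cells.
            total = sum(expr[i][j] for i in cells for j in present)
            score = total / (len(cells) * len(present))
            if score > best_score:
                best_type, best_score = cell_type, score
        labels[c] = best_type
    return labels


# Toy expression values for five cells in three clusters (hypothetical).
genes = ["CD3D", "CD79A", "CD14"]
expr = [
    [4, 0, 0],
    [5, 1, 0],
    [0, 6, 0],
    [0, 5, 1],
    [0, 0, 7],
]
clusters = [0, 0, 1, 1, 2]
markers = {
    "T cell": ["CD3D"],
    "B cell": ["CD79A"],
    "CD14+ monocyte": ["CD14"],
}

cluster_labels = annotate_clusters(expr, genes, clusters, markers)
print(cluster_labels)  # {0: 'T cell', 1: 'B cell', 2: 'CD14+ monocyte'}
```

A real annotation workflow would normally work on normalized expression and involve manual review, which is one reason the labels depend on annotator judgement.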

Biases, Risks, and Limitations

Potential Biases

This data was collected to answer specific biological questions unrelated to its current use in benchmarking.
The data come from a single healthy donor, who may not be representative of diverse populations.

Risks

Areas of risk include but are not limited to:

Cell type annotation can be difficult and relies on the judgement of the annotator. Some labels may be incorrect or inconsistent with labels in other datasets.

Limitations

This dataset provides a specific instance of a specific cell type classification task. High performance on this dataset does not necessarily indicate that a model will perform well on other tasks.

Caveats and Recommendations

Users should carefully consider the biological context and study design of the source dataset before drawing conclusions.
We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when using our services.

Acknowledgements

The authors acknowledge the contributions of their respective institutions and funding bodies.