Human Kidney Benchmark Dataset

Version v1.0, processed, released 15 Sept 2025

The original dataset, generated by Lake et al. (2023), includes single-cell (sc) and single-nucleus (sn) RNA sequencing data collected from healthy reference kidneys (45 donors) and from the kidneys of 48 patients with acute kidney failure or chronic kidney disease. The processed dataset described here is used to evaluate model performance on cross-species disease label transfer.

Dataset Overview

Data Type

Single-cell and single-nucleus RNA sequencing

Citation

Publication of source dataset: Lake, B.B. et al. (2023). An atlas of healthy and injured cell states and niches in the human kidney. Nature 619: 585-594. DOI: 10.1038/s41586-023-05769-3

Source Dataset Download Date

The original dataset was downloaded on May 12, 2025.

Dataset Card Authors

Karyna Rosario Cora and Ellaine Chou (CZI)

Dataset Card Contact

Ellaine Chou (echou@chanzuckerberg.com)

Uses

Primary Use Cases

  • Evaluation of model performance for cross-species disease label transfer

Out-of-Scope or Unauthorized Use Cases

Do not use the dataset for the following purposes:

  • Use that violates applicable laws, regulations (including trade compliance laws), or third party rights such as privacy or intellectual property rights.
  • Any use that is prohibited by the CC-BY 4.0 license.

Dataset Structure

The processed dataset is in H5ad format.

Personal and Sensitive Information

The dataset was de-identified by the original authors and includes only RNA data and associated metadata in H5ad format.

Dataset Creation

Curation Rationale

The dataset is used for model benchmarking. It was curated to meet the requirements of the evaluated models, to include the required metadata, and to keep format and compatibility consistent across the datasets used for model evaluation.

Source Data

The original data was downloaded in H5ad format from CZ CELLxGENE, where the source file is available for direct download.

Who are the source data producers?

The original dataset was generated by Lake et al., 2023.

Data Collection and Processing

The dataset includes single-cell and single-nucleus RNA sequencing data generated with droplet-based transcriptomic assays (Chromium v3). RNA data from 304,652 cells was downloaded in H5ad format, and gene count tables were combined with the required metadata variables. Briefly, the dataset was processed to:

  • remove cells with fewer than 1000 reads
  • remove genes with fewer than 1000 reads across cells
  • remove cells with fewer than 1000 genes across cells per donor
  • curate disease prediction labels
  • test dataset compatibility with evaluated models

After curation and filtering, 208,354 of the original 304,652 cells remained (see data processing script).
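
The read-count and gene-count filters listed above can be sketched on a plain cells-by-genes count matrix. This is a minimal illustration under stated assumptions, not the data processing script itself: the function name `filter_counts`, its parameters, the sequential order of the steps, and the use of NumPy rather than an AnnData-based workflow are all hypothetical. In particular, the third filter is read here as "keep cells with at least 1000 detected genes"; the per-donor aspect of that step is not modeled.

```python
import numpy as np


def filter_counts(counts, min_cell_reads=1000, min_gene_reads=1000,
                  min_genes_per_cell=1000):
    """Apply three threshold filters to a cells x genes raw count matrix.

    Hypothetical helper: thresholds default to the values named in the
    dataset card, but the step order and the reading of the third filter
    are assumptions.
    Returns (filtered_matrix, cell_mask, gene_mask).
    """
    counts = np.asarray(counts)

    # Step 1: keep cells with at least `min_cell_reads` total reads.
    cell_keep = counts.sum(axis=1) >= min_cell_reads

    # Step 2: keep genes with at least `min_gene_reads` total reads,
    # summed over the cells that passed step 1.
    gene_keep = counts[cell_keep].sum(axis=0) >= min_gene_reads

    # Step 3 (assumed reading): keep cells with at least
    # `min_genes_per_cell` detected (non-zero) genes among the kept genes.
    detected = (counts[:, gene_keep] > 0).sum(axis=1)
    cell_keep = cell_keep & (detected >= min_genes_per_cell)

    return counts[np.ix_(cell_keep, gene_keep)], cell_keep, gene_keep


# Toy example with small thresholds so the effect is visible by eye.
toy = [[5, 0, 1],
       [0, 0, 1],
       [3, 0, 0],
       [4, 1, 4]]
filtered, cells, genes = filter_counts(toy, min_cell_reads=3,
                                       min_gene_reads=3,
                                       min_genes_per_cell=2)
print(filtered.shape)  # (2, 2)
```

Applying the same filters in a different order can change which cells and genes survive, so the order used here should be treated as one plausible choice rather than the documented procedure.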

Annotation process

Lake et al. (2023) originally annotated the data through a multi-step process that began with unsupervised clustering of the snRNA sequencing data. These initial clusters were then assigned to 77 cell subclasses using known cell type markers and regional information from the kidney. The authors used the snRNA data as a reference to integrate and transfer annotations to the scRNA sequencing and SNARE-seq2 data. This integration helped ensure the accuracy and consistency of the cell annotations across technologies and was validated through correlations with published datasets and anatomical locations. The authors also defined and annotated "altered states" to categorize cells affected by injury or disease. See the paper for details. For benchmarking purposes, additional annotations were added following CZ CELLxGENE data standards (see CZ CELLxGENE Documentation).

Who are the annotators?

The dataset was annotated by the original data generators (see Lake et al., 2023). Additional annotations were provided by the Stanford Lattice team.

Biases, Risks, and Limitations

Potential Biases

  • Technical biases inherent to single-cell and single-nucleus RNA sequencing technologies may be present.

Limitations

  • The processed dataset captures gene expression only from kidney cells, which limits its generalizability.

Caveats and Recommendations

  • Users should carefully consider the biological context and study design of the dataset before drawing conclusions.
  • We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when using our services.

Acknowledgements

We acknowledge all authors and contributors of the original dataset for making this valuable resource available to the community.