Human Kidney Benchmark Dataset
Version v1.0, processed
released 15 Sept 2025
License
CC BY 4.0
Developed By
CZI
The original dataset generated by Lake et al., 2023 includes single-cell (sc) and single-nucleus (sn) RNA sequencing data collected from healthy reference kidneys (45 donors) and kidneys from 48 patients with acute kidney failure or chronic kidney disease. The processed dataset described here is used to evaluate model performance on cross-species disease label transfer.
Dataset Overview
Data Type
Single-cell and single-nucleus RNA sequencing
Citation
Publication of source dataset: Lake, B.B. et al. (2023) An atlas of healthy and injured cell states and niches in the human kidney. Nature 619: 585-594. DOI: 10.1038/s41586-023-05769-3
Source Dataset Download Date
The original dataset was downloaded on May 12, 2025.
Dataset Card Authors
Karyna Rosario Cora and Ellaine Chou (CZI)
Dataset Card Contact
Ellaine Chou (echou@chanzuckerberg.com)
Uses
Primary Use Cases
- Evaluation of model performance for cross-species disease label transfer
Out-of-Scope or Unauthorized Use Cases
Do not use the dataset for the following purposes:
- Use that violates applicable laws, regulations (including trade compliance laws), or third party rights such as privacy or intellectual property rights.
- Any use that is prohibited by the CC-BY 4.0 license.
Dataset Structure
The processed dataset is in H5ad format.
Personal and Sensitive Information
The dataset was de-identified by the original authors and includes only RNA data and associated metadata in H5ad format.
Dataset Creation
Curation Rationale
The dataset is used for model benchmarking. The dataset was curated to ensure compliance with evaluated model requirements, inclusion of required metadata, format consistency, and compatibility across datasets used for model evaluation.
Source Data
The original data was downloaded in H5ad format from CZ CELLxGENE. The source data can be directly downloaded by clicking this CZ CELLxGENE file link.
Who are the source data producers?
The original dataset was generated by Lake et al., 2023.
Data Collection and Processing
The dataset includes single-cell and single-nucleus RNA sequencing data obtained through droplet-based transcriptomic assays (Chromium v3). RNA data from 304,652 cells were downloaded in H5ad format, and gene count tables were combined with the required metadata variables. Briefly, the dataset was processed to:
- remove cells with fewer than 1000 reads
- remove genes with fewer than 1000 reads across cells
- remove cells with fewer than 1000 genes across cells per donor
- curate disease prediction labels
- test dataset compatibility with evaluated models
After curation and filtering, 208,354 cells remained (see data processing script).
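The first two filtering steps listed above can be illustrated with a minimal sketch. It is not the data processing script itself. It assumes a dense cells-by-genes count matrix, that "reads" means summed counts, that cells are filtered before genes, and that a value equal to the threshold is kept. The function name `filter_counts` and the toy values are hypothetical. The per-donor criterion is not sketched because its exact definition is given only in the data processing script.

```python
import numpy as np


def filter_counts(counts, min_cell_reads=1000, min_gene_reads=1000):
    """Filter a cells x genes count matrix by total counts.

    Assumption: cells are filtered first, then genes are filtered
    using only the cells that remain. The actual script may differ.
    """
    counts = np.asarray(counts)

    # Keep cells whose total count across all genes meets the threshold.
    cell_mask = counts.sum(axis=1) >= min_cell_reads
    kept_cells = counts[cell_mask]

    # Keep genes whose total count across the remaining cells meets the threshold.
    gene_mask = kept_cells.sum(axis=0) >= min_gene_reads

    return kept_cells[:, gene_mask], cell_mask, gene_mask


# Toy matrix with made-up values: 3 cells (rows) x 3 genes (columns).
toy = np.array([
    [600, 500, 0],
    [10, 20, 5],
    [900, 400, 3],
])
filtered, cell_mask, gene_mask = filter_counts(toy)
print(filtered.shape)  # (2, 1)
```

For an H5ad file, the same masks could in principle be applied to the expression matrix and its cell and gene metadata together, so that the annotations stay aligned with the retained rows and columns.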
Annotation process
Lake et al., 2023 originally annotated the data through a multi-step process that began with unsupervised clustering of the snRNA sequencing data. These initial clusters were then assigned to 77 different cell subclasses using known cell type markers and regional information from the kidney. The authors used the snRNA data as a reference to integrate and transfer annotations to the scRNA sequencing and SNARE-seq2 data. This integration helped ensure the accuracy and consistency of the cell annotations across technologies and was validated through correlations with published datasets and anatomical locations. The authors also defined and annotated "altered states" to categorize cells affected by injury or disease. See the paper for details. For benchmarking purposes, additional annotations were added following CZ CELLxGENE data standards (see CZ CELLxGENE Documentation).
Who are the annotators?
The dataset was annotated by the original data generators (see Lake et al., 2023). Additional annotations were provided by the Stanford Lattice team.
Biases, Risks, and Limitations
Potential Biases
- Technical biases inherent to single-cell and single-nucleus RNA sequencing technologies may be present.
Limitations
- The processed dataset only captures gene expression from kidney cells. This limits the generalizability of the dataset.
Caveats and Recommendations
- Users should carefully consider the biological context and study design of the dataset before drawing conclusions.
- We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when using our services.
Acknowledgements
We acknowledge all authors and contributors of the original dataset for making this valuable resource available to the community.