Mouse Kidney Benchmark Dataset

Version v1.0, processed, released 15 Sept 2025

The original dataset generated by Li et al., 2022 includes single-cell combinatorial indexing RNA sequencing (sci-RNA-seq) data collected from the nuclei of 24 mouse kidneys. The kidneys were from mice subjected to either unilateral ischemia-reperfusion injury (uni-IRI) or unilateral ureteral obstruction (UUO) to model kidney fibrosis. The dataset was processed to obtain two versions used to evaluate model performance for cross-species disease label transfer. This data card describes two variations of the processed dataset: Mouse Kidney and Mouse Kidney - Human Orthologs.

Dataset Overview

Data Type

Single-nucleus RNA sequencing

Citation

Publication of source dataset: Li, H. et al. (2022) Comprehensive single-cell transcriptional profiling defines shared and unique epithelial injury responses during kidney fibrosis. Cell Metabolism 34(12):1977-1998.e9. DOI: 10.1016/j.cmet.2022.09.026

Source Dataset Download Date

The original dataset was downloaded on July 26, 2025.

Dataset Card Authors

Karyna Rosario Cora and Ellaine Chou (CZI)

Dataset Card Contact

Ellaine Chou (echou@chanzuckerberg.com)

Uses

Primary Use Cases

  • Evaluation of model performance for cross-species disease label transfer

Out-of-Scope or Unauthorized Use Cases

Do not use the dataset for the following purposes:

  • Use that violates applicable laws, regulations (including trade compliance laws), or third party rights such as privacy or intellectual property rights.
  • Any use that is prohibited by the CC-BY 4.0 license.

Dataset Structure

The dataset was processed for model evaluation using two distinct versions: the complete dataset (Mouse Kidney) and a version with only human-mouse gene orthologs (Mouse Kidney - Human Orthologs). Each of the processed dataset versions is in H5ad format.
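The card does not state how the ortholog-only version was derived. As a rough illustration of what restricting a gene list to human-mouse orthologs can look like, the sketch below keeps only genes with an unambiguous human counterpart. The function name, the toy mapping table, and the one-to-one rule are all assumptions made for this example, not the procedure used to build the released file.

```python
def subset_to_orthologs(mouse_genes, mouse_to_human):
    """Keep mouse genes whose human ortholog is not shared with another
    gene in the list.

    Returns (kept_mouse_genes, human_symbols), both in input order.
    """
    # Count how many mouse genes in this list point at each human gene.
    target_counts = {}
    for gene in mouse_genes:
        human = mouse_to_human.get(gene)
        if human is not None:
            target_counts[human] = target_counts.get(human, 0) + 1

    kept, renamed = [], []
    for gene in mouse_genes:
        human = mouse_to_human.get(gene)
        # Drop genes with no ortholog or with an ambiguous (many-to-one) target.
        if human is not None and target_counts[human] == 1:
            kept.append(gene)
            renamed.append(human)
    return kept, renamed


# Toy, hypothetical mapping table and gene list, for illustration only.
toy_map = {"Havcr1": "HAVCR1", "Umod": "UMOD", "Ccl21a": "CCL21", "Ccl21b": "CCL21"}
toy_genes = ["Havcr1", "Umod", "Gm12345", "Ccl21a", "Ccl21b"]
kept, renamed = subset_to_orthologs(toy_genes, toy_map)
```

In practice the mapping would come from an ortholog resource of the user's choice, and the same index would be applied to the columns of the count matrix.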

Personal and Sensitive Information

The dataset was de-identified by the original authors and only includes RNA data and associated metadata in H5ad format.

Dataset Creation

Curation Rationale

The dataset is used for model benchmarking. The dataset was curated to ensure compliance with evaluated model requirements, inclusion of required metadata, format consistency, and compatibility across datasets used for model evaluation.

Source Data

The raw reads were downloaded from SRA (BioProject PRJNA788883).

Who are the source data producers?

The dataset was generated by Li et al., 2022.

Data Collection and Processing

The original single-nucleus RNA sequencing libraries were sequenced on a single NovaSeq 6000 flow cell, enabling the authors to profile over 300,000 cells from 24 kidneys in a single experiment. The raw reads were downloaded from SRA and processed to generate H5ad files that match the original H5ad file published by Li et al. in GEO (GEO Series accession number GSE190887). After demultiplexing, gene counts were generated with STARsolo (configured for nuclear data), following guidelines for sci-RNA-seq3 data. The resulting counts were then combined into a final count matrix and filtered to match the original cell population reported by Li et al., 2022. Briefly, the dataset was processed to:

  • remove cells with fewer than 1000 reads
  • remove genes with fewer than 1000 reads across cells
  • remove cells with fewer than 1000 genes across cells per donor
  • curate disease prediction labels
  • test dataset compatibility with evaluated models

After curation and filtering, 27560 cells remained (see data processing scripts).
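The first two thresholds in the list above can be sketched with NumPy on a dense toy matrix. This is a minimal illustration, not the released processing script: it assumes matrix entries are per-gene counts, that cells are filtered before genes, and that gene totals are recomputed after the cell filter. The function name and the toy values are hypothetical.

```python
import numpy as np


def filter_counts(counts, min_cell_counts=1000, min_gene_counts=1000):
    """Apply per-cell, then per-gene, total-count thresholds.

    counts: 2-D array-like of shape (cells, genes).
    Returns (filtered_matrix, kept_cell_mask, kept_gene_mask).
    """
    counts = np.asarray(counts)
    # Cells: keep those whose total count reaches the threshold.
    keep_cells = counts.sum(axis=1) >= min_cell_counts
    filtered = counts[keep_cells]
    # Genes: totals are recomputed over the remaining cells only
    # (an assumption about the order of operations).
    keep_genes = filtered.sum(axis=0) >= min_gene_counts
    return filtered[:, keep_genes], keep_cells, keep_genes


# Toy matrix: 3 cells x 3 genes, values chosen only for illustration.
toy = np.array([
    [600, 500, 0],
    [10, 5, 1],
    [900, 200, 900],
])
filtered, keep_cells, keep_genes = filter_counts(toy)
```

For a real H5ad file the matrix is typically sparse, and the equivalent thresholds are usually applied with `scanpy.pp.filter_cells` and `scanpy.pp.filter_genes` rather than by hand.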

The processed datasets can be downloaded from AWS:

  • Mouse Kidney (27560 cells, 8060 genes): s3://cz-benchmarks-data/datasets/v1/kidney_disease/GSE190887/GSE190887_recurated_benchmark_v1.0.h5ad
  • Mouse Kidney - Human Orthologs (27560 cells, 7265 genes): s3://cz-benchmarks-data/datasets/v1/kidney_disease/GSE190887/GSE190887_recurated_human_orthologs_benchmark_v1.0.h5ad
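One possible way to fetch these files from Python is sketched below. It assumes the bucket allows anonymous (unsigned) reads and that `boto3` is installed; neither is stated in this card. The helper names are hypothetical, and the AWS CLI is an equally valid route.

```python
from urllib.parse import urlsplit


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parts = urlsplit(uri)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parts.netloc, parts.path.lstrip("/")


def download_h5ad(uri, dest):
    """Download one object anonymously.

    Assumes the bucket permits public reads (not confirmed by this card).
    """
    # Imported lazily so the URI helper works without boto3 installed.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    bucket, key = split_s3_uri(uri)
    client = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    client.download_file(bucket, key, dest)
    return dest


MOUSE_KIDNEY_URI = (
    "s3://cz-benchmarks-data/datasets/v1/kidney_disease/GSE190887/"
    "GSE190887_recurated_benchmark_v1.0.h5ad"
)
bucket, key = split_s3_uri(MOUSE_KIDNEY_URI)
```

Once a file is downloaded, `anndata.read_h5ad(dest)` returns an object whose `shape` can be compared against the cell and gene counts listed above.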

Annotation process

The processed dataset retains the cell types and sample annotations originally reported by Li et al., 2022. Li et al. initially identified 19 major cell clusters through unsupervised clustering and annotated the data based on the expression of known marker genes. The authors then performed a more detailed subclustering analysis on the major cell clusters to identify a total of 50 cell types and states. The annotations were confirmed by integrating the dataset with an scRNA-seq dataset from a different mouse model of kidney injury (bi-IRI), inspecting the expression of lineage-specific genes, and using an external computational framework for visualization and validation (see paper for details). For model benchmarking purposes, the cell types were mapped to Cell Ontology (CL) terms, and downstream analyses were carried out in accordance with CELLxGENE schema v6.0.0.
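Mapping author labels to CL terms amounts to a lookup plus a check for labels that have no entry. The sketch below illustrates that idea only. The short labels are hypothetical stand-ins rather than the labels used by Li et al., and the three-entry table is not the mapping applied to this dataset. The CL identifiers themselves are real Cell Ontology terms.

```python
def map_to_cl(labels, label_to_cl):
    """Return (term_ids, unmapped).

    term_ids holds one CL ID (or None) per input label; unmapped is the
    sorted list of distinct labels with no entry in the mapping.
    """
    term_ids = [label_to_cl.get(label) for label in labels]
    unmapped = sorted({label for label in labels if label not in label_to_cl})
    return term_ids, unmapped


# Hypothetical author-style labels mapped to real Cell Ontology IDs.
toy_label_to_cl = {
    "Fib": "CL:0000057",    # fibroblast
    "EC": "CL:0000115",     # endothelial cell
    "Macro": "CL:0000235",  # macrophage
}
term_ids, unmapped = map_to_cl(["Fib", "EC", "Fib", "Novel state"], toy_label_to_cl)
```

Under the CELLxGENE schema the resulting identifiers are stored per cell in the `cell_type_ontology_term_id` column, so any label reported as unmapped needs a decision before the file can validate.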

Who are the annotators?

The dataset was annotated by the original data generators (see Li et al., 2022). Additional annotations were provided by CZI.

Biases, Risks, and Limitations

Potential Biases

  • Technical biases inherent to single-nucleus RNA sequencing technologies may be present.

Limitations

  • The processed dataset only captures gene expression from mouse kidney cells collected from adult male mice and represents two fibrogenesis models, uni-IRI and UUO. This limits the generalizability of the dataset.

Caveats and Recommendations

  • Users should carefully consider the biological context and study design of the dataset before drawing conclusions.
  • We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when using our services.

Acknowledgements

We thank the original authors and contributors for their efforts in assembling this valuable mouse kidney dataset and making it accessible to the community.