Mouse Kidney Benchmark Dataset
Version v1.0, processed
Released 15 Sept 2025
License
CC BY 4.0
Developed By
CZI
The original dataset, generated by Li et al., 2022, includes single-cell combinatorial indexing RNA sequencing (sci-RNA-seq) data collected from the nuclei of 24 mouse kidneys. The kidneys were from mice subjected to either unilateral ischemia-reperfusion injury (uni-IRI) or unilateral ureteral obstruction (UUO) to model kidney fibrosis. The dataset was processed into two versions used to evaluate model performance for cross-species disease label transfer. This data card describes both versions: Mouse Kidney and Mouse Kidney - Human Orthologs.
Dataset Overview
Data Type
Single-nucleus RNA sequencing
Citation
Publication of source dataset: Li, H. et al. (2022) Comprehensive single-cell transcriptional profiling defines shared and unique epithelial injury responses during kidney fibrosis. Cell Metabolism 34(12):1977-1998.e9. DOI: 10.1016/j.cmet.2022.09.026
Source Dataset Download Date
The original dataset was downloaded on July 26, 2025.
Dataset Card Authors
Karyna Rosario Cora and Ellaine Chou (CZI)
Dataset Card Contact
Ellaine Chou (echou@chanzuckerberg.com)
Uses
Primary Use Cases
- Evaluation of model performance for cross-species disease label transfer
Out-of-Scope or Unauthorized Use Cases
Do not use the dataset for the following purposes:
- Use that violates applicable laws, regulations (including trade compliance laws), or third party rights such as privacy or intellectual property rights.
- Any use that is prohibited by the CC BY 4.0 license.
Dataset Structure
The dataset was processed for model evaluation using two distinct versions: the complete dataset (Mouse Kidney) and a version with only human-mouse gene orthologs (Mouse Kidney - Human Orthologs). Each of the processed dataset versions is in H5ad format.
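The relationship between the two versions can be illustrated with a minimal sketch. It assumes a one-to-one mouse-to-human ortholog table and keeps only the genes that appear in it; the table contents, the placeholder gene `Gm_example`, and the function name `subset_to_orthologs` are hypothetical, and the actual ortholog mapping used for this dataset is defined in the data processing scripts, not here.

```python
def subset_to_orthologs(gene_names, mouse_to_human):
    """Return the column indices of genes that have a mapped human ortholog,
    together with the corresponding human gene symbols."""
    kept_indices = []
    human_symbols = []
    for idx, gene in enumerate(gene_names):
        human = mouse_to_human.get(gene)
        if human is not None:
            kept_indices.append(idx)
            human_symbols.append(human)
    return kept_indices, human_symbols


# Hypothetical toy ortholog table (assumption, for illustration only).
mouse_to_human = {"Havcr1": "HAVCR1", "Lrp2": "LRP2", "Umod": "UMOD"}

# "Gm_example" is a made-up placeholder for a mouse gene without a mapped ortholog.
gene_names = ["Havcr1", "Gm_example", "Lrp2", "Umod"]

kept, human = subset_to_orthologs(gene_names, mouse_to_human)
print(kept)   # [0, 2, 3]
print(human)  # ['HAVCR1', 'LRP2', 'UMOD']
```

The returned indices could then be used to select the matching columns of the count matrix, which is why the ortholog version has the same number of cells but fewer genes.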
Personal and Sensitive Information
The dataset was de-identified by the original authors and only includes RNA data and associated metadata in H5ad format.
Dataset Creation
Curation Rationale
The dataset is used for model benchmarking. The dataset was curated to ensure compliance with evaluated model requirements, inclusion of required metadata, format consistency, and compatibility across datasets used for model evaluation.
Source Data
The raw reads were downloaded from SRA (BioProject PRJNA788883).
Who are the source data producers?
The dataset was generated by Li et al., 2022.
Data Collection and Processing
The original single-nucleus RNA sequencing data were generated on a single NovaSeq 6000 flow cell, enabling the authors to profile over 300,000 cells from 24 kidneys in a single experiment. The raw reads were downloaded from SRA and processed to generate H5ad files that match the original H5ad file published by Li et al. in GEO (GEO Series accession number GSE190887). After demultiplexing, gene counts were generated using STARsolo (configured for nuclear data), following guidelines for sci-RNA-seq3 data. The resulting counts were then combined into a final count matrix and filtered to match the original cell population reported by Li et al., 2022. Briefly, the dataset was processed to:
- remove cells with fewer than 1000 reads
- remove genes with fewer than 1000 reads across cells
- remove cells with fewer than 1000 genes across cells per donor
- curate disease prediction labels
- test dataset compatibility with evaluated models
After curation and filtering, 27560 cells remained (see data processing scripts).
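The two read-count thresholds listed above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the processing script itself: the function name `filter_counts`, the dense list-of-lists matrix layout, and the order of operations (cells first, then genes computed over the remaining cells) are assumptions. The per-donor criterion, label curation, and model compatibility checks are not reproduced here.

```python
def filter_counts(counts, min_reads_per_cell=1000, min_reads_per_gene=1000):
    """Filter a dense count matrix given as a list of rows (cells),
    each row holding one read count per gene.

    Returns (filtered_counts, kept_cell_indices, kept_gene_indices).
    """
    # Step 1: remove cells with fewer than min_reads_per_cell total reads.
    kept_cells = [
        i for i, row in enumerate(counts) if sum(row) >= min_reads_per_cell
    ]

    # Step 2: remove genes with fewer than min_reads_per_gene total reads,
    # summed over the cells that survived step 1 (ordering is an assumption).
    n_genes = len(counts[0]) if counts else 0
    kept_genes = [
        j
        for j in range(n_genes)
        if sum(counts[i][j] for i in kept_cells) >= min_reads_per_gene
    ]

    filtered = [[counts[i][j] for j in kept_genes] for i in kept_cells]
    return filtered, kept_cells, kept_genes


# Toy matrix: 3 cells x 3 genes (made-up numbers).
toy_counts = [
    [900, 300, 5],   # 1205 reads in total: kept
    [400, 100, 2],   # 502 reads in total: removed
    [700, 800, 10],  # 1510 reads in total: kept
]

filtered, kept_cells, kept_genes = filter_counts(toy_counts)
print(kept_cells)  # [0, 2]
print(kept_genes)  # [0, 1]
print(filtered)    # [[900, 300], [700, 800]]
```

In practice the same logic would be applied to the sparse matrix stored in the H5ad file; the dense toy matrix is used here only to keep the example self-contained.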
The processed datasets can be downloaded from AWS:
- Mouse Kidney (27560 cells, 8060 genes):
s3://cz-benchmarks-data/datasets/v1/kidney_disease/GSE190887/GSE190887_recurated_benchmark_v1.0.h5ad
- Mouse Kidney - Human Orthologs (27560 cells, 7265 genes):
s3://cz-benchmarks-data/datasets/v1/kidney_disease/GSE190887/GSE190887_recurated_human_orthologs_benchmark_v1.0.h5ad
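As a convenience, the sketch below splits one of the S3 URIs above into its bucket and object key and assembles an AWS CLI download command as a list of arguments, without running it. The helper names are hypothetical, and the use of `--no-sign-request` assumes the bucket permits anonymous reads, which this data card does not state; drop that flag if credentials are required.

```python
from urllib.parse import urlparse

MOUSE_KIDNEY_URI = (
    "s3://cz-benchmarks-data/datasets/v1/kidney_disease/GSE190887/"
    "GSE190887_recurated_benchmark_v1.0.h5ad"
)


def split_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


def aws_cli_download_command(uri, dest_dir="."):
    """Build (but do not execute) an `aws s3 cp` command for the given URI.

    Assumption: anonymous access is allowed, hence --no-sign-request.
    """
    return ["aws", "s3", "cp", uri, dest_dir, "--no-sign-request"]


bucket, key = split_s3_uri(MOUSE_KIDNEY_URI)
file_name = key.rsplit("/", 1)[-1]
print(bucket)     # cz-benchmarks-data
print(file_name)  # GSE190887_recurated_benchmark_v1.0.h5ad
print(" ".join(aws_cli_download_command(MOUSE_KIDNEY_URI)))
```

The printed command can be pasted into a shell, or the argument list can be passed to `subprocess.run` on a machine with the AWS CLI installed.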
Annotation process
The processed dataset retains the cell types and sample annotations originally reported by Li et al., 2022. Li et al. initially identified 19 major cell clusters through unsupervised clustering and annotated the data based on the expression of known marker genes. The authors then performed a more detailed subclustering analysis on the major cell clusters to identify a total of 50 cell types and states. The annotations were confirmed by integrating the dataset with an scRNA-seq dataset from a different mouse model of kidney injury (bi-IRI), inspecting the expression of lineage-specific genes, and using an external computational framework for visualization and validation (see paper for details). For model benchmarking purposes, the cell types were mapped to Cell Ontology (CL) terms and downstream analyses were carried out in accordance with CELLxGENE schema v6.0.0.
Who are the annotators?
The dataset was annotated by the original data generators (see Li et al., 2022). Additional annotations were provided by CZI.
Biases, Risks, and Limitations
Potential Biases
- Technical biases inherent to single-nucleus RNA sequencing technologies may be present.
Limitations
- The processed dataset only captures gene expression from mouse kidney cells collected from adult male mice and represents two fibrogenesis models, uni-IRI and UUO. This limits the generalizability of the dataset.
Caveats and Recommendations
- Users should carefully consider the biological context and study design of the dataset before drawing conclusions.
- We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when using our services.
Acknowledgements
We thank the original authors and contributors for their efforts in assembling this valuable mouse kidney dataset and making it accessible to the community.