Benchmarking Dataset from Tabula Sapiens v2

Version v1.0, processed, released 26 Mar 2025

Tabula Sapiens is a reference human cell atlas containing single-cell transcriptomic data for over 1.1 million cells from 28 tissues collected across 24 healthy human subjects. This data card describes a benchmarking dataset derived from Tabula Sapiens v2, the dataset generated to expand the original Tabula Sapiens 1.0 atlas into Tabula Sapiens 2.0. The processed Tabula Sapiens v2 dataset described here includes over 500,000 cells representing 26 tissues sampled from male (n = 2) and female (n = 7) donors.

Dataset Overview

Data Type

Single-cell RNA sequencing

Citation

Publication of source data: Tabula Sapiens Consortium et al. (2024) Tabula Sapiens reveals transcription factor expression, senescence effects, and sex-specific features in cell types from 28 human organs and tissues. bioRxiv 2024.12.03.626516; DOI: https://doi.org/10.1101/2024.12.03.626516.

Source Data Download Date

The original Tabula Sapiens v2 dataset was downloaded on February 7, 2025.

Dataset Card Authors

Karyna Rosario Cora and Ellaine Chou (CZI)

Dataset Card Contact

Ellaine Chou (echou@chanzuckerberg.com)

Uses

Primary Use Cases

  • Evaluation of model performance for cell clustering, classification, and metadata prediction (e.g., cell type) based on gene expression counts.
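
As an illustration of the classification use case, the sketch below scores predicted cell type labels against reference labels. The card does not prescribe specific metrics; accuracy and macro-averaged F1 are assumptions chosen here because they are common for imbalanced cell type labels. The function names and toy labels are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of cells whose predicted label matches the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1, so rare cell types count equally."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels for illustration only (not taken from the dataset).
toy_true = ["T cell", "T cell", "B cell", "B cell", "NK cell"]
toy_pred = ["T cell", "B cell", "B cell", "B cell", "NK cell"]

toy_accuracy = accuracy(toy_true, toy_pred)
toy_macro_f1 = macro_f1(toy_true, toy_pred)
print(f"accuracy={toy_accuracy:.3f} macro_f1={toy_macro_f1:.3f}")
```

Macro averaging is used in this sketch so that abundant cell types do not dominate the score; a different weighting may suit other evaluation goals.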

Out-of-Scope or Unauthorized Use Cases

Do not use the dataset for the following purposes:

  • Use that violates applicable laws, regulations (including trade compliance laws), or third party rights such as privacy or intellectual property rights.
  • Any use that is prohibited by the CC-BY 4.0 license.

Intended Users

  • Model developers, researchers, and scientists

Dataset Structure

Tabula Sapiens data exists in two versions. The first version, v1, was used to generate the original Tabula Sapiens 1.0 reference cell atlas (see publication), while v2 was used to expand the atlas to the more comprehensive Tabula Sapiens 2.0. The Tabula Sapiens v2 dataset includes 10X and SmartSeq single-cell RNA sequencing data representing 27 tissues from 9 donors. The data is in H5ad format and is split by tissue (as defined by CELLxGENE) and donor.
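
The tissue-and-donor split can be pictured with a minimal sketch that groups cell-level metadata records by (tissue, donor). It uses plain Python on toy records rather than the H5ad files themselves; the field names `tissue` and `donor_id` are assumed to mirror the per-cell metadata columns, and the function name and identifiers are hypothetical.

```python
from collections import defaultdict


def split_by_tissue_and_donor(cells):
    """Group cell identifiers by their (tissue, donor_id) pair."""
    groups = defaultdict(list)
    for cell in cells:
        groups[(cell["tissue"], cell["donor_id"])].append(cell["cell_id"])
    return dict(groups)


# Toy metadata; all identifiers are made up for illustration.
toy_cells = [
    {"cell_id": "c1", "tissue": "lung", "donor_id": "D1"},
    {"cell_id": "c2", "tissue": "lung", "donor_id": "D2"},
    {"cell_id": "c3", "tissue": "liver", "donor_id": "D1"},
    {"cell_id": "c4", "tissue": "lung", "donor_id": "D1"},
]

splits = split_by_tissue_and_donor(toy_cells)
for (tissue, donor), cell_ids in sorted(splits.items()):
    print(tissue, donor, len(cell_ids))
```

Each resulting group corresponds to one tissue from one donor, which is the unit at which the card says the data is split.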

Personal and Sensitive Information

The Tabula Sapiens v2 dataset only includes de-identified single-cell RNA sequencing data. The de-identified data was directly retrieved from the Tabula Sapiens collection in CELLxGENE.

Dataset Creation

Curation Rationale

The Tabula Sapiens v2 dataset was used for model benchmarking and evaluation. The dataset was therefore curated to include the required metadata, to remove the Tabula Sapiens v1 samples that had been seen in model training, and to keep formats consistent and compatible across the datasets used for model evaluation.

Source Data

The original data was downloaded in H5ad format from the CELLxGENE Tabula Sapiens collection, where the source file is available for direct download.

Who are the source data producers?

The Tabula Sapiens reference dataset was created by the Tabula Sapiens Consortium, a team of more than 160 experts led by scientists at the Chan Zuckerberg Biohub San Francisco.

Data Collection and Processing

The Tabula Sapiens v2 dataset was downloaded in H5ad format. Gene count tables were combined with required metadata variables using the Scanpy Python package. Briefly, the dataset was processed to:

  • remove ambiguous or unannotated cell types (i.e., cell_type = ["cell", "unassigned", "unknown", "Unclassified"])
  • remove genes with zero counts across cells
  • remove cells with fewer than 10 detected genes (applied per donor)
  • strip any suffixes from Ensembl IDs
  • replace normalized counts with raw counts
  • validate raw counts
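
The filtering steps above can be sketched in plain Python on a toy cells-by-genes matrix for a single donor. This is not the data processing script, which the card says uses Scanpy; the function names are hypothetical, the Ensembl suffix is assumed to be a dot-separated version, and the step that swaps normalized counts for raw counts is not modeled (the input is assumed to hold raw counts already).

```python
AMBIGUOUS_CELL_TYPES = {"cell", "unassigned", "unknown", "Unclassified"}


def strip_ensembl_suffix(gene_id):
    """Keep the stable Ensembl ID, dropping anything after the first dot."""
    return gene_id.split(".", 1)[0]


def is_raw_counts(rows):
    """Raw counts are expected to be non-negative whole numbers."""
    return all(v >= 0 and float(v).is_integer() for row in rows for v in row)


def curate(counts, cell_types, gene_ids, min_genes=10):
    """Apply the listed filters to one donor's cells-by-genes count matrix."""
    if not is_raw_counts(counts):
        raise ValueError("matrix does not look like raw counts")

    # 1. Drop cells whose cell type annotation is ambiguous or missing.
    cells = [i for i, ct in enumerate(cell_types) if ct not in AMBIGUOUS_CELL_TYPES]

    # 2. Drop genes with zero counts across the remaining cells.
    genes = [j for j in range(len(gene_ids)) if any(counts[i][j] > 0 for i in cells)]

    # 3. Drop cells with fewer than `min_genes` detected genes.
    cells = [i for i in cells if sum(counts[i][j] > 0 for j in genes) >= min_genes]

    return {
        "counts": [[counts[i][j] for j in genes] for i in cells],
        "cell_types": [cell_types[i] for i in cells],
        "gene_ids": [strip_ensembl_suffix(gene_ids[j]) for j in genes],
    }


# Toy input; the version suffixes are illustrative.
toy_counts = [
    [3, 0, 1, 0],  # T cell, 2 detected genes
    [0, 0, 5, 0],  # ambiguous label, removed in step 1
    [1, 0, 0, 0],  # B cell, 1 detected gene, removed in step 3
    [2, 0, 4, 0],  # B cell, 2 detected genes
]
toy_cell_types = ["T cell", "unknown", "B cell", "B cell"]
toy_gene_ids = ["ENSG00000000003.1", "ENSG00000000005.1",
                "ENSG00000000419.1", "ENSG00000000457.1"]

# A threshold of 2 keeps the toy example small; the card uses 10.
curated = curate(toy_counts, toy_cell_types, toy_gene_ids, min_genes=2)
print(curated)
```

The sketch validates counts before filtering for simplicity, whereas the card lists validation as the final step; on integer input the order does not change the result.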

Over 500,000 cells representing 26 tissues remained after curation and filtering (see data processing script). The pancreas is the only tissue not represented in the processed Tabula Sapiens v2 dataset due to low cell counts (< 50 cells post-filtering).

Annotation Process

The Tabula Sapiens reference cell atlas was annotated by a large group of experts using CELLxGENE. Each data object contained three main components: gene count data, cell-wise metadata, and gene-wise metadata for the organ of interest. The group of experts used a defined cell ontology terminology to annotate cell types consistently across tissues. Quality control was performed on manual annotations using PopV.

Who are the annotators?

Researchers from the Tabula Sapiens Consortium were the annotators of the source dataset.

Biases, Risks, and Limitations

Potential Biases

  • The dataset may overrepresent certain tissue types.
  • The dataset mainly includes samples from female donors (7 female donors vs. 2 male donors).
  • Technical biases inherent to single-cell RNA sequencing technology, such as dropout events or under-detection of low-abundance transcripts, may be present.

Limitations

  • The dataset is limited in scope and does not capture all cell types or conditions present in human tissues.
  • The dataset is focused on single-cell gene expression and does not include additional data types (e.g., protein expression or open chromatin information).

Caveats and Recommendations

  • Users should carefully consider the biological context and study design of the contributing datasets before drawing conclusions.
  • We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when using our services.

Acknowledgements

We acknowledge the CZ CELLxGENE and the Stanford Lattice teams for processing and curating the Tabula Sapiens data. The Tabula Sapiens reference cell atlas was funded by grants from the Chan Zuckerberg Initiative and Silicon Valley Community Foundation and supported by the Chan Zuckerberg Biohub San Francisco. Thanks to all the donors and organizations that contributed to sample processing and data collection, including the Donor Network West and UCSF Liver Center.