CZ CELLxGENE Discover Census

Version 2024-07-01, released 01 Jul 2024

License

Creative Commons Attribution (CC-BY 4.0)

Repository

https://github.com/chanzuckerberg/cellxgene-census

Developed By

Chan Zuckerberg Initiative

This dataset is a large-scale, single-cell RNA sequencing (scRNA-seq) resource that integrates data from over 44 million primary human cells and 16 million mouse cells. The dataset is designed to support large-scale computational analysis of cellular diversity and dynamics in humans. It enables researchers to query and explore cellular gene expression profiles across multiple biological contexts (e.g. ancestry, age, tissue, and cell type), providing insights into cell type-specific gene expression, tissue organization, and disease mechanisms.

Dataset Overview

Data Type

Single-cell RNA sequencing (scRNA-seq) and spatial transcriptomics data.

Citation

Publication: CZ CELL×GENE Discover: A single-cell data platform for scalable exploration, analysis and modeling of aggregated data. bioRxiv (2023). https://doi.org/10.1101/2023.10.30.563174

Dataset Card Authors

CZI Single-Cell Biology Program, CZI Science Technology team, Lattice Curation Team

Dataset Card Contact

cellxgene@chanzuckerberg.com

Uses

Primary Use Cases

Single-Cell Transcriptomics Research: Exploration of gene expression at the single-cell level across various tissues and conditions.
Cell Type Discovery: Identifying and characterizing novel or rare cell types based on gene expression profiles.
Disease Research: Investigating how cellular gene expression changes in response to disease or treatment conditions.
Tissue-Specific Expression Analysis: Studying the gene expression patterns of specific tissues and organs in healthy and diseased states.
Spatial Transcriptomics Analysis: Investigating spatially resolved patterns of gene expression using spatial data from Visium and Slide-seq assays.
Model Training and Computational Method Development: Leveraging large-scale single-cell and spatial datasets to train machine learning models, develop novel computational methods, and test algorithms for applications like cell state prediction, data integration, and in silico experimentation.

Out-of-Scope or Unauthorized Use Cases

Use without attribution. Users may otherwise do anything they wish with the data, at their own risk.
Any use that is not in accordance with the Acceptable Use Policy.

Dataset Structure

The dataset consists of:

Single-cell and spatial RNA sequencing data for over 74 million human cells.
Cell-level metadata, including tissue origin, disease state, and experimental conditions.
Gene expression matrices containing raw and normalized expression values for thousands of genes across individual cells.

Read more about the Census Schema and the TileDB-SOMA implementation.
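As an illustration of how the cell-level metadata can drive a query, the following minimal sketch builds a filter expression of the kind accepted by the `obs_value_filter` argument of the `cellxgene_census` Python package. The helper names `build_obs_filter` and `fetch_cells` are hypothetical, and the column names used (`tissue_general`, `is_primary_data`, `cell_type`) are assumed from the Census schema and should be checked against the schema of the release in use. The network call is isolated in `fetch_cells`, which requires the third-party `cellxgene-census` package and is not executed here.

```python
def build_obs_filter(**conditions):
    """Join per-column conditions with 'and' into a value-filter string.

    Booleans become `col == True/False`, lists become `col in [...]`,
    and everything else becomes a quoted equality test.
    """
    parts = []
    for column, value in conditions.items():
        if isinstance(value, bool):
            parts.append(f"{column} == {value}")
        elif isinstance(value, (list, tuple)):
            quoted = ", ".join(repr(str(v)) for v in value)
            parts.append(f"{column} in [{quoted}]")
        else:
            parts.append(f"{column} == {str(value)!r}")
    return " and ".join(parts)


def fetch_cells(obs_filter, census_version="2024-07-01"):
    """Hypothetical wrapper: fetch matching human cells as an AnnData object.

    Assumes the cellxgene-census package is installed and the network
    is reachable; imported lazily so the helper above works without it.
    """
    import cellxgene_census

    with cellxgene_census.open_soma(census_version=census_version) as census:
        return cellxgene_census.get_anndata(
            census,
            organism="Homo sapiens",
            obs_value_filter=obs_filter,
        )


# Restrict to primary (non-duplicated) lung cells.
lung_filter = build_obs_filter(tissue_general="lung", is_primary_data=True)
print(lung_filter)
```

Filtering on `is_primary_data` is one way to avoid counting the same cell twice when it appears in more than one contributing dataset; whether that is appropriate depends on the analysis.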

Personal and Sensitive Information

The dataset contains de-identified data; ethical guidelines have been followed for human data use, and no personally identifiable information is included.

Dataset Creation

Curation Rationale

The dataset was curated to provide a comprehensive and accessible resource for studying human single-cell gene expression across multiple biological contexts. The goal is to facilitate data re-use through large-scale computational analyses and model building to enable discoveries related to cell biology and disease.

Source Data

scRNA-seq and spatial data from multiple human tissue samples, aggregated from public studies, including healthy and diseased samples.

Who are the source data producers?

The original data comes from a variety of research teams contributing to publicly available single-cell RNA sequencing datasets, standardized by the CZ CELLxGENE and Lattice teams. For more information, please refer to the CZ CELLxGENE Discover preprint.

Data Collection and Processing

Data was collected from human and mouse tissue samples processed using single-cell and spatial RNA sequencing technologies. The raw and author-processed data was then aggregated using the CZ CELLxGENE platform, allowing for cross-study comparisons and integrative analyses. For spatial assays, additional metadata like spatial coordinates and image data were included where applicable.

Annotation Process

Cell type annotations are provided by the original authors according to their chosen annotation methods. These methods may involve expert knowledge using canonical markers or automated classification tools such as CellTypist. These initial annotations are stored in the "author_cell_type" column of the obs dataframe in the original AnnData objects submitted to the CZ CELLxGENE portal.

Upon submission, the CZ CELLxGENE curation team works with dataset authors to standardize the annotations using the Cell Ontology (CL), ensuring consistency and interoperability. These standardized annotations are stored in the "cell_type" field, providing a reliable resource for further analysis. Both original and standardized annotations are accessible through the CZ CELLxGENE Discover portal or API.
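The relationship between the two annotation fields can be sketched as follows. This is an illustrative toy, not the curation team's actual procedure, which is a manual collaboration with dataset authors: the mapping table `AUTHOR_LABEL_TO_CL`, the function `standardize`, and the example rows are all hypothetical. The sketch only shows the idea of keeping the free-text author label untouched while adding a standardized Cell Ontology label next to it.

```python
# Hypothetical curated lookup from free-text author labels to
# (Cell Ontology term ID, Cell Ontology label). Illustrative only.
AUTHOR_LABEL_TO_CL = {
    "t cells": ("CL:0000084", "T cell"),
    "b cells": ("CL:0000236", "B cell"),
    "neurons": ("CL:0000540", "neuron"),
}


def standardize(obs_rows):
    """Add standardized fields next to each row's author label.

    Returns (annotated_rows, unmapped_labels). Rows whose author label
    is missing from the lookup are left without a standardized label
    and reported for manual review.
    """
    annotated, unmapped = [], set()
    for row in obs_rows:
        new_row = dict(row)  # keep the original author_cell_type intact
        key = row["author_cell_type"].strip().lower()
        if key in AUTHOR_LABEL_TO_CL:
            term_id, label = AUTHOR_LABEL_TO_CL[key]
            new_row["cell_type_ontology_term_id"] = term_id
            new_row["cell_type"] = label
        else:
            unmapped.add(row["author_cell_type"])
        annotated.append(new_row)
    return annotated, sorted(unmapped)


rows = [
    {"author_cell_type": "T cells"},
    {"author_cell_type": "Neurons"},
    {"author_cell_type": "cluster_7"},
]
annotated_rows, needs_review = standardize(rows)
```

In this toy run, the first two rows gain a `cell_type` value, while `cluster_7` is returned in `needs_review` because it has no entry in the hypothetical lookup.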

Who are the annotators?

Researchers are the primary curators, generating initial annotations based on their datasets. The Lattice curation team coordinates standardization and quality assurance.

Bias, Risks, and Limitations

Potential Biases:
- The dataset may overrepresent certain tissue types or disease conditions, as it depends on the availability of public scRNA-seq studies.
- Technical biases inherent to scRNA-seq technology, such as dropout events or under-detection of low-abundance transcripts, may be present.
Limitations:
- The dataset may not capture all cell types or conditions present in human tissues due to limited coverage from contributing studies.
- The dataset focuses on single-cell and spatial gene expression and does not include additional data types like protein expression or open chromatin information.
- Limited to the quality and scope of the publicly available datasets it aggregates.
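One simple way to check for the overrepresentation described above is to tabulate the share of cells per tissue before running an analysis. The sketch below assumes the cell metadata has already been loaded as a list of dictionaries with a `tissue_general` key (a column name assumed from the Census schema); the function name `tissue_shares` and the toy rows are hypothetical.

```python
from collections import Counter


def tissue_shares(obs_rows, column="tissue_general"):
    """Return the fraction of cells per category, largest first."""
    counts = Counter(row[column] for row in obs_rows)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.most_common()}


# Toy metadata standing in for rows of the obs table.
toy_obs = [
    {"tissue_general": "blood"},
    {"tissue_general": "blood"},
    {"tissue_general": "blood"},
    {"tissue_general": "lung"},
]
shares = tissue_shares(toy_obs)
print(shares)
```

A strongly skewed distribution suggests that downstream models or statistics may need reweighting or subsampling, although the right correction depends on the question being asked.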

Caveats and Recommendations

Users should carefully consider the biological context and study design of the contributing datasets before drawing conclusions. For robust insights, it is recommended to validate findings experimentally or with complementary datasets.
We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when using our services.

Acknowledgements

CZ CELLxGENE Team, Lattice, and all contributing authors to the corpus.