Allen + Sound Life Benchmark Dataset

Version v1.0, processed, released 15 Sept 2025

The original dataset includes longitudinal single-cell RNA sequencing (scRNA-seq) profiles from healthy young (25–35 years) and older (55–65 years) adults. It captures immune cell dynamics across age using over 13 million peripheral blood mononuclear cells (PBMCs) sampled over two years. The dataset was processed to obtain two versions used to evaluate model embedding consistency over sequential or temporal labels as well as metadata label prediction. This data card describes both versions of the processed dataset: Allen+Sound Life - immune_variation and Allen+Sound Life - flu_vax_response.

Dataset Overview

Data Type

RNA sequencing data

Citation

Gong, Q. et al. (2024) Longitudinal Multi-omic Immune Profiling Reveals Age-Related Immune Cell Dynamics in Healthy Adults. bioRxiv 2024.09.10.612119. DOI:10.1101/2024.09.10.612119

Source Dataset Download Date

The original dataset was downloaded on September 2, 2025.

Dataset Card Authors

Karyna Rosario Cora (CZI), Katrina Kalantar (CZI), and Kasia Kedzierska (Allen Institute)

Dataset Card Contact

Kasia Kedzierska (kasia.kedzierska@alleninstitute.org)

Uses

Primary Use Cases

  • Model evaluation: assessing the consistency of model embeddings over sequential or temporal labels

Out-of-Scope or Unauthorized Use Cases

Do not use the dataset for the following purposes:

  • Use that violates applicable laws, regulations (including trade compliance laws), or third party rights such as privacy or intellectual property rights.
  • Any use that is prohibited by the CC-BY 4.0 license.

Dataset Structure

The data files are in H5ad format and contain sample and subject metadata, cell type labels, and QC metrics. The data was aggregated across subject age groups.

Personal and Sensitive Information

The dataset contains only non-identifiable metadata (e.g., age range, sex, and ethnicity). No personally identifiable information is included. A full description of the available metadata can be found on the Sound Life download page.

Dataset Creation

Curation Rationale

The dataset is used for model benchmarking. The dataset was curated to ensure compliance with evaluated model requirements, inclusion of required metadata, format consistency, and compatibility across datasets used for model evaluation.

Source Data

The original data was downloaded in H5ad format from the Sound Life scRNA-seq Data download page.

Who are the source data producers?

The original dataset was generated by Gong et al., 2024.

Data Collection and Processing

The original dataset was obtained from blood samples collected at up to 10 time points over two years from 96 healthy adults. Sampling occurred before and after two seasonal influenza vaccinations, as well as during non-vaccination periods. The libraries were sequenced on a 10x Genomics 3' scRNA-seq platform (v3.1). The resulting scRNA-seq data was filtered to remove low-quality cells through the following steps:

  • removal of cells identified as doublets
  • removal of cells with more than 10% of total Unique Molecular Identifiers (UMIs) from mitochondrial genes
  • removal of cells with too few (fewer than 200) or too many (more than 2,500) detected genes
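The three filters above can be read as a single boolean mask over per-cell QC metrics. The following is a minimal sketch of that idea, not the authors' pipeline: the column names `n_genes`, `pct_mito`, and `is_doublet` are hypothetical, the toy values are made up, and the thresholds are treated as inclusive bounds on what is kept (cells with exactly 200 or 2,500 genes, or exactly 10% mitochondrial UMIs, pass).

```python
import pandas as pd


def qc_pass_mask(
    obs: pd.DataFrame,
    min_genes: int = 200,
    max_genes: int = 2500,
    max_pct_mito: float = 10.0,
) -> pd.Series:
    """Return True for cells that survive all three QC filters."""
    # Hypothetical column names; adapt to the QC columns in your own table.
    not_doublet = ~obs["is_doublet"].astype(bool)
    # Drop cells with more than `max_pct_mito` percent mitochondrial UMIs.
    mito_ok = obs["pct_mito"] <= max_pct_mito
    # Drop cells with fewer than `min_genes` or more than `max_genes` genes.
    genes_ok = obs["n_genes"].between(min_genes, max_genes)
    return not_doublet & mito_ok & genes_ok


# Toy per-cell metrics (illustrative values only).
cells = pd.DataFrame(
    {
        "n_genes": [150, 200, 1200, 2500, 3000, 900],
        "pct_mito": [2.0, 5.0, 12.0, 10.0, 1.0, 3.0],
        "is_doublet": [False, False, False, False, False, True],
    },
    index=[f"cell_{i}" for i in range(6)],
)

kept = cells[qc_pass_mask(cells)]
print(kept.index.tolist())
```

In this toy table only `cell_1` and `cell_3` pass; every other cell fails exactly one of the filters.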

The dataset was further processed to develop two dataset variations, immune_variation and flu_vax_response, used for model benchmarking purposes (see data processing script). Below we briefly describe the processing steps used to obtain each dataset. For both datasets, the following OBS variables were kept: barcode, biologicalSex, cmv, bmi, ageAtFirstDraw, visitName, subjectAgeAtDraw, ageGroup, batch_id. The following OBS variables were renamed: AIFI_L1 to cell_type_level_1, AIFI_L2 to cell_type_level_2, AIFI_L3 to cell_type_level_3.
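The column selection and renaming described above might look roughly like the following when applied to an `obs`-style table. This is a sketch under assumptions rather than the actual data processing script: the helper name `tidy_obs` is hypothetical, the source label columns are assumed to be spelled `AIFI_L1` to `AIFI_L3` (as in the annotation section), and the toy row contains placeholder values only.

```python
import pandas as pd

# OBS variables listed in this data card as kept unchanged.
KEPT_OBS = [
    "barcode", "biologicalSex", "cmv", "bmi", "ageAtFirstDraw",
    "visitName", "subjectAgeAtDraw", "ageGroup", "batch_id",
]

# Original label columns and their benchmark names (assumed spelling).
RENAMED_OBS = {
    "AIFI_L1": "cell_type_level_1",
    "AIFI_L2": "cell_type_level_2",
    "AIFI_L3": "cell_type_level_3",
}


def tidy_obs(obs: pd.DataFrame) -> pd.DataFrame:
    """Keep only the documented columns and rename the label columns."""
    wanted = KEPT_OBS + list(RENAMED_OBS)
    missing = [col for col in wanted if col not in obs.columns]
    if missing:
        raise KeyError(f"obs is missing expected columns: {missing}")
    return obs[wanted].rename(columns=RENAMED_OBS)


# One placeholder row plus an extra column that should be dropped.
toy_obs = pd.DataFrame(
    {
        **{col: ["placeholder"] for col in KEPT_OBS},
        "AIFI_L1": ["L1_label"],
        "AIFI_L2": ["L2_label"],
        "AIFI_L3": ["L3_label"],
        "extra_metric": [0.5],
    }
)

tidy = tidy_obs(toy_obs)
print(list(tidy.columns))
```

Raising on missing columns, instead of silently skipping them, makes schema drift between source releases visible early.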

Immune_variation:

  • Subject group .h5ad files were downloaded for all groups and T cells were identified from each. These T cell subsets were aggregated into a single file.
  • The resulting T cell file was subset by donor_id to include only samples from the “Immune Variation Day 0” visit, then downsampled to approximately 600K cells to reduce dataset size. This resulted in a dataset of 604,704 cells obtained from 89 donors (filename: allen_soundlife_immune_variation.h5ad).
  • The processed dataset described above was subsampled to 9,483 cells to create a smaller subset for model testing (filename: allen_soundlife_immune_variation_subsampled.h5ad).
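The visit filter and downsampling steps above can be sketched on an `obs`-style table as follows. This is an illustration under assumptions, not the procedure in the data processing script: it uses uniform random sampling without replacement and a fixed seed, whereas the actual script may sample differently (for example, per donor). The function name and the "Other visit" label are hypothetical; `visitName` is the column named in this card.

```python
import pandas as pd


def filter_and_downsample(
    obs: pd.DataFrame,
    visit: str,
    target_cells: int,
    seed: int = 0,
) -> pd.DataFrame:
    """Keep cells from one visit, then cap the count at `target_cells`."""
    at_visit = obs[obs["visitName"] == visit]
    if len(at_visit) <= target_cells:
        # Nothing to downsample; return the filtered cells unchanged.
        return at_visit
    # Uniform sampling without replacement (an assumption, see lead-in).
    return at_visit.sample(n=target_cells, random_state=seed)


# Toy table: 8 cells from the visit of interest, 4 from another visit.
toy = pd.DataFrame(
    {"visitName": ["Immune Variation Day 0"] * 8 + ["Other visit"] * 4}
)

subset = filter_and_downsample(toy, "Immune Variation Day 0", target_cells=5)
print(len(subset), subset["visitName"].unique().tolist())
```

With an AnnData object, the same row labels could then be used to select cells, for example `adata[subset.index].copy()`, assuming the table passed in is `adata.obs`.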

Flu_vax_response:

  • Subject group .h5ad files were downloaded for all groups and B cells were identified from each. These B cell subsets were aggregated into a single file.
  • The resulting B cell file was subset by donor_id to include only samples from the “Immune Variation Day 0” visit, then downsampled to approximately 600K cells to reduce dataset size. This resulted in a dataset of 587,517 cells obtained from 82 donors (filename: allen_soundlife_flu_response.h5ad).
  • The processed dataset described above was subsampled to 7,384 cells to create a smaller subset for model testing (filename: allen_soundlife_flu_response_subsampled.h5ad).

Both dataset versions, including subsets for model testing, can be downloaded from AWS:

  • Allen+Sound Life - immune_variation: s3://cz-benchmarks-data/datasets/v1/allen_soundlife/allen_soundlife_immune_variation.h5ad and s3://cz-benchmarks-data/datasets/v1/allen_soundlife/allen_soundlife_immune_variation_subsampled.h5ad
  • Allen+Sound Life - flu_vax_response: s3://cz-benchmarks-data/datasets/v1/allen_soundlife/allen_soundlife_flu_response.h5ad and s3://cz-benchmarks-data/datasets/v1/allen_soundlife/allen_soundlife_flu_response_subsampled.h5ad
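One possible way to fetch these files from Python is sketched below. It is a sketch under assumptions, not an official download method: it assumes the bucket allows anonymous (unsigned) reads, which this card does not state, and that `boto3` is installed. The helper names `split_s3_uri` and `download` are hypothetical. Only the URI parsing runs in the example; the download call is left commented out.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc or not parsed.path.strip("/"):
        raise ValueError(f"not a valid s3:// URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def download(uri: str, dest: str) -> None:
    """Download one object. Assumes anonymous read access to the bucket."""
    # Imported lazily so the URI helper works without boto3 installed.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    bucket, key = split_s3_uri(uri)
    client = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    client.download_file(bucket, key, dest)


SUBSAMPLED_URI = (
    "s3://cz-benchmarks-data/datasets/v1/allen_soundlife/"
    "allen_soundlife_immune_variation_subsampled.h5ad"
)

bucket, key = split_s3_uri(SUBSAMPLED_URI)
print(bucket)
print(key)

# Uncomment to download (requires network access and boto3):
# download(SUBSAMPLED_URI, "allen_soundlife_immune_variation_subsampled.h5ad")
```

Starting with a subsampled file is a reasonable way to check a pipeline end to end before downloading a full dataset of roughly 600K cells.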

Annotation process

The processed datasets retain the cell type and sample annotations reported by Gong et al., 2024. The original dataset was annotated through a multi-step procedure that combined automated and manual methods to define immune cell types and states. Gong et al., 2024 first built a reference Human Immune Health Atlas using scRNA-seq data from 108 healthy donors. Cells within this reference atlas were categorized using unsupervised clustering and the identification of distinct immune marker genes, yielding 71 highly specific immune cell subsets. The authors then used the reference atlas to train custom models with the CellTypist framework, which automatically assigned the 71 high-resolution cell labels to the millions of PBMCs from the larger longitudinal cohort (see paper for details). For benchmarking purposes, the original columns in the H5ad file were renamed from AIFI_L[1-3] to cell_type_level_[1-3], and the mapping was saved to [immune_variation|flu_vax_response]_adata.uns['original_column_name_mapping'].

Who are the annotators?

The original dataset was annotated by data generators (see Gong et al., 2024).

Biases, Risks, and Limitations

Limitations

  • The longitudinal cohort used to generate the dataset consists of healthy adults from a specific age range and geographic area: young adults (25–35 years old) and older adults (55–65 years old) from the greater Seattle, Washington, USA area. This limited demographic range may affect the generalizability of the findings to other populations and age groups.

Caveats and Recommendations

  • The samples used to generate the dataset were collected during the COVID-19 pandemic, a period of significantly reduced exposure to common viruses. As a result, the observed immune landscape may not be fully representative of typical immune remodeling in other time periods.
  • Users should carefully consider the biological context and study design of the dataset before drawing conclusions.
  • We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when using our services.

Acknowledgements

The Sound Life dataset was supported by the Allen Institute, the Benaroya Research Institute, and a National Institute on Aging award.