K562 Essential Perturb-seq Benchmark Dataset

Version v1.0, processed
released 03 Sept 2025

The original dataset is part of a large-scale genotype-phenotype map developed by Replogle et al. (2022). It contains gene expression profiles from the human chronic myeloid leukemia cell line K562 after genetic perturbation of essential genes using CRISPR interference. The dataset was processed to benchmark models that perform genetic perturbation prediction tasks.

Developed By

NVIDIA Corporation

Dataset Overview

Data Type

Perturb-seq data

Citation

Publication of source dataset: Replogle, J. M. et al. (2022) Mapping information-rich genotype-phenotype landscapes with genome-scale Perturb-seq. Cell 185: 2559-2575.e28. DOI: 10.1016/j.cell.2022.05.013.

Source Dataset Download Date

The original dataset was downloaded on August 1, 2025.

Dataset Card Authors

Karyna Rosario Cora (CZI) and Michelle Gill (NVIDIA)

Dataset Card Contact

virtualcellmodels@chanzuckerberg.com

Uses

Primary Use Cases

  • Evaluation of model performance for genetic perturbation prediction

Out-of-Scope or Unauthorized Use Cases

Do not use the dataset for the following purposes:

  • Use that violates applicable laws, regulations (including trade compliance laws), or third party rights such as privacy or intellectual property rights.
  • Any use that is prohibited by the CC BY-NC-SA 4.0 license.

Dataset Structure

The processed dataset is a single H5ad file. The .X matrix contains the per-cell gene counts as provided by the authors. The file's unstructured data (the AnnData .uns slot) holds two sets of differentially expressed genes for each condition, one determined with the Wilcoxon rank-sum test and the other with a t-test. It also holds a mapping between control and target cells for each condition, determined by GEM group and library size (UMI count).
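As a quick orientation, the layout described above can be inspected with a few lines of Python. This is a minimal sketch, assuming the third-party anndata package is installed; the helper names and the file name in the comment are hypothetical, and the actual .uns key names are not stated in this card, so the sketch only lists whatever keys are present.

```python
from typing import Any, Dict


def summarize_adata(adata: Any) -> Dict[str, Any]:
    """Collect a few top-level facts about an AnnData-like object."""
    n_cells, n_genes = adata.X.shape
    return {
        "n_cells": n_cells,
        "n_genes": n_genes,
        "obs_columns": list(adata.obs.columns),  # per-cell metadata fields
        "uns_keys": sorted(adata.uns.keys()),    # unstructured entries (DE results, control mapping)
    }


def load_and_summarize(path: str) -> Dict[str, Any]:
    """Open an H5ad file in backed mode and summarize it."""
    # Imported lazily so that summarize_adata stays free of third-party dependencies.
    import anndata as ad

    # backed="r" keeps .X on disk instead of loading the full matrix into memory.
    adata = ad.read_h5ad(path, backed="r")
    return summarize_adata(adata)


# Example call (hypothetical file name; substitute the path of the downloaded file):
# print(load_and_summarize("k562_essential_processed.h5ad"))
```

Backed mode is used here only because single-cell count matrices can be large; reading the file fully into memory with the default arguments works the same way for the summary shown.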

Personal and Sensitive Information

The dataset was de-identified by the original authors and only includes RNA data and associated metadata in H5ad format.

Dataset Creation

Curation Rationale

The K562 Essential Perturb-seq dataset was used for model benchmarking. The dataset was curated to ensure compliance with the requirements of the evaluated models, inclusion of required metadata, format consistency, and compatibility across the datasets used for model evaluation.

Source Data

The original data file (K562_essential_raw_singlecell_01.h5ad) was downloaded in H5ad format from Figshare, using the URL provided in the original publication (Replogle et al., 2022).

Who are the source data producers?

The original dataset was generated by Replogle et al., 2022.

Data Collection and Processing

The dataset was originally generated as part of a larger effort to create a genome-scale genotype-phenotype map. It includes Perturb-seq data from K562 cells screened for six days post-transduction. After the screening period, the cells were analyzed using droplet-based scRNA-seq with direct guide capture to link each cell's genetic perturbation to its transcriptional profile. After sequencing, a computational framework was used to align the reads, identify cells, and assign single-guide RNAs to each cell. The resulting dataset has a median of more than 100 cells per perturbation after filtering. The dataset, as provided by the authors, was further processed to:

  • add the matched control cell IDs to the AnnData unstructured data
  • add the two differential gene expression analysis results to the AnnData unstructured data
  • normalize column names in the metadata (obs) to those expected by the framework
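To make the control-matching step more concrete, the sketch below pairs each perturbed cell with a control cell from the same GEM group whose UMI count is closest. The card states only that matching uses GEM group and library size; the nearest-neighbor rule, the reuse of control cells, the function name, and the tuple layout are all assumptions made for illustration and may differ from the processing actually applied.

```python
from bisect import bisect_left
from collections import defaultdict
from typing import Dict, List, Tuple

Cell = Tuple[str, int, int]  # (cell_id, gem_group, umi_count); hypothetical layout


def match_controls(targets: List[Cell], controls: List[Cell]) -> Dict[str, str]:
    """Map each target cell to the control cell in the same GEM group
    with the closest UMI count. Control cells may be reused."""
    # Index controls by GEM group, sorted by UMI count for binary search.
    by_group: Dict[int, List[Tuple[int, str]]] = defaultdict(list)
    for cell_id, gem_group, umi in controls:
        by_group[gem_group].append((umi, cell_id))
    for group in by_group.values():
        group.sort()

    matches: Dict[str, str] = {}
    for cell_id, gem_group, umi in targets:
        candidates = by_group.get(gem_group)
        if not candidates:
            continue  # no control in this GEM group; the target stays unmatched
        i = bisect_left(candidates, (umi, ""))
        # Only the controls on either side of the insertion point can be closest.
        neighbours = candidates[max(i - 1, 0): i + 1]
        best = min(neighbours, key=lambda c: abs(c[0] - umi))
        matches[cell_id] = best[1]
    return matches


# Toy example with made-up cell IDs and UMI counts.
example_controls = [("c1", 1, 900), ("c2", 1, 1500), ("c3", 2, 1000)]
example_targets = [("t1", 1, 1000), ("t2", 1, 1400), ("t3", 2, 5000), ("t4", 3, 800)]
print(match_controls(example_targets, example_controls))
# {'t1': 'c1', 't2': 'c2', 't3': 'c3'}  (t4 has no control in GEM group 3)
```

Matching within a GEM group keeps batch effects comparable between a target cell and its control, and matching on UMI count keeps sequencing depth comparable; a real pipeline might instead sample several controls per target or match without replacement.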

Annotation process

The data was annotated by clustering genetic perturbations based on their transcriptional phenotypes (see paper for details).

Who are the annotators?

The dataset was annotated by the original data generators (see Replogle et al., 2022).

Biases, Risks, and Limitations

Risks

  • Some of the observed phenotypes could result from unintended effects, such as knockdown of a neighboring gene, rather than from knockdown of the intended target; off-target activity was not directly measured.

Limitations

  • The dataset includes a limited number of cells per perturbation, which may reduce the statistical power to detect subtle phenotypes.
  • The dataset is limited in scope: it captures data only from K562 cells and from experiments conducted at a limited number of time points.

Caveats and Recommendations

  • Users should carefully consider the biological context and study design of the contributing datasets before drawing conclusions.
  • We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when using our services.

Acknowledgements

Michelle Gill, Polina Binder, and Jasleen Grewal