CZII CryoET Object Identification Benchmarking Dataset

Version v1.0, released 30 Oct 2024

Developed By

Ariana Peck, Yue Yu, Jonathan Schwartz, Anchi Cheng, Utz Heinrich Ermel, Joshua Hutchings, Saugat Kandel, Dari Kimanius, Elizabeth Montabana, Daniel Serwas, Hannah Siems, Feng Wang, Zhuowen Zhao, Shawn Zheng, Matthias Haury, David Agard, Clinton S. Potter, Bridget Carragher, Kyle Harrington, Mohammadreza Paraan

The CZII CryoET Object Identification dataset is a benchmark for evaluating cryoET particle picking algorithms. This cryoET “phantom” dataset was used in the Kaggle competition organized by the Chan Zuckerberg Imaging Institute (CZII). It comprises four subsets: experimental training data (7 runs), public test data (121 runs), private test data (364 runs), and simulated data (27 runs). Each run contains ground truth annotations for six particle types: apo-ferritin, beta-amylase, beta-galactosidase, ribosome, thyroglobulin, and virus-like particle. The entire phantom dataset now lives on the CryoET Data Portal, where raw tomogram data (e.g., tilt series), processed tomograms, ground truth labels, and competition winner labels can be easily found.

Dataset Overview

Data Type

Imaging (CryoET data)

Citation

Peck, A., et al., (2025) A Real-World Phantom Dataset to Spur Innovation in CryoET Data Annotation. Accepted for publication in Nature Methods. Preprint: DOI: 10.1101/2024.11.04.621686

Data Card Authors

Reza Paraan and Utz Heinrich Ermel (Chan Zuckerberg Imaging Institute)

Data Card Contact

Reza Paraan reza.paraan@czii.org

Uses

Primary Use Cases

  • Evaluation of model performance for particle picking and annotation from cryoET tomograms.

Out-of-Scope or Unauthorized Use Cases

Do not use the dataset for the following purposes:

  • Use that violates applicable laws, regulations (including trade compliance laws), or third party rights such as privacy or intellectual property rights.
  • Any use that is prohibited by the CC0 license.

Intended Users

  • ML developers
  • CryoET tool developers

Dataset Structure

The phantom dataset was created as part of the Kaggle competition organized by CZII. To become familiar with the general dataset organization, see the CryoET Portal Data Schema. There are four tomograms for each experimental run in the training, public test, and private test datasets. All tomograms were reconstructed using the same tomographic alignment. As an example, see the four tomograms included with run TS_100_4, which is part of the private test dataset. For the simulated dataset, only one tomogram is provided per run. For an overview of the available experimental tomograms and processing methods, see the table below.

| Portal Name Pattern     | Reconstruction Method    | Postprocessing Method   | Voxel Spacing [Å] |
| ----------------------- | ------------------------ | ----------------------- | ----------------- |
| {Tomo ID} WBP Filtered  | WBP + CTF deconvolution  | none                    | 4.990             |
| {Tomo ID} WBP Denoised  | WBP + CTF deconvolution  | Denoised with DenoisET  | 4.990             |
| {Tomo ID} WBP Filtered  | WBP + CTF deconvolution  | none                    | 10.012            |
| {Tomo ID} WBP Denoised  | WBP + CTF deconvolution  | Denoised with DenoisET  | 10.012            |
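Ground truth picks on the Data Portal are point annotations with coordinates in angstroms, while tomogram arrays are indexed in voxels, so picks must be divided by the voxel spacing of the reconstruction being used (4.990 Å or 10.012 Å above). The exact annotation schema is documented on the portal; the minimal ndjson shape used below is an assumption for illustration only.

```python
import json

import numpy as np

# Hypothetical ndjson point annotations (one JSON object per line),
# assuming a portal-style "location" dict with x/y/z in angstroms.
ndjson = """\
{"type": "point", "location": {"x": 2495.0, "y": 4990.0, "z": 499.0}}
{"type": "point", "location": {"x": 100.2, "y": 50.06, "z": 10.01}}
"""

def picks_to_voxels(ndjson_text, voxel_spacing):
    """Convert point annotations (angstroms) to (z, y, x) voxel indices
    for a tomogram reconstructed at `voxel_spacing` angstroms/voxel."""
    coords = []
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue
        loc = json.loads(line)["location"]
        # Tomogram arrays are typically stored with axes ordered z, y, x.
        coords.append([loc["z"], loc["y"], loc["x"]])
    return np.round(np.asarray(coords) / voxel_spacing).astype(int)
```

The same pick maps to different voxel indices in the 4.990 Å and 10.012 Å reconstructions, so the spacing must match the tomogram actually loaded.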

Personal and Sensitive Information

The dataset does not include personal or sensitive information.

Dataset Creation

Curation Rationale

Generating particle picks from cryoET tomograms is a difficult task. The phantom dataset was created to benchmark picking algorithms that may facilitate this process. Given that current particle picking and curation tools are very limited, the “ground truth” labels for this dataset were curated to avoid false positives at the cost of having some false negatives (missed picks). Therefore, the ground truth annotations in the phantom dataset are accurate but do not include all particles.

Data Collection and Processing

The tilt series for each experimental run included in this cryoET phantom dataset were collected on a Krios G4 at CZII in one session using the same set of collection parameters. AreTomo3 was used for tilt series alignment and tomogram reconstruction. Reconstructed tomograms were denoised using DenoisET. See paper for data processing details.

Annotation process

Six protein complexes are annotated in the phantom dataset: 80S ribosomes, virus-like particles, apoferritin, thyroglobulin, beta-galactosidase, and beta-amylase. Arriving at the final set of ground truth annotations for each protein complex involved a variety of tools, because no single approach worked across all protein complexes. Briefly, annotations were generated using a combination of manual labeling, 2D/3D template matching, and two different 3D deep learning workflows. The generated annotations were then curated by 2D/3D structure determination and classification, 2D deep learning, and manual cleaning and verification. See the paper for annotation details.

Who are the annotators?

The phantom dataset was annotated by the CZII team that generated and processed the data.

Bias, Risks, and Limitations

Limitations

  • Datasets may include unannotated particles
  • Annotations are limited to six protein complexes

Caveats and Recommendations

  • When evaluating a model against the ground truth particle locations, it is recommended to use a metric that reduces the penalty for false positives, to account for the missing annotations. One example is the Fβ-score (with β > 1, which weights recall over precision) employed during the Kaggle competition. At the time of the competition, only the 10.012 Å tomograms were available, so model performance on the higher-resolution 4.990 Å tomograms might differ.
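The recommendation above can be made concrete with a small sketch. The official competition metric is defined in the Kaggle challenge and the paper; the version below is a simplification that uses greedy nearest-neighbor matching within a distance threshold and an illustrative β = 4 default, so it should not be taken as the exact competition implementation.

```python
import numpy as np

def fbeta_picks(pred, gt, radius, beta=4.0):
    """F-beta score for particle picks.

    Predicted points are matched one-to-one to ground truth points within
    `radius` (same units as the coordinates, e.g. angstroms). With beta > 1,
    recall is weighted over precision, softening the penalty for false
    positives relative to false negatives (missed picks).
    """
    pred = np.asarray(pred, dtype=float).reshape(-1, 3)
    gt = np.asarray(gt, dtype=float).reshape(-1, 3)
    matched = set()  # indices of ground truth points already claimed
    tp = 0
    for p in pred:
        if gt.size == 0:
            break
        d = np.linalg.norm(gt - p, axis=1)
        d[list(matched)] = np.inf  # each gt point matches at most once
        j = int(np.argmin(d))
        if d[j] <= radius:
            matched.add(j)
            tp += 1
    fp = len(pred) - tp
    fn = len(gt) - tp
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For the same set of picks, a larger β yields a higher score when the errors are false positives, which is the desired behavior given that the ground truth is deliberately conservative.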

Acknowledgements

Thanks to all the individuals and organizations who contributed to the creation of the dataset and Kaggle competition:

  • Emma Lundberg (Stanford), Ellen Zhong (Princeton), Thorsten Wagner (MPI of Molecular Physiology), Tristan Bepler (NYSBC), Robert Kiewisz (NYSBC), Alister Burt (Genentech), and Lorenzo Gaifas (Grenoble) provided valuable insights for the design of the challenge.
  • Manuel Leonetti, Shivanshi Vaid, Madhuri Vangipuram, and Rodrigo Baltazar from CZ Biohub San Francisco generated HEK293T LAMP1-GFP cell lines and shared protocols.
  • Feng Wang and Simon Sander from David Agard's lab (UCSF) provided grids and grid functionalization protocols and contributed to protocol optimization.
  • Peng Jin from the Jan Lab (UCSF) provided a modified GFP-nanobody construct.
  • Mykhailo Kopylov and Charlie Dubbledam (NYSBC) provided VLPs.
  • Colleagues from the Chan Zuckerberg Initiative SciTech team and CZ Imaging Institute who participated in the pickathon, including: Ashley Anderson, Ben Nelson, Jun Ni, Ellaine Chou, Jessica Gadling, Kandarp Khandwala, Chili Chiu, Ann Jones, Timmy Huang, Janeece Pourroy, Dannielle McCarthy, Andy Sweet, Eric Wang, Kirsty Ewing, Mikala Caton, Manasa Venkatakrishnan, Yongbaek Cho, Nina Borja, Norbert Hill, Carmela Villegas, Shu-Hsien Sheu, Gorica Margulis, Noeli Pazsoldan.
  • Colleagues from the Chan Zuckerberg Initiative SciTech team who contributed to the development of the CryoET Data Portal: Jun Xi Ni, Jessica Gadling, Manasa Venkatakrishnan, Kira Evans, Jeremy Asuncion, Andrew Sweet, Janeece Pourroy, Zun Shi Wang, Kandarp Khandwala, Benjamin Nelson, Dannielle McCarthy, Eric M Wang, Richa Agarwal, Trent Smith, Bryan Chu, Dana Sadgat, Erin Hoops, Justine Larsen.
  • Kristen Maitland and Stephani Otte supported planning and execution of the competition.