Designing an ML Competition for CryoET Data with Limited Annotations
Take a behind-the-scenes look at the Chan Zuckerberg Imaging Institute's efforts to develop and host an ML competition that boosts cryo-electron tomography particle detection.
Kyle Harrington | December 9, 2024
Introduction
There is a critical bottleneck in cryo-electron tomography (cryoET): the dependence on manual curation of particle annotations for the reconstruction and analysis of structures. CryoET scientists need to identify 3D structures within tomograms, such as proteins or macromolecular complexes, to study cellular machinery at near-atomic resolution. However, the annotation process is time-consuming and cuts into researchers' capacity to analyze new and diverse proteins, especially since current particle-picking models often underperform in recognizing these varied structures. Our ML competition aims to push the field forward by evaluating models that can achieve high performance across multiple particle types of varying sizes in a complex experimental dataset. Additionally, we provide only a small number of tomograms with limited annotations, mirroring the real-world constraint that researchers can supply only a small number of annotations themselves.
Challenges of cryoET and Limited Annotations
A major challenge in cryoET is "particle picking": detecting the locations of particles (usually proteins or macromolecular complexes) within 3D tomograms. This process is essential for achieving high-resolution structures in studies that use subtomogram averaging. Beyond standard object detection, cryoET particle picking involves identifying numerous particle types with variable sizes in images subject to unique challenges such as imaging artifacts and noise. Traditional models, often based on YOLO [Wagner et al, 2019], ResNet [Bepler et al, 2020], and U-Net architectures [Moebel et al, 2021], struggle to generalize across diverse datasets and require extensive annotations to achieve reliable performance. Our competition addresses this gap by prioritizing models that perform well under limited-annotation scenarios, simulating real-world conditions for cryoET researchers.
A Dataset from Scratch
During the initial design of our competition, we engaged with internal and external stakeholders and found that the field was looking for a new experimental dataset, containing diverse particle types, that could be used to develop new particle-picking methods. Most current methods had been developed on either synthetic data or the canonical benchmarking particle, the ribosome.
To address this, in a CZ Imaging Institute-wide effort, a "phantom" sample was developed, imaged, processed, and annotated [Peck et al, 2024]. This "phantom" was created as a representative experimental sample, in which ribosomes and background material were obtained from cell lysate and then supplemented with six particles of interest and biological relevance: apoferritin, thyroglobulin (THG), beta-galactosidase, beta-amylase, human serum albumin (HSA), and virus-like particles (VLPs). These particles were deliberately selected based on their size (12-30 nm) and shape. After imaging and processing, we found that 492 tomograms were of usable quality for the competition.
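To give a sense of scale, the short sketch below converts these physical particle diameters into approximate voxel extents; the 10 Å voxel spacing used here is an assumption for illustration only, not a statement about the released tomograms.

```python
# Rough scale check: how many voxels does a 12-30 nm particle span?
# The voxel spacing below is an assumed value for illustration and may
# differ from the spacing of the released tomograms.
VOXEL_SPACING_ANGSTROM = 10.0

def diameter_in_voxels(diameter_nm: float, spacing_angstrom: float = VOXEL_SPACING_ANGSTROM) -> float:
    """Convert a particle diameter in nanometers to a voxel extent (1 nm = 10 angstroms)."""
    return diameter_nm * 10.0 / spacing_angstrom

for label, diameter_nm in [("smallest particles (~12 nm)", 12.0), ("largest particles (~30 nm)", 30.0)]:
    print(f"{label}: ~{diameter_in_voxels(diameter_nm):.0f} voxels across")
```

At a spacing like this, the smallest particles span only on the order of a dozen voxels, which helps explain why comprehensive manual annotation is so difficult.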
However, creating annotations for these tomograms was a feat that involved many rounds of iteration, a data curation event with participants from across the whole Chan Zuckerberg ecosystem, numerous computational tools, and many experts. The complete annotation process is described in [Peck et al, 2024], while some of the tools that were developed to support this and future competitions are described in [Harrington et al, 2024a]. An important point about the resulting annotations is that, due to the imaging and visual properties of the particles (especially the smaller ones), it was not possible to create 100% comprehensive annotations. Not only does this emphasize the need for computational tools to detect these particles, but it also hints at the challenge of evaluating performance.
The Dilemma of Evaluation
A major constraint in designing the challenge was evaluating model performance given that the ground truth annotations are not perfect. Because false positives can be filtered reliably with post-processing methods such as class averaging and clustering, we focused on evaluation metrics that penalize false negatives (missing known objects) more heavily than false positives (predicting an object where none is present).
Furthermore, we needed to account for the leaderboard mechanism of our competition, which uses two test sets: a public leaderboard during the competition and a private leaderboard to determine the ranking for prizes. The constraint this imposes is that the ranking should be as consistent as possible between the public and private datasets, and we had to verify this before knowing which models would be submitted.
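One way to quantify such agreement, sketched below, is to score a batch of candidate submissions on both splits and compute a rank correlation between the resulting leaderboards; this is an illustrative check with made-up scores, not the exact procedure we used.

```python
# Sketch: quantify public/private leaderboard agreement with a rank correlation.
# `public_scores` and `private_scores` are hypothetical metric values for the
# same set of synthetic submissions evaluated on each test split.
from scipy.stats import spearmanr

public_scores = [0.81, 0.74, 0.69, 0.55, 0.42, 0.31]
private_scores = [0.79, 0.76, 0.65, 0.58, 0.40, 0.33]

rho, p_value = spearmanr(public_scores, private_scores)
print(f"Spearman rank correlation between leaderboards: {rho:.3f}")
# Values near 1.0 indicate the two splits would rank submissions similarly.
```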
We addressed this by "red teaming" the competition: we tested adversarial picking strategies and synthesized submissions of varying quality to create an artificial leaderboard, which supported our exploration and testing of evaluation metrics. We ultimately chose a micro-averaged F-beta score with per-particle weighting and a distance-threshold-based point matching mechanism. The complete definition of the evaluation metric can be found in [Peck et al, 2024].
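As a rough, unofficial sketch of how such a metric operates, the code below matches predicted points to ground-truth points within a per-particle distance threshold and aggregates weighted counts into a micro-averaged F-beta score, with beta > 1 so that missed particles cost more than spurious ones. The thresholds, weights, and beta value are placeholders, not the competition's actual parameters.

```python
# Simplified sketch of a distance-thresholded, weighted, micro-averaged F-beta.
# Not the official competition metric; thresholds, weights, and beta are placeholders.
import numpy as np
from scipy.spatial import cKDTree

def match_counts(pred_pts: np.ndarray, true_pts: np.ndarray, threshold: float):
    """Greedily match predictions to ground truth within `threshold`; return (TP, FP, FN)."""
    if len(true_pts) == 0:
        return 0, len(pred_pts), 0
    if len(pred_pts) == 0:
        return 0, 0, len(true_pts)
    tree = cKDTree(true_pts)
    dists, idx = tree.query(pred_pts, k=1)       # nearest ground-truth point per prediction
    matched_truth, tp = set(), 0
    for d, j in sorted(zip(dists, idx)):         # closest predictions claim truth points first
        if d <= threshold and j not in matched_truth:
            matched_truth.add(j)
            tp += 1
    return tp, len(pred_pts) - tp, len(true_pts) - tp

def weighted_fbeta(per_class_counts: dict, weights: dict, beta: float = 4.0) -> float:
    """Micro-average weighted TP/FP/FN across particle types, then compute F-beta."""
    tp = sum(weights[c] * counts[0] for c, counts in per_class_counts.items())
    fp = sum(weights[c] * counts[1] for c, counts in per_class_counts.items())
    fn = sum(weights[c] * counts[2] for c, counts in per_class_counts.items())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    # beta > 1 weights recall more heavily, i.e. false negatives are penalized more.
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# Toy usage with two hypothetical particle types and placeholder parameters.
rng = np.random.default_rng(0)
truth = {"ribosome": rng.uniform(0, 100, (20, 3)), "apoferritin": rng.uniform(0, 100, (30, 3))}
preds = {"ribosome": truth["ribosome"] + rng.normal(0, 1, (20, 3)),
         "apoferritin": rng.uniform(0, 100, (25, 3))}
thresholds = {"ribosome": 15.0, "apoferritin": 6.0}   # placeholder distance thresholds
weights = {"ribosome": 1.0, "apoferritin": 2.0}       # placeholder per-particle weights
counts = {c: match_counts(preds[c], truth[c], thresholds[c]) for c in truth}
print(f"weighted micro-averaged F-beta: {weighted_fbeta(counts, weights):.3f}")
```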
The Tools
To streamline data handling, annotation, and model testing, we developed a suite of open-source tools, including:
- copick for dataset management, enabling easy access to tomograms, particle picks, and segmentation masks, with copick-utils for convenience functions and copick-torch to support data access and processing in PyTorch.
- visualization tools for interactive inspection, allowing participants to view tomograms, particle locations, and segmentation masks in napari, and to follow collaborative annotation projects with copicklive.
- catalogs of open-source, reusable, executable solutions for model training, evaluation, visualization, and data conversion.
These tools, particularly the visualization components, allow participants to directly compare outputs from multiple models and to visually inspect predictions. An extended description of the tools is presented in [Harrington et al, 2024a].
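As a minimal example of this kind of interactive inspection (using napari directly rather than the competition-specific tooling; the volume and pick coordinates below are synthetic placeholders):

```python
# Minimal napari sketch: overlay candidate particle picks on a tomogram volume.
# The volume and point coordinates are synthetic placeholders, not competition data.
import napari
import numpy as np

tomogram = np.random.rand(64, 256, 256).astype(np.float32)          # stand-in (z, y, x) volume
picks = np.array([[32, 100, 120], [40, 180, 60], [20, 50, 200]])    # (z, y, x) candidate centers

viewer = napari.Viewer()
viewer.add_image(tomogram, name="tomogram", colormap="gray")
viewer.add_points(picks, name="candidate picks", size=12, face_color="red")
napari.run()  # start the event loop for interactive inspection
```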
The Competition
The competition itself began on November 6, 2024, and is hosted on Kaggle [Harrington et al, 2024b]. Prizes are expected to be awarded to 10 submissions, with the hope of incentivizing a broad set of solutions. Upon completion of the competition, representatives of the winning teams and the community will be invited to a workshop to discuss the results and future directions for competitions and particle picking in cryoET.
We developed this competition to help stimulate the development of new methods for particle picking in cryoET and to establish a benchmark for the community. By following guidance from the community and carefully designing our dataset to match the current needs of the field, we hope to establish a focused benchmark for particle picking across particle sizes in experimental data. Finally, while we are constrained to working with limited annotations now, we hope to continue to refine and improve this dataset until the annotations are comprehensive.
References
- [Wagner et al, 2019] - Wagner, T., Merino, F., Stabrin, M., Moriya, T., Antoni, C., Apelbaum, A., Hagel, P., Sitsel, O., Raisch, T., Prumbaum, D. and Quentin, D., 2019. SPHIRE-crYOLO is a fast and accurate fully automated particle picker for cryo-EM. Communications biology, 2(1), p.218.
- [Bepler et al, 2020] - Bepler, T., Kelley, K., Noble, A.J. and Berger, B., 2020. Topaz-Denoise: general deep denoising models for cryoEM and cryoET. Nature communications, 11(1), p.5208.
- [Moebel et al, 2021] - Moebel, E., Martinez-Sanchez, A., Lamm, L., Righetto, R.D., Wietrzynski, W., Albert, S., Larivière, D., Fourmentin, E., Pfeffer, S., Ortiz, J. and Baumeister, W., 2021. Deep learning improves macromolecule identification in 3D cellular cryo-electron tomograms. Nature methods, 18(11), pp.1386-1394.
- [Peck et al, 2024] - Peck, A., Yu, Y., Schwartz, J., Cheng, A., Ermel, U., Kandel, S., Kimanius, D., Montabana, E., Serwas, D., Siems, H., Wang, F., Zhao, Z., Zheng, S., Haury, M., Agard, D., Potter, C., Carragher, B., Harrington, K., Paraan, M., 2024. Annotating CryoET Volumes: A Machine Learning Challenge. bioRxiv, doi:10.1101/2024.11.04.621686v1.
- [Harrington et al, 2024a] - Harrington, K., Zhao, Z., Schwartz, J., Kandel, S., Ermel, U., Paraan, M., Potter., C., and Carragher, B., 2024. Open-source Tools for CryoET Particle Picking Machine Learning Competitions. NeurIPS MLSB Workshop, doi:10.1101/2024.11.04.621608v1.
- [Harrington et al, 2024b] - Kyle Harrington*, Mohammadreza Paraan*, Anchi Cheng, Utz Heinrich Ermel, Saugat Kandel, Dari Kimanius, Elizabeth Montabana, Ariana Peck, Jonathan Schwartz, Daniel Serwas, Hannah Siems, Feng Wang, Yue Yu, Zhuowen Zhao, Shawn Zheng, Walter Reade, Maggie Demkin, Kristen Maitland, Dannielle McCarthy, Matthias Haury, David Agard, Clinton Potter, and Bridget Carragher, 2024. CZ Imaging Institute - CryoET Object Identification. https://kaggle.com/competitions/czii-cryo-et-object-identification, Unpublished. Kaggle.