Octopi

Version v1.0.0 released 14 Nov 2025

License

Repository

https://github.com/chanzuckerberg/octopi

The Object deteCTion Of ProteIns (Octopi) model is a deep learning framework for multi-class particle picking in cryo-electron tomography (cryoET). Trained on the CZ Imaging Institute phantom dataset, it identifies six molecular species: apoferritin, β-amylase, β-galactosidase, 80S ribosomes, thyroglobulin, and virus-like particles. The model was developed as part of a Kaggle ML challenge to benchmark cryoET annotation algorithms.

Developed By

Jonathan Schwartz¹

¹ Chan Zuckerberg Imaging Institute

Try Model with Demo Dataset

Model Details

Model Architecture

Octopi uses a U-Net architecture with 6 encoder-decoder levels optimized for multi-class particle picking in cryo-electron tomography. The U-Net architecture was chosen for its proven effectiveness in biomedical image segmentation tasks and its ability to capture both local and global context through its encoder-decoder structure with skip connections.

Architecture specifications:

Base architecture: U-Net (modified from MONAI implementation)
Number of classes: 7 (6 particle types + background)
Channel progression: [32, 64, 128, 128, 128, 128]
Strides: [2, 2, 1, 1, 1]
Residual units per level: 1
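
As a quick consistency check on the specifications above, the following minimal sketch computes the cumulative downsampling implied by the strides. The dictionary keys follow the keyword arguments of MONAI's `UNet` class (`channels`, `strides`, `num_res_units`, `out_channels`); since Octopi's network is described as modified from MONAI, the actual configuration may differ. The helper `bottleneck_size` is hypothetical, and the 112-voxel patch size is taken from the Training Procedure section.

```python
from math import prod

# Values copied from the specification list above. The key names follow the
# keyword arguments of MONAI's UNet class, but Octopi's modified network may
# be configured differently.
unet_config = {
    "channels": (32, 64, 128, 128, 128, 128),
    "strides": (2, 2, 1, 1, 1),
    "num_res_units": 1,
    "out_channels": 7,  # 6 particle types + background
}


def bottleneck_size(patch_size, channels, strides):
    """Spatial size at the deepest level for a cubic patch of `patch_size` voxels."""
    if len(strides) != len(channels) - 1:
        raise ValueError("expected one stride per transition between levels")
    factor = prod(strides)  # cumulative downsampling factor
    if patch_size % factor != 0:
        raise ValueError(f"patch size must be divisible by {factor}")
    return patch_size // factor


size = bottleneck_size(112, unet_config["channels"], unet_config["strides"])
print(f"112^3 input patch -> {size}^3 feature map at the deepest level")
```

Only the first two strides downsample, so the deepest feature maps are a factor of 4 smaller than the input along each axis, and input patch sizes should be multiples of 4.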

Parameters

5,029,192

Model Card Authors

Jonathan Schwartz (CZII)

Citation

Peck, A., Yu, Y., Schwartz, J. et al. A realistic phantom dataset for benchmarking cryo-ET data annotation. Nat Methods 22, 1819–1823 (2025). DOI: 10.1038/s41592-025-02800-5

Primary Contact Email

jonathan.schwartz@czii.org

To submit feature requests or report issues with the model, please open an issue on the GitHub repository.

System Requirements

Inference and training on cryoET tomograms require a CUDA-capable GPU (NVIDIA T4 or better).

Intended Use

Primary Use Cases

Octopi is designed for automated particle picking and annotation in cryoET datasets. Specific use cases include:

Multi-class particle identification: Simultaneously identifying and localizing six molecular species (apoferritin, β-amylase, β-galactosidase, 80S ribosomes, thyroglobulin, and virus-like particles) in cryoET tomograms.
Image segmentation: Generating 3D segmentation masks for molecular complexes in tomographic volumes.
Feature extraction: Learning representations of molecular structures in their native cellular context.
Benchmark evaluation: Serving as a baseline model for comparing cryoET annotation algorithms on the CZ Imaging Institute phantom dataset.

Out-of-Scope or Unauthorized Use Cases

Do not use the model for the following purposes:

Use that violates applicable laws, regulations (including trade compliance laws), or third party rights such as privacy or intellectual property rights.
Any use that is prohibited by the MIT license.
Any use that is prohibited by the Acceptable Use Policy.

Training Data

The model was trained on the CZ Imaging Institute Phantom Dataset, an experimental cryoET dataset designed specifically for benchmarking particle-picking algorithms and hosted on the CryoET Data Portal. The phantom dataset includes experimental training data (ID: DS-10440) as well as public (ID: DS-10445) and private (ID: DS-10446) test datasets. The training data consists of 6 tomograms from the phantom dataset, with 1 held out for validation.

Dataset characteristics:

Experimental tomograms: 7 high-quality tomograms from dataset runs 16465, 16466, 16468, 16469, 16464, 16467.
Imaging conditions: Collected on Krios G4 with Falcon 4i detector, ±45° tilt range.
Sample composition: Mixture of purified proteins in lysosome-enriched lysate creating realistic cellular crowding and ~200 nm sample thickness.

Particle classes:

Apoferritin (450 kDa, octahedral symmetry) - ~23,393 particles
β-Amylase (268 kDa, D2 symmetry) - ~2,464 particles
β-Galactosidase (540 kDa, D2 symmetry) - ~3,113 particles
80S Ribosome (4.3 MDa, monomer) - ~24,338 particles
Thyroglobulin (660 kDa, homodimer) - ~6,211 particles
Virus-like particles (3.4 MDa, icosahedral) - ~3,022 particles

Training Procedure

Data preprocessing:

Volume patches of 112³ voxels extracted centered on annotated particles.
Data augmentation: random rotations, intensity scaling, and additive Gaussian noise.
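
The patch extraction step can be sketched as below. This is an illustration under stated assumptions, not Octopi's actual code: the function name `centered_patch_bounds` is hypothetical, the example volume shape is made up, and shifting the patch back inside the volume near the borders is only one possible policy (padding is another); the model card does not say how borders are handled.

```python
def centered_patch_bounds(center, volume_shape, patch_size=112):
    """Return per-axis (start, stop) index pairs for a cubic patch centered on
    `center`, shifted where necessary so that it stays inside the volume."""
    bounds = []
    for c, dim in zip(center, volume_shape):
        if dim < patch_size:
            raise ValueError("volume is smaller than the patch along one axis")
        start = int(round(c)) - patch_size // 2
        start = max(0, min(start, dim - patch_size))  # clamp to the volume
        bounds.append((start, start + patch_size))
    return bounds


# Hypothetical tomogram shape (z, y, x) and particle coordinates in voxels.
volume_shape = (184, 630, 630)
interior = centered_patch_bounds((100, 315, 315), volume_shape)
near_edge = centered_patch_bounds((10, 5, 620), volume_shape)
print(interior)
print(near_edge)
```

Each returned pair can be turned into a slice to crop the tomogram and its label volume with identical bounds.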

Training strategy:

The model is trained on volumetric patches extracted from tomograms and centered on annotated particles. Data augmentation includes random 3D rotations, intensity scaling, and additive Gaussian noise to improve generalization. Validation is performed using sliding window inference to handle full-resolution volumes. Training employs Exponential Moving Average (EMA) with a cosine annealing learning rate scheduler. The model is optimized to maximize class-averaged F-beta scores tracked throughout training. Early stopping monitors both training loss stability and validation metrics, with the best-performing model checkpoint saved based on the target metric.
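
For readers unfamiliar with the two techniques named above, here is a minimal sketch of a cosine annealing schedule and an EMA update. Several details are assumptions because the model card does not state them: the schedule is stepped once per epoch, the minimum learning rate is 0, the EMA is applied to model weights, and the decay value of 0.99 is illustrative. The function names are hypothetical.

```python
import math


def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate that decays from lr_max to lr_min along a half cosine."""
    cos_term = 1.0 + math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term


def ema_update(ema_weights, weights, decay=0.99):
    """Blend the current weights into their exponential moving average."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


# Learning rate and epoch count taken from the Training Hyperparameters list.
LR_MAX, EPOCHS = 2.796e-6, 500
for epoch in (0, 250, 500):
    print(epoch, cosine_annealing_lr(epoch, EPOCHS, LR_MAX))

# Toy example: the average moves 1% of the way toward the new weights.
print(ema_update([0.0, 1.0], [1.0, 1.0]))
```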

Loss function:

Focal Tversky Loss with α=0.152 and γ=1.889 to handle class imbalance and focus learning on difficult examples.
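
A minimal single-class sketch of a Focal Tversky loss with these values follows. Conventions differ between implementations, so the choices here are assumptions rather than Octopi's definition: α weights false positives, false negatives are weighted by 1 − α, and the focal exponent is applied as (1 − TI)^γ (some formulations use 1/γ instead). The function name is hypothetical.

```python
def focal_tversky_loss(probs, targets, alpha=0.152, gamma=1.889, eps=1e-7):
    """Focal Tversky loss for one class over flat lists of probabilities and
    binary targets. Assumes alpha weights false positives and (1 - alpha)
    weights false negatives."""
    tp = sum(p * t for p, t in zip(probs, targets))
    fp = sum(p * (1 - t) for p, t in zip(probs, targets))
    fn = sum((1 - p) * t for p, t in zip(probs, targets))
    beta = 1.0 - alpha
    tversky = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return (1.0 - tversky) ** gamma


targets = [1, 1, 0, 0]
perfect = focal_tversky_loss([1, 1, 0, 0], targets)
one_miss = focal_tversky_loss([1, 0, 0, 0], targets)         # one false negative
one_false_alarm = focal_tversky_loss([1, 1, 1, 0], targets)  # one false positive
print(perfect, one_miss, one_false_alarm)
```

Under this convention, α = 0.152 penalizes a missed voxel much more than a spurious one, which matches the recall-oriented evaluation metric described later in this card.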

Training Code

Training scripts and configuration files are available in the GitHub repository: https://github.com/chanzuckerberg/octopi.

Speeds, Sizes, Times

Model Checkpoint Size: 20 MB
Inference Speed: 500 tomograms per hour.

Training Hyperparameters

Optimizer: Adam
Learning rate: 2.796e-6
Loss function: Focal Tversky Loss
- Alpha parameter: 0.152
- Gamma parameter: 1.889
Dropout: 0.23
Total epochs: 500

Data Sources

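The following datasets were used for training and evaluation:

CZ Imaging Institute phantom dataset, experimental training data (ID: DS-10440)
CZII - CryoET Object Identification Challenge Public Test Dataset (ID: DS-10445)
CZII - CryoET Object Identification Challenge - Private Test Dataset (ID: DS-10446)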

Performance Metrics

Metrics

The model was evaluated with the Fβ score, the designated metric of the Kaggle CryoET Object Identification Challenge, using β = 4 and a distance threshold scale of 0.5. Setting β = 4 weights recall far more heavily than precision, which accounts for potential incompleteness in the ground-truth annotations. The distance threshold ensures that predicted coordinates are not just present but accurately centered on the macromolecule.
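
For reference, the standard definition of the Fβ score in terms of precision P and recall R is given below. This is the textbook formula, written out for β = 4; it is not taken from the challenge's scoring code.

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_\beta = \frac{(1 + \beta^2)\, P\, R}{\beta^2 P + R}
\quad\Longrightarrow\quad
F_4 = \frac{17\, P\, R}{16\, P + R}
```

With β = 4, the denominator is dominated by the precision term, so a class with high recall retains a high score even when its precision is modest.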

Evaluation Datasets

CZII - CryoET Object Identification Challenge Public Test Dataset (ID: DS-10445): https://cryoetdataportal.czscience.com/datasets/10445

CZII - CryoET Object Identification Challenge - Private Test Dataset (ID: DS-10446): https://cryoetdataportal.czscience.com/datasets/10446

Evaluation Results

Particle class	Precision	Recall	Fβ-score
Apoferritin	0.561	0.958	0.92
β-Amylase	0.091	0.635	0.47
β-Galactosidase	0.161	0.745	0.614
80S Ribosome	0.467	0.945	0.891
Thyroglobulin	0.188	0.754	0.641
VLP	0.593	0.981	0.944

The final weighted score for this model is 0.752.
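
The per-class Fβ values in the table can be recomputed from the listed precision and recall with the standard Fβ formula at β = 4, as sketched below. The class weights (1 for apoferritin, 80S ribosome, and VLP; 2 for β-galactosidase and thyroglobulin; 0 for β-amylase) are an assumption based on the challenge's scoring scheme and are not stated in this model card. The official metric may also aggregate classes differently from the simple weighted mean used here.

```python
BETA = 4.0

# (precision, recall, reported F-beta) copied from the table above.
results = {
    "Apoferritin": (0.561, 0.958, 0.920),
    "β-Amylase": (0.091, 0.635, 0.470),
    "β-Galactosidase": (0.161, 0.745, 0.614),
    "80S Ribosome": (0.467, 0.945, 0.891),
    "Thyroglobulin": (0.188, 0.754, 0.641),
    "VLP": (0.593, 0.981, 0.944),
}

# Assumed class weights; not given in this model card.
weights = {
    "Apoferritin": 1, "β-Amylase": 0, "β-Galactosidase": 2,
    "80S Ribosome": 1, "Thyroglobulin": 2, "VLP": 1,
}


def f_beta(precision, recall, beta=BETA):
    """Standard F-beta score; returns 0.0 when precision and recall are both 0."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


computed = {name: f_beta(p, r) for name, (p, r, _) in results.items()}
weighted = sum(weights[n] * computed[n] for n in computed) / sum(weights.values())

for name, (_, _, reported) in results.items():
    print(f"{name}: computed {computed[name]:.3f}, reported {reported:.3f}")
print(f"weighted mean: {weighted:.3f}")
```

Under these assumptions the recomputed values agree with the table to within rounding, and the weighted mean comes out close to the reported 0.752.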

Biases, Risks, and Limitations

Potential Biases

Training data bias: The model was trained exclusively on the phantom dataset, which represents a simplified experimental system compared to actual cellular environments. The particle distribution, crowding level, and sample thickness may not fully represent cellular tomograms.
Particle size bias: The model may perform better on larger, higher-contrast particles (ribosomes, VLPs, apoferritin) than on smaller particles (β-amylase), in part because of class imbalance in the training data.
Imaging condition bias: Trained on data from a single microscope (Krios G4 with Falcon 4i) using specific imaging parameters. Performance may vary on data from different instruments or acquisition schemes.
Species representation: Only six molecular species are represented, a small fraction of the cellular proteome. Performance may vary on other molecular species.

Risks

Areas of risk may include but are not limited to:

Misclassification: Particles may be detected but assigned to the wrong class, especially for similarly-sized species (e.g., β-galactosidase vs thyroglobulin).
Resolution dependency: Performance degrades on tomograms whose voxel size differs substantially from the training resolution (10 Å per voxel).

Limitations

Limited generalization: The model is specifically trained for the six particle types in the phantom dataset and will not detect other molecular species without retraining.
Computational requirements: Requires GPU hardware for practical inference speeds.

Caveats and Recommendations

Review and validate outputs generated by the model.
We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when using the model.
Should you have any security or privacy issues or questions related to the model, please reach out to our team at security@chanzuckerberg.com or privacy@chanzuckerberg.com, respectively.

Acknowledgements

We thank CZI, the Biohub Network, and the CZ Imaging Institute for their support and computing resources.
