BPD (Biological Particle Detector)

Version v1.0 released 01 Apr 2025

License

Repository

https://github.com/y284/biological-particle-detector

BPD is a computer vision model designed to localize proteins in 3D cellular images obtained through cryo-electron tomography (cryoET). It is based on a U-Net architecture and trained on experimental cryoET runs provided by the CZII CryoET Object Identification Kaggle competition. BPD represents the fifth-place solution of the competition.

Developed By

Youssef Ouertani

Get Started with Model

Model Details

Model Architecture

BPD uses a 3D U-Net with two downsampling and two upsampling steps. The resulting three resolution levels use 28, 32, and 36 feature channels. Before each resolution change, a block of two 3D convolutions (each followed by BatchNorm and ReLU) extracts features. Trilinear interpolation performs both the downsampling and the upsampling.

Note: The final model consists of an ensemble of 4 identical 3D U-Nets (as described above) trained with different random seeds.

Parameters

350K × 4 (about 350K parameters per U-Net, four U-Nets in the ensemble)
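As a rough sanity check on the order of magnitude, the parameter count of one U-Net can be estimated from the architecture description. The sketch below is not the author's code. It assumes 3×3×3 kernels, convolution biases, a single input channel, skip connections concatenated in the decoder, and a 1×1×1 output head with 7 classes. None of these details are stated in this card, so the total is only expected to land in the same few-hundred-thousand range as the reported figure, not to match it.

```python
def conv3d_params(c_in, c_out, kernel=3):
    """Weights plus biases of one 3D convolution (bias is an assumption)."""
    return c_in * c_out * kernel**3 + c_out


def batchnorm_params(channels):
    """Learnable scale and shift of one BatchNorm layer."""
    return 2 * channels


def double_conv_params(c_in, c_out):
    """Two convolutions, each followed by BatchNorm (ReLU has no parameters)."""
    return (
        conv3d_params(c_in, c_out) + batchnorm_params(c_out)
        + conv3d_params(c_out, c_out) + batchnorm_params(c_out)
    )


def unet_params(in_channels=1, widths=(28, 32, 36), n_classes=7):
    """Estimate parameters of one U-Net under the assumptions listed above."""
    total = 0
    prev = in_channels
    # Encoder: one double-conv block per resolution level.
    for width in widths:
        total += double_conv_params(prev, width)
        prev = width
    # Decoder: upsampled features concatenated with the skip connection.
    for skip in reversed(widths[:-1]):
        total += double_conv_params(prev + skip, skip)
        prev = skip
    # Hypothetical 1x1x1 classification head.
    total += conv3d_params(prev, n_classes, kernel=1)
    return total


per_model = unet_params()
print(f"one U-Net: {per_model:,} parameters, ensemble of 4: {4 * per_model:,}")
```

Interpolation-based resampling adds no parameters, which is why only the convolution and BatchNorm layers appear in the count.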

Citation

Peck, A., et al., (2025) A Realistic Phantom Dataset for Benchmarking Cryo-ET Data Annotation. Nature Methods. DOI: 10.1101/2024.11.04.621686

Primary Contact Email

Youssef Ouertani ouertaniyoussef@yahoo.fr

To submit feature requests or report issues with the model, please open an issue on the GitHub repository.

System Requirements

The algorithm needs an NVIDIA GPU with CUDA to run at a reasonable speed, particularly for training. The model was trained on an NVIDIA P100 GPU. On other GPUs, some parameter values (e.g., patch and batch sizes) may need to be adjusted to fit the available memory.

Intended Use

Primary Use Cases

Localization of protein complexes within tomograms

Out-of-Scope or Unauthorized Use Cases

Do not use the model for the following purposes:

Use that violates applicable laws or regulations (including trade compliance laws), or third party rights such as privacy or intellectual property rights.
Any use that is prohibited by the Apache-2.0 license.
Any use that is prohibited by the Acceptable Use Policy.

Training Details

Training Data

The training data was provided by the Chan Zuckerberg Imaging Institute (CZII) and included seven experimental runs with ground truth annotations for six protein complexes (apo-ferritin, beta-amylase, beta-galactosidase, cytosolic ribosomes, thyroglobulin, and virus-like particles).

Training Procedure

The model was trained on 3D tomogram volumes with spherical labels (radius = log2(given_radius)*0.8). Volumes were normalized with min-max scaling, using the 5th and 99th percentiles averaged across all 7 tomograms as the bounds. Each epoch consisted of 1024 randomly sampled 128×128×128 patches (batches of 4), with data augmentation including flipping, z-axis rotations (90°/180°/270°), and ±3% intensity shifts. Training ran for 35 epochs (4 hours total) using Adam (lr=0.0001, β₁=0.9, β₂=0.999) with fp16 mixed precision, gradient clipping, and label-smoothed cross-entropy (smoothing=0.01).
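The label radius formula and the percentile-based normalization can be sketched in plain Python. This is an illustrative reading of the description above, not the training code. The function names, the flat-list tomogram layout, and the example radius of 64 are hypothetical. Whether values outside the bounds are clipped is not stated in this card, so the sketch leaves them unclipped.

```python
import math
import statistics


def label_radius(given_radius):
    """Radius of the spherical training label: log2(given_radius) * 0.8."""
    return math.log2(given_radius) * 0.8


def averaged_percentile_bounds(tomograms, low=5, high=99):
    """Average the per-tomogram low/high percentiles across all tomograms.

    Each tomogram is given as a flat list of voxel intensities
    (a hypothetical layout chosen to keep the sketch dependency-free).
    """
    lows, highs = [], []
    for voxels in tomograms:
        # quantiles(n=100) returns 99 cut points; index p - 1 is the p-th percentile.
        cuts = statistics.quantiles(voxels, n=100, method="inclusive")
        lows.append(cuts[low - 1])
        highs.append(cuts[high - 1])
    return statistics.mean(lows), statistics.mean(highs)


def minmax_scale(voxels, lo, hi):
    """Map lo to 0 and hi to 1; values outside [lo, hi] are left unclipped."""
    return [(v - lo) / (hi - lo) for v in voxels]


# Toy data standing in for two tomograms.
toy_tomograms = [list(range(101)), [x + 10 for x in range(101)]]
lo, hi = averaged_percentile_bounds(toy_tomograms)
scaled = minmax_scale(toy_tomograms[0], lo, hi)
print(f"label radius for a given radius of 64: {label_radius(64):.2f}")
print(f"shared bounds: lo={lo}, hi={hi}")
```

Using one pair of bounds for every tomogram keeps intensities on a common scale across runs, which is the apparent purpose of averaging the percentiles rather than normalizing each volume independently.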

Training Code

Kaggle Competition Notebook

Data Sources

Training data is available through the CZII - CryoET Object Identification Challenge deposition site.

Performance Metrics

Metrics

The model was evaluated with the F-beta metric at beta = 4, which prioritizes recall over precision: missed particles are penalized heavily, while false positives are treated more leniently. In this context, a particle is considered "true" if it lies within a factor of 0.5 of the particle of interest's radius. There are five particles of interest: three "easy" particles (ribosome, virus-like particles, and apo-ferritin) with a weight of 1, and two "hard" particles (thyroglobulin and β-galactosidase) with a weight of 2. Results are micro-averaged across tomograms, so precision and recall are computed over the entire dataset before the F-beta formula is applied. Together, the high beta value and the particle weights make recall the dominant factor in the score, particularly for the "hard" particles.
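The scoring described above can be sketched as follows. This is a minimal illustration, not the official competition scorer. The counts in the example are made up, and combining particle types as a weighted mean of per-type F-beta scores is an assumption about how the weights are applied.

```python
import math


def is_match(pred, truth, particle_radius, factor=0.5):
    """A prediction counts as "true" within factor * radius of a ground-truth point."""
    return math.dist(pred, truth) <= factor * particle_radius


def fbeta(tp, fp, fn, beta=4.0):
    """F-beta from counts: (1 + b^2) * TP / ((1 + b^2) * TP + b^2 * FN + FP)."""
    b2 = beta**2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom > 0 else 0.0


def weighted_fbeta(counts, weights, beta=4.0):
    """Weighted mean of per-type F-beta scores (the combination rule is assumed).

    counts maps particle type -> (tp, fp, fn), summed over all tomograms
    so that each per-type score is micro-averaged.
    """
    total_weight = sum(weights[name] for name in counts)
    weighted_sum = sum(
        weights[name] * fbeta(*counts[name], beta=beta) for name in counts
    )
    return weighted_sum / total_weight


# Hypothetical counts for one "easy" and one "hard" particle type.
example_counts = {"ribosome": (90, 30, 10), "thyroglobulin": (40, 20, 20)}
example_weights = {"ribosome": 1, "thyroglobulin": 2}
print(f"weighted F-beta: {weighted_fbeta(example_counts, example_weights):.4f}")
```

With beta = 4, each false negative costs 16 times as much as a false positive in the denominator, which is why a detector tuned for this metric tends to favor lower confidence thresholds.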

Evaluation Datasets

The evaluation datasets included public and private test datasets found in the CryoET Data Portal deposition site for the CZII CryoET Object Identification Kaggle competition. The public and private test datasets contain 121 and 364 experimental runs, respectively.

Evaluation Results

Public Score	Private Score
0.77982	0.78252

Biases, Risks, and Limitations

Potential Biases

The model was trained on five particle types and is not expected to detect particle types absent from the training data.

Risks

Areas of risk may include but are not limited to:

Inaccurate outputs or hallucinations
Incorrect predictions

Limitations

The model's performance may be limited by the size of the training set.

Caveats and Recommendations

Review and validate outputs generated by the model.
We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when using the model.

Acknowledgements

This research is supported by the Chan Zuckerberg Imaging Institute.

If you have recommendations for this model card please contact virtualcellmodels@chanzuckerberg.com.
