Quickstart: MonjuDetectHM
Estimated time to complete: 15 minutes
Learning Goals
By the end of this quickstart, you will learn how to:
- Load and run inference with a pre-trained 3D particle detection model for Cryo-ET data
- Understand the input/output format for particle detection models
Prerequisites
Software Requirements
- Python 3.12
- GPU with CUDA support
- uv package manager (installable by running pip install uv)
Google Colab Requirements
- Standard GPU runtime (T4 or better recommended)
Libraries
- All dependencies listed in pyproject.toml (automatically installed by uv sync)
Data Requirements
- Dataset compatible with copick
- Pre-packaged MLflow model
(Both will be downloaded below)
Introduction
This quickstart demonstrates how to use MonjuDetectHM, a pre-trained 3D deep learning model for detecting biological particles in cryo-electron tomography (cryo-ET) volumes. The model can identify 5 different particle types: apo-ferritin, beta-galactosidase, ribosome, thyroglobulin, and virus-like particles.
Note the following regarding data:
Demo data and preprocessing: The demo data we are using for this quickstart is the dataset used in the CryoET-Object Identification Competition. CoPick handles data loading and the MLflow model handles preprocessing.
Input: The model uses 3D cryo-ET volumes in CoPick format as input.
Output: The model output reflects particle detection coordinates and confidence scores for each particle class.
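To make the output format concrete, here is a small, self-contained sketch that builds a mock result in the layout the model returns and filters out detections below a confidence cutoff. The nested layout mirrors the real output shown later in this quickstart; the coordinate values and the cutoff of 3.0 are illustrative only.

```python
import numpy as np

# Mock model output: {run_name: {particle_class: {'points', 'confidence'}}}
# Same nested layout as the real model output; the values are made up.
results = {
    'TS_5_4': {
        'apo-ferritin': {
            'points': np.array([[5476.5, 530.6, 270.3],
                                [6037.2, 3484.1, 310.3],
                                [5386.4, 2893.4, 340.4]], dtype=np.float32),
            'confidence': np.array([4.50, 1.96, 3.80], dtype=np.float32),
        }
    }
}

# Keep only detections above an illustrative confidence cutoff
cutoff = 3.0
for run, classes in results.items():
    for cls, det in classes.items():
        keep = det['confidence'] > cutoff
        det['points'] = det['points'][keep]
        det['confidence'] = det['confidence'][keep]

print(results['TS_5_4']['apo-ferritin']['points'].shape)  # (2, 3)
```

The points array is N x 3 (coordinates in Angstroms) and the confidence array has length N, so a boolean mask can index both in lockstep.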
Setup
Prepare Dataset
We will be using the script from https://copick.github.io/copick/examples/tutorials/kaggle_czii_sync/ to prepare the dataset that we will be using in this notebook.
!uv pip install "copick[all]>=1.10.0" "cryoet-data-portal"

import copick
from copick.ops.sync import sync_tomograms, sync_picks
import cryoet_data_portal as cdp
# Step 1: Establish connection and name mappings
# =============================================
# Connect to the CryoET Data Portal
client = cdp.Client()
# Retrieve all runs from dataset 10440 (CZII competition dataset)
runs = cdp.Run.find(client, [cdp.Run.dataset_id == 10440])
# Create mapping from portal run IDs to Kaggle-compatible run names
# Copick normally uses run IDs by default because they are unique,
# while run names may not be unique across multiple cryoET data portal datasets.
portal_runs_to_kaggle_runs = {str(r.id): r.name for r in runs}
# Map portal object names to competition object names
# The portal uses scientific names (Gene Ontology Term/UniProtKB accession),
# while the Kaggle competition uses simplified names
portal_objects_to_kaggle_objects = {
    "beta-galactosidase": "beta-galactosidase",
    "cytosolic-ribosome": "ribosome",
    "virus-like-capsid": "virus-like-particle",
    "ferritin-complex": "apo-ferritin",
    "beta-amylase": "beta-amylase",
    "thyroglobulin": "thyroglobulin",
}
# Step 2: Configure source data access
# ====================================
# Create Copick root pointing to CryoET Data Portal dataset
# The '/tmp/overlay' path won't store anything - it's just required by the API
portal_root = copick.from_czcdp_datasets([10440], '/tmp/overlay')
# Extract and rename pickable objects for the target dataset
objects = []
for obj in portal_root.config.pickable_objects:
    if obj.name in portal_objects_to_kaggle_objects:
        # Create a copy with the Kaggle-compatible name
        kaggle_obj = obj.copy()
        kaggle_obj.name = portal_objects_to_kaggle_objects[obj.name]
        objects.append(kaggle_obj)
# Step 3: Create target dataset structure
# =======================================
# Create new Copick project with Kaggle-compatible structure
# This will store the synchronized data locally.
# Update the paths as needed for your environment.
target_root = copick.new_config(
    '/tmp/czcdp_dataset_demo/copick_config.json',  # Configuration file path
    '/tmp/czcdp_dataset_demo/',  # Data storage directory
    pickable_objects=objects  # Object definitions with correct names
)
# Step 4: Sync tomographic data
# =============================
# Copick constructs the tomogram type from data portal metadata. The competition used
# simplified names for tomograms, so we map the portal's processed tomogram type as well.
sync_tomograms(
    portal_root,  # Source: CryoET Data Portal
    target_root,  # Target: Local Copick project
    source_runs=list(portal_runs_to_kaggle_runs.keys()),  # All available runs
    target_runs=portal_runs_to_kaggle_runs,  # Run name mapping
    voxel_spacings=[10.012],  # Competition voxel size
    source_tomo_types=["wbp-denoised-denoiset-ctfdeconv"],  # Tomogram type to sync
    target_tomo_types={"wbp-denoised-denoiset-ctfdeconv": "denoised"},  # Mapping to simplified name
    log=True,  # Show progress
    exist_ok=True,  # Allow overwriting
)
# Step 5: Sync annotation data
# ============================
sync_picks(
    portal_root,  # Source: CryoET Data Portal
    target_root,  # Target: Local Copick project
    source_runs=list(portal_runs_to_kaggle_runs.keys()),  # All available runs
    target_runs=portal_runs_to_kaggle_runs,  # Run name mapping
    source_objects=list(portal_objects_to_kaggle_objects.keys()),  # Portal object names
    target_objects=portal_objects_to_kaggle_objects,  # Kaggle object name mapping
    log=True,  # Show progress
    exist_ok=True,  # Allow overwriting
)
Clone Repository & Install
## Setup Environment
# Clone the repository
print('Cloning repository...')
!git clone https://github.com/kobakos/MonjuDetectHM.git
# Install dependencies
!cd MonjuDetectHM && uv sync
!cd MonjuDetectHM && uv pip install -e .

# Reset the directory
%cd /content/MonjuDetectHM/
Download Model Files
The script below downloads the monjudetecthm_mlflow.tar.gz file from https://drive.google.com/drive/folders/1hu1K1hGAkn-l-Mmn53P_XumvTtVEdrTj?usp=sharing and extracts it.
!gdown --id 1ZWs8VLXvicXQhahVBE72CxVzxD_MIcFo
!tar -zxvf monjudetecthm_mlflow.tar.gz
!rm monjudetecthm_mlflow.tar.gz
Run Model Inference
Now we will run inference using the pre-packaged MLflow model. We will use copick to manage our data using the configuration generated in a previous step (see script under Prepare Dataset).
Estimated time to run: about 2 minutes on a T4 runtime
# Define the path to the MLflow model
model_path = 'monjudetecthm_mlflow'
copick_config_path = '/tmp/czcdp_dataset_demo/copick_config.json'

import copick
import mlflow
from pathlib import Path
import pandas as pd
# Check if the model path exists
if not Path(model_path).exists():
    print(f"Model not found at {model_path}")
else:
    print(f"Loading model from {model_path}")
    # Load the MLflow model
    model = mlflow.pyfunc.load_model(model_path)

    # Load the CoPick root from the configuration defined above
    print(f"Loading CoPick root from {copick_config_path}")
    copick_root = copick.from_file(copick_config_path)

    if not copick_root.runs:
        print("No runs found in the CoPick root. Please check your data configuration.")
    else:
        # Select the first experiment to run inference on
        experiment_id = copick_root.runs[0].name
        print(f"Running inference on experiment: {experiment_id}")

        # Prepare the model input
        model_input = {
            'copick_root': copick_root,
            'experiment_id': experiment_id,
            'voxel_spacing': 10.012,
            'threshold': 0.3
        }

        # Run inference
        # This might take a few minutes
        results = model.predict(model_input)
        print("\ninference results: ", results)

Output:
Loaded ensemble of 3 models on cuda
Loading CoPick root from /tmp/czcdp_dataset_demo/copick_config.json
Running inference on experiment: TS_5_4
inference results: {'TS_5_4': {'apo-ferritin': {'points': array([[5476.564 , 530.636 , 270.324 ],
[6037.236 , 3484.176 , 310.372 ],
[5386.456 , 2893.468 , 340.408 ],
[2312.772 , 440.528 , 500.6 ],
[1241.488 , 1221.464 , 620.744 ],
[2573.084 , 2182.616 , 640.768 ],
[3023.624 , 2853.42 , 690.828 ],
[ 981.176 , 2242.688 , 720.864 ],
[1581.896 , 1231.476 , 760.912 ],
[4645.568 , 2983.576 , 770.924 ],
[1491.788 , 1361.632 , 780.93604],
[1651.98 , 1181.416 , 810.972 ],
[1461.752 , 1251.5 , 810.972 ],
[ 780.93604, 5566.672 , 851.02 ],
[2382.856 , 4525.424 , 891.068 ],
[1501.8 , 1371.644 , 901.08 ],
[1712.052 , 5126.144 , 951.14 ],
[3974.764 , 1832.196 , 981.176 ],
[5236.276 , 3143.768 , 1021.224 ],
[1641.968 , 5036.036 , 1021.224 ],
[1772.124 , 4945.928 , 1061.272 ],
...
[5436.516 , 1011.21204, 1331.5961 ],
[3173.804 , 2723.264 , 1461.752 ]], dtype=float32), 'confidence': array([4.500285 , 4.1581726, 3.801238 , 4.2265325, 3.8788574, 3.6727734,
4.156523 , 4.516871 , 4.4099774, 3.5110698, 1.9641333, 3.3476985],
dtype=float32)}}}

Model Outputs
The model returns a dictionary containing the detected particles for each experiment. The keys of the dictionary are the experiment IDs. The values are dictionaries where keys are particle class names and values are dictionaries containing the detected points and their confidence scores.
Here is an example of the output structure:
{
'experiment_id_1': {
'apo-ferritin': {
'points': array([[x1, y1, z1], [x2, y2, z2], ...]),
'confidence': array([c1, c2, ...])
},
'beta-galactosidase': {
'points': array([[x3, y3, z3], ...]),
'confidence': array([c3, ...])
},
...
}
}

This dictionary can be converted into a pandas DataFrame for easier inspection, with the following columns:
- id: A unique identifier for each detected particle.
- experiment: The experiment ID from which the particle was detected.
- particle_type: The predicted class of the particle.
- x, y, z: The coordinates of the detected particle in Angstroms.
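As a sketch, such a conversion might look like the following; the helper name results_to_dataframe is our own, not part of the MonjuDetectHM API, and the mock result values are illustrative:

```python
import numpy as np
import pandas as pd

def results_to_dataframe(results):
    """Flatten the nested {experiment: {class: {'points', 'confidence'}}}
    output into one DataFrame row per detected particle."""
    rows = []
    for experiment, particles in results.items():
        for particle_type, data in particles.items():
            for x, y, z in data['points']:
                rows.append({
                    'experiment': experiment,
                    'particle_type': particle_type,
                    'x': float(x),  # coordinates in Angstroms
                    'y': float(y),
                    'z': float(z),
                })
    df = pd.DataFrame(rows)
    df.insert(0, 'id', range(len(df)))  # unique id per detection
    return df

# Mock results in the documented output format
results = {
    'TS_5_4': {
        'apo-ferritin': {
            'points': np.array([[5476.5, 530.6, 270.3],
                                [6037.2, 3484.1, 310.3]], dtype=np.float32),
            'confidence': np.array([4.50, 4.16], dtype=np.float32),
        }
    }
}
df = results_to_dataframe(results)
print(df.columns.tolist())  # ['id', 'experiment', 'particle_type', 'x', 'y', 'z']
```

One row per detection keeps the table in the same shape as the competition submission format, which also lists one particle per row.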
Contact and Acknowledgments
For issues with this quickstart, please contact Koki Kobayashi at koki-kobayashi@outlook.jp.
This model and quickstart were developed by Koki Kobayashi.
Special thanks to the CZI team for their support in developing this repository.
References
- CZI Cryo-ET Object Identification Kaggle Competition: https://www.kaggle.com/competitions/czii-cryo-et-object-identification
- MonjuDetectHM GitHub Repository: https://github.com/kobakako/MonjuDetectHM
- Copick: https://copick.github.io/copick/
- CryoET Data Portal: https://cryoetdataportal.czscience.com/
Responsible Use
We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when engaging with our services.