
Quickstart: MonjuDetectHM

Estimated time to complete: 15 minutes

Learning Goals

By the end of this quickstart, you will learn how to:

  • Load and run inference with a pre-trained 3D particle detection model for Cryo-ET data
  • Understand the input/output format for particle detection models

Prerequisites

Software Requirements

  • Python 3.12
  • GPU with CUDA support
  • uv package manager (installable by running pip install uv)

Google Colab Requirements

  • Standard GPU runtime (T4 or better recommended)

Libraries

  • All dependencies listed in pyproject.toml (automatically installed by uv sync)

Data Requirements

  • Dataset compatible with copick
  • Pre-packaged mlflow model

(Both will be downloaded below)

Introduction

This quickstart demonstrates how to use MonjuDetectHM, a pre-trained 3D deep learning model for detecting biological particles in cryo-electron tomography (cryo-ET) volumes. The model can identify five particle types: apo-ferritin, beta-galactosidase, ribosome, thyroglobulin, and virus-like particles.

Note the following regarding data:

Demo data and preprocessing: The demo data for this quickstart is the dataset used in the CZII CryoET Object Identification competition. copick handles data loading, and the MLflow model handles preprocessing.

Input: The model takes 3D cryo-ET volumes in copick format as input.

Output: The model returns particle detection coordinates and confidence scores for each particle class.

Setup

Prepare Dataset

We use the script from https://copick.github.io/copick/examples/tutorials/kaggle_czii_sync/ to prepare the dataset for this notebook.

!uv pip install "copick[all]>=1.10.0" "cryoet-data-portal"
import copick
from copick.ops.sync import sync_tomograms, sync_picks
import cryoet_data_portal as cdp

# Step 1: Establish connection and name mappings
# =============================================

# Connect to the CryoET Data Portal
client = cdp.Client()

# Retrieve all runs from dataset 10440 (CZII competition dataset)
runs = cdp.Run.find(client, [cdp.Run.dataset_id == 10440])

# Create mapping from portal run IDs to Kaggle-compatible run names
# Copick normally uses run IDs by default because they are unique,
# while run names may not be unique across multiple cryoET data portal datasets.
portal_runs_to_kaggle_runs = {str(r.id): r.name for r in runs}

# Map portal object names to competition object names
# The portal uses scientific names (Gene Ontology Term/UniProtKB accession),
# while the Kaggle competition uses simplified names
portal_objects_to_kaggle_objects = {
    "beta-galactosidase": "beta-galactosidase",
    "cytosolic-ribosome": "ribosome",
    "virus-like-capsid": "virus-like-particle",
    "ferritin-complex": "apo-ferritin",
    "beta-amylase": "beta-amylase",
    "thyroglobulin": "thyroglobulin",
}

# Step 2: Configure source data access
# ====================================

# Create Copick root pointing to CryoET Data Portal dataset
# The '/tmp/overlay' path won't store anything - it's just required by the API
portal_root = copick.from_czcdp_datasets([10440], '/tmp/overlay')

# Extract and rename pickable objects for the target dataset
objects = []
for obj in portal_root.config.pickable_objects:
    if obj.name in portal_objects_to_kaggle_objects:
        # Create a copy with the Kaggle-compatible name
        kaggle_obj = obj.copy()
        kaggle_obj.name = portal_objects_to_kaggle_objects[obj.name]
        objects.append(kaggle_obj)

# Step 3: Create target dataset structure
# =======================================

# Create new Copick project with Kaggle-compatible structure
# This will store the synchronized data locally.
# Update the paths as needed for your environment.
target_root = copick.new_config(
    '/tmp/czcdp_dataset_demo/copick_config.json',  # Configuration file path
    '/tmp/czcdp_dataset_demo/',      # Data storage directory
    pickable_objects=objects         # Object definitions with correct names
)

# Step 4: Sync tomographic data
# =============================

# Copick constructs the tomogram type from data portal metadata. The competition used
# simplified names for tomograms, so we map the portal's processed tomogram type as well.
sync_tomograms(
    portal_root,                                           # Source: CryoET Data Portal
    target_root,                                           # Target: Local Copick project
    source_runs=list(portal_runs_to_kaggle_runs.keys()),   # All available runs
    target_runs=portal_runs_to_kaggle_runs,                # Run name mapping
    voxel_spacings=[10.012],                               # Competition voxel size
    source_tomo_types=["wbp-denoised-denoiset-ctfdeconv"], # Tomogram type to sync
    target_tomo_types={"wbp-denoised-denoiset-ctfdeconv": "denoised"}, # Mapping to simplified name
    log=True,                                              # Show progress
    exist_ok=True,                                         # Allow overwriting
)

# Step 5: Sync annotation data
# ============================

sync_picks(
    portal_root,                                           # Source: CryoET Data Portal
    target_root,                                           # Target: Local Copick project
    source_runs=list(portal_runs_to_kaggle_runs.keys()),   # All available runs
    target_runs=portal_runs_to_kaggle_runs,                # Run name mapping
    source_objects=list(portal_objects_to_kaggle_objects.keys()), # Portal object names
    target_objects=portal_objects_to_kaggle_objects,       # Kaggle object name mapping
    log=True,                                              # Show progress
    exist_ok=True,                                         # Allow overwriting
)
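
The filter-and-rename pattern from Step 2 can be illustrated without any downloads. Below is a minimal sketch using a hypothetical `PickableObject` dataclass as a stand-in for copick's pickable-object models (the real objects are copick config models; `dataclasses.replace` plays the role of the `.copy()` call above):

```python
from dataclasses import dataclass, replace

# Hypothetical stand-in for a copick pickable object (illustration only).
@dataclass(frozen=True)
class PickableObject:
    name: str
    radius: float

portal_objects_to_kaggle_objects = {
    "cytosolic-ribosome": "ribosome",
    "ferritin-complex": "apo-ferritin",
}

portal_objects = [
    PickableObject("cytosolic-ribosome", 150.0),
    PickableObject("ferritin-complex", 60.0),
    PickableObject("membrane", 10.0),  # not in the mapping -> dropped
]

# Same filter-and-rename pattern as Step 2: keep only mapped objects
# and give each copy the Kaggle-compatible name.
kaggle_objects = [
    replace(obj, name=portal_objects_to_kaggle_objects[obj.name])
    for obj in portal_objects
    if obj.name in portal_objects_to_kaggle_objects
]

print([o.name for o in kaggle_objects])  # ['ribosome', 'apo-ferritin']
```

Unmapped objects such as "membrane" are simply skipped, which is why the target project ends up with only the competition's particle classes.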

Clone Repository & Install

## Setup Environment
# Clone the repository
print('Cloning repository...')
!git clone https://github.com/kobakos/MonjuDetectHM.git
# Install dependencies
!cd MonjuDetectHM && uv sync
!cd MonjuDetectHM && uv pip install -e .
# Change into the repository directory (cd inside ! commands does not persist)
%cd /content/MonjuDetectHM/

Download Model Files

The script below downloads the monjudetecthm_mlflow.tar.gz file from https://drive.google.com/drive/folders/1hu1K1hGAkn-l-Mmn53P_XumvTtVEdrTj?usp=sharing and extracts it.

!gdown 1ZWs8VLXvicXQhahVBE72CxVzxD_MIcFo
!tar -zxvf monjudetecthm_mlflow.tar.gz
!rm monjudetecthm_mlflow.tar.gz

Run Model Inference

Now we will run inference using the pre-packaged MLflow model. We will use copick to manage our data using the configuration generated in a previous step (see script under Prepare Dataset).

Estimated time to run: around 2 minutes on a T4 runtime

# Define the path to the MLflow model
model_path = 'monjudetecthm_mlflow'
copick_config_path = '/tmp/czcdp_dataset_demo/copick_config.json'
import copick
import mlflow
from pathlib import Path
import pandas as pd

# Check if the model path exists
if not Path(model_path).exists():
    print(f"Model not found at {model_path}")
else:
    print(f"Loading model from {model_path}")
    # Load the MLflow model
    model = mlflow.pyfunc.load_model(model_path)

    # Load CoPick root
    print(f"Loading CoPick root from {copick_config_path}")
    copick_root = copick.from_file(copick_config_path)

    if not copick_root.runs:
        print("No runs found in the CoPick root. Please check your data configuration.")
    else:
        # Select the first experiment to run inference on
        experiment_id = copick_root.runs[0].name
        print(f"Running inference on experiment: {experiment_id}")

        # Prepare the model input
        model_input = {
            'copick_root': copick_root,
            'experiment_id': experiment_id,
            'voxel_spacing': 10.012,
            'threshold': 0.3
        }

        # Run inference
        # This might take a few minutes
        results = model.predict(model_input)
        print("\ninference results: ", results)
Output:
    Loaded ensemble of 3 models on cuda
    Loading CoPick root from /tmp/czcdp_dataset_demo/copick_config.json
    Running inference on experiment: TS_5_4

    inference results:  {'TS_5_4': {'apo-ferritin': {'points': array([[5476.564  ,  530.636  ,  270.324  ],
        [6037.236  , 3484.176  ,  310.372  ],
        [5386.456  , 2893.468  ,  340.408  ],
        [2312.772  ,  440.528  ,  500.6    ],
        [1241.488  , 1221.464  ,  620.744  ],
        [2573.084  , 2182.616  ,  640.768  ],
        [3023.624  , 2853.42   ,  690.828  ],
        [ 981.176  , 2242.688  ,  720.864  ],
        [1581.896  , 1231.476  ,  760.912  ],
        [4645.568  , 2983.576  ,  770.924  ],
        [1491.788  , 1361.632  ,  780.93604],
        [1651.98   , 1181.416  ,  810.972  ],
        [1461.752  , 1251.5    ,  810.972  ],
        [ 780.93604, 5566.672  ,  851.02   ],
        [2382.856  , 4525.424  ,  891.068  ],
        [1501.8    , 1371.644  ,  901.08   ],
        [1712.052  , 5126.144  ,  951.14   ],
        [3974.764  , 1832.196  ,  981.176  ],
        [5236.276  , 3143.768  , 1021.224  ],
        [1641.968  , 5036.036  , 1021.224  ],
        [1772.124  , 4945.928  , 1061.272  ],
    ...
        [5436.516  , 1011.21204, 1331.5961 ],
        [3173.804  , 2723.264  , 1461.752  ]], dtype=float32), 'confidence': array([4.500285 , 4.1581726, 3.801238 , 4.2265325, 3.8788574, 3.6727734,
        4.156523 , 4.516871 , 4.4099774, 3.5110698, 1.9641333, 3.3476985],
        dtype=float32)}}}
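
The `threshold` passed in the model input already gates detections, but you can apply a stricter confidence cut after inference. A minimal sketch on toy data shaped like the output above (coordinates and scores are made up for illustration):

```python
import numpy as np

# Toy results dict mimicking the model output structure above.
results = {
    "TS_5_4": {
        "apo-ferritin": {
            "points": np.array([[5476.5,  530.6,  270.3],
                                [6037.2, 3484.1,  310.4],
                                [1491.8, 1361.6,  780.9]], dtype=np.float32),
            "confidence": np.array([4.50, 4.16, 1.96], dtype=np.float32),
        }
    }
}

# Keep only detections at or above a stricter confidence cutoff
min_confidence = 3.0
for run_results in results.values():
    for det in run_results.values():
        keep = det["confidence"] >= min_confidence
        det["points"] = det["points"][keep]
        det["confidence"] = det["confidence"][keep]

print(results["TS_5_4"]["apo-ferritin"]["points"].shape)  # (2, 3)
```

Because `points` and `confidence` are index-aligned, the same boolean mask filters both arrays consistently.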

Model Outputs

The model returns a dictionary of detected particles keyed by experiment ID. Each value is itself a dictionary mapping particle class names to the detected points and their confidence scores.

Here is an example of the output structure:

{
    'experiment_id_1': {
        'apo-ferritin': {
            'points': array([[x1, y1, z1], [x2, y2, z2], ...]),
            'confidence': array([c1, c2, ...])
        },
        'beta-galactosidase': {
            'points': array([[x3, y3, z3], ...]),
            'confidence': array([c3, ...])
        },
        ...
    }
}

This dictionary can be converted into a pandas DataFrame for easier inspection, with the following columns:

  • id: A unique identifier for each detected particle.
  • experiment: The experiment ID from which the particle was detected.
  • particle_type: The predicted class of the particle.
  • x, y, z: The coordinates of the detected particle in Angstroms.
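
As a sketch of that flattening, the nested results dictionary can be turned into such a table with plain pandas. The toy data below mimics the output format; the column names follow the list above, with a confidence column added:

```python
import numpy as np
import pandas as pd

# Toy results dict in the output format described above (values are illustrative).
results = {
    "TS_5_4": {
        "apo-ferritin": {
            "points": np.array([[5476.5,  530.6, 270.3],
                                [6037.2, 3484.1, 310.4]], dtype=np.float32),
            "confidence": np.array([4.50, 4.16], dtype=np.float32),
        },
        "ribosome": {
            "points": np.array([[100.0, 200.0, 300.0]], dtype=np.float32),
            "confidence": np.array([3.2], dtype=np.float32),
        },
    }
}

# One row per detection: experiment, class, coordinates, confidence
rows = []
for experiment, classes in results.items():
    for particle_type, det in classes.items():
        for (x, y, z), conf in zip(det["points"], det["confidence"]):
            rows.append({
                "experiment": experiment,
                "particle_type": particle_type,
                "x": float(x), "y": float(y), "z": float(z),
                "confidence": float(conf),
            })

df = pd.DataFrame(rows)
df.insert(0, "id", range(len(df)))  # unique identifier per detection
print(df)
```

Since `points` and `confidence` are index-aligned arrays, `zip` pairs each coordinate triple with its score before the rows are assembled.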

Contact and Acknowledgments

For issues with this quickstart, please contact Koki Kobayashi at koki-kobayashi@outlook.jp.

This model and quickstart were developed by Koki Kobayashi.

Special thanks to the CZI team for their support in developing this repository.

Responsible Use

We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when engaging with our services.