Train TopCUP Model to Extract Protein Particles in CryoET Dataset
Estimated time to complete: 20 minutes
Learning Goals
- Create a copick configuration file for loading cryoET dataset.
- Train TopCUP models and automatically save best checkpoints via its CLI.
Prerequisites
- The TopCUP model requires python>=3.10. At the time of publication, Colab defaults to Python 3.12.
- This model requires a minimum of a T4 GPU to run.
Introduction
The Top CryoET U-Net Picker (TopCUP) is a 3D U-Net–based ensemble model designed for particle picking in cryo-electron tomography (cryoET) volumes. It uses a segmentation heatmap approach to identify particle locations. TopCUP is fully integrated with copick, a flexible cryoET dataset API developed at the Chan Zuckerberg Imaging Institute (CZII). This integration makes it easy to apply the model directly to any cryoET dataset in copick format.
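As a toy illustration of the heatmap idea (this is not TopCUP's actual post-processing, which uses an ensemble and per-class thresholds described below), voxels whose predicted score exceeds a threshold become candidate particle locations:

```python
import numpy as np

def heatmap_to_picks(heatmap, threshold):
    """Toy picker: coordinates of voxels whose score exceeds the threshold."""
    return np.argwhere(heatmap > threshold)

# Tiny synthetic 3D heatmap with one confident voxel
vol = np.zeros((5, 5, 5))
vol[2, 2, 2] = 0.9
print(heatmap_to_picks(vol, 0.5))  # -> [[2 2 2]]
```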
For this tutorial, we will use seven tomograms from the Experimental Training Dataset (Dataset ID: DS-10440), which is the same dataset used in the Kaggle CryoET Challenge. Now that this dataset is publicly available on the CZ CryoET Data Portal, we can stream it directly using the copick configuration file provided below. We can generate a copick configuration file automatically with the copick API and add metadata for each particle for training TopCUP models.
Setup
A copick configuration file is required as input.
The copick configuration file must define pickable objects (i.e., the protein complexes you want to detect) and three key metadata parameters for each object:
- class_loss_weight: weight for each class in the DenseCrossEntropy loss
- score_threshold: threshold to filter final picks per class, reducing false positives
- score_weight: weight for each class in the F-beta score evaluation
You can find additional instructions and template configurations for accessing datasets across different platforms from the official copick page.
An example of a copick file is linked here at the model Github.
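Before training, it can help to confirm that every pickable object carries all three keys. The sketch below is a hypothetical helper (not part of copick), using the key names from the training configuration later in this tutorial:

```python
REQUIRED_KEYS = {"score_weight", "score_threshold", "class_loss_weight"}

def missing_metadata(metadata):
    """Return names of pickable objects missing any required training key."""
    return [name for name, meta in metadata.items()
            if not REQUIRED_KEYS <= set(meta)]

example = {
    "ferritin-complex": {"score_weight": 1, "score_threshold": 0.16,
                         "class_loss_weight": 256},
    "beta-amylase": {"score_weight": 0},  # incomplete on purpose
}
print(missing_metadata(example))  # -> ['beta-amylase']
```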
Installation
First, download the repository, which will also install required packages.
!pip install git+https://github.com/czimaginginstitute/czii_cryoet_mlchallenge_winning_models.git
Copick Configuration File
The code below adds metadata for the particles and streams in our copick file.
import os, copick
metadata = {
"ferritin-complex": {
"score_weight": 1,
"score_threshold": 0.16,
"class_loss_weight": 256
},
"thyroglobulin": {
"score_weight": 2,
"score_threshold": 0.18,
"class_loss_weight": 256
},
"beta-galactosidase": {
"score_weight": 2,
"score_threshold": 0.13,
"class_loss_weight": 256
},
"beta-amylase": {
"score_weight": 0,
"score_threshold": 0.25,
"class_loss_weight": 256
},
"cytosolic-ribosome": {
"score_weight": 1,
"score_threshold": 0.19,
"class_loss_weight": 256
},
"virus-like-capsid": {
"score_weight": 1,
"score_threshold": 0.5,
"class_loss_weight": 256
}
}
copick_config_path = os.path.abspath('./training_copick_config_portal.json')
overlay_path = os.path.abspath('./tmp_overlay')
copick_root = copick.from_czcdp_datasets(
[10440], #dataset_ids
overlay_path, # overlay_root, self-defined
{'auto_mkdir': True}, # filesystem args for the overlay
output_path = copick_config_path,
)
# only consider the 6 particles
config_pickable_objects = []
for p in copick_root.config.pickable_objects:
if p.name in metadata:
p.metadata = metadata[p.name]
config_pickable_objects.append(p)
copick_root.config.pickable_objects = config_pickable_objects
# save the copick config for later use
copick_root.save_config(copick_config_path)
Additional Copick Command Options
You can explore dataset-specific options such as run_names, pixelsize, tomo_type, and annotator user_id using the copick API.
# Check available run names
for run in copick_root.runs:
pss = [str(vs.voxel_size) for vs in run.voxel_spacings]
ps = ','.join(set(pss))
users = [p.user_id for p in run.picks]
urs = ','.join(set(users))
print(f"run name: {run.name}, annotation user_id: {urs}, available voxelsize/pixelsize: {ps} A")
Output:
run name: 16463, annotation user_id: data-portal, available voxelsize/pixelsize: 4.99,10.012 A
run name: 16464, annotation user_id: data-portal, available voxelsize/pixelsize: 4.99,10.012 A
run name: 16465, annotation user_id: data-portal, available voxelsize/pixelsize: 4.99,10.012 A
run name: 16466, annotation user_id: data-portal, available voxelsize/pixelsize: 4.99,10.012 A
run name: 16467, annotation user_id: data-portal, available voxelsize/pixelsize: 4.99,10.012 A
run name: 16468, annotation user_id: data-portal, available voxelsize/pixelsize: 4.99,10.012 A
run name: 16469, annotation user_id: data-portal, available voxelsize/pixelsize: 4.99,10.012 A
# Get a single run
run = copick_root.get_run('16463')
voxel_spacing_obj = run.get_voxel_spacing(10.012)
# Check available reconstruction_type
tts = [t.tomo_type for t in voxel_spacing_obj.tomograms]
tt = ','.join(tts)
print(f'run {run.name} has tomogram_type: {tt}')
Output:
run 16463 has tomogram_type: wbp-denoised-denoiset-ctfdeconv,wbp-filtered-ctfdeconv
TopCUP CLI Commands
To explore the available options for running TopCUP, use the --help flag. In your terminal, run topcup train --help. This will display all command-line options and arguments for TopCUP training, as shown below:
Usage: topcup train [OPTIONS]
Options:
-c, --copick_config FILE copick config file path [required]
-tts, --train_run_names TEXT Tomogram dataset run names for training
[required]
-vts, --val_run_names TEXT Tomogram dataset run names for validation
[required]
-tt, --tomo_type TEXT Tomogram type. Default is denoised.
-u, --user_id TEXT Needed for training, the user_id used for the
ground truth picks.
-s, --session_id TEXT Needed for training, the session_id used for
the ground truth picks. Default is None.
-bs, --batch_size INTEGER batch size for data loader
-n, --n_aug INTEGER Data augmentation copy. Default is 1112.
-l, --learning_rate FLOAT Learning rate for optimizer
-p, --pretrained_weight TEXT One pretrained weights file path. Default is
None.
-e, --epochs INTEGER Number of epochs. Default is 100.
--pixelsize FLOAT Pixelsize in angstrom. Default is 10.0A.
-o, --output_dir TEXT output dir for saving checkpoints
-v, --logger_version INTEGER PyTorch-Lightning logger version. If not set,
logs and outputs will increment to the next
version.
  -h, --help                     Show this message and exit.
Training
Next, we will train the model through the TopCUP CLI. Training takes about 19 minutes per epoch with a batch size of 4; downloading the data locally can reduce the per-epoch data loading overhead. For this tutorial, we will train the model for only one epoch.
#Code for running model training in Jupyter with live printouts. You can also run the commands directly in a terminal.
from topcup.cli.cli import cli
training_outputs = os.path.abspath('./outputs_training')
cli.main(
args=[
"train",
"-c", f"{str(copick_config_path)}",
"-u", "data-portal",
"-tts", "16463,16464,16465,16466,16467,16468",
"-vts", "16469",
"-bs", "4",
"-n", "16", # reduced for this tutorial; use the default (1112) to replicate the performance
"-o", f"{str(training_outputs)}",
"--pixelsize", "10.012",
"-tt", "wbp-denoised-denoiset-ctfdeconv",
"-v", "0",
"-e", "1"
],
standalone_mode=False, # so Click doesn’t exit on exceptions
)
Analysis of Model Outputs
The model will automatically track the validation performance and save the best checkpoint and history metrics inside the specified output directory. The evaluation score for each epoch will be shown in the printouts. The output directory can be changed using the -o flag.
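To locate the saved checkpoints afterward, here is a minimal sketch, assuming checkpoints are written with the standard PyTorch-Lightning .ckpt extension somewhere under the output directory:

```python
from pathlib import Path

def list_checkpoints(output_dir):
    """Return all .ckpt files saved anywhere under the training output dir."""
    return sorted(Path(output_dir).rglob("*.ckpt"))

# e.g. list_checkpoints('./outputs_training')
```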
Summary
In this tutorial, we generated and streamed a copick configuration file, trained the TopCUP model via the CLI, and saved the best checkpoint in the specified output directory.
Contact and Acknowledgments
For issues with this notebook please contact kevin.zhao@czii.org.
Special thanks to Christof Henkel for developing the segmentation models and to Utz Ermel for developing copick.
References
- Peck, A., et al., (2025) A Realistic Phantom Dataset for Benchmarking Cryo-ET Data Annotation. Nature Methods. DOI: 10.1101/2024.11.04.621686
Responsible Use
We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when engaging with our services.
Should you have any security or privacy issues or questions related to the services, please reach out to our team at security@chanzuckerberg.com or privacy@chanzuckerberg.com respectively.