scPRINT
Version v1.0 released 01 Jul 2024
License
MIT License
Repository
https://github.com/cantinilab/scPRINT
Developed By
Jérémie Kalfon (Cantini Lab, Institut Pasteur)
scPRINT is a cell foundation model, also called a Large Cell Model (LCM), trained on single-cell RNA sequencing (scRNAseq) data from more than 50M human and mouse cells available through CZ CELLxGENE. Based on the transformer architecture, the model is fully open source and reproducible, with multiple checkpoint sizes available, from 2M to 100M parameters. scPRINT demonstrated high performance for genome-wide, cell-specific gene network inference when benchmarked against state-of-the-art models (e.g., scGPT, Geneformer v2, GENIE3). In addition, scPRINT has various zero-shot capabilities, including cell embedding, cell label prediction (e.g., cell type, sex, disease), and gene expression imputation, highlighting its potential as a versatile tool for single-cell analysis.
Model Details
Model Architecture
The main scPRINT model has 16 layers, 8 attention heads, and a 512-dimensional embedding, with a 512-512-C classifier head for each class (cell type, disease, sex, ethnicity, sequencer, organism). It has a gene expression encoding MLP of size 1-512-512 that upscales log1p-transformed count data and a gene decoding MLP of size 512-512-3 that predicts the parameters of a zero-inflated negative binomial (ZINB) distribution, a common practice for scRNAseq neural networks. scPRINT was trained on 43,000 gene embeddings from ESM2, along with 8 additional learned input cls-pooling tokens.
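To make these dimensions concrete, the following is a minimal PyTorch sketch of the 1-512-512 expression encoder and the 512-512-3 ZINB decoder described above. The class names, activation choices, and the exact parameterization of the ZINB outputs are assumptions made for illustration; the actual implementation lives in the scPRINT repository and may differ.

```python
import torch
import torch.nn as nn

class ExprEncoder(nn.Module):
    """Illustrative 1-512-512 MLP lifting a log1p-transformed count per gene
    into the 512-dimensional token space (hypothetical names and activations)."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(1, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, counts: torch.Tensor) -> torch.Tensor:
        # counts: (batch, genes) raw counts -> (batch, genes, d_model) tokens
        return self.mlp(torch.log1p(counts).unsqueeze(-1))

class ZINBDecoder(nn.Module):
    """Illustrative 512-512-3 MLP predicting three ZINB parameters per gene token
    (mean, dispersion, zero-inflation logit)."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 3),
        )

    def forward(self, tokens: torch.Tensor):
        mu, theta, pi_logit = self.mlp(tokens).unbind(-1)
        # Exponentiate so mean and dispersion stay positive.
        return torch.exp(mu), torch.exp(theta), pi_logit
```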
Parameters
100M
Citation
Kalfon, J., et al. (2025) scPRINT: pre-training on 50 million cells allows robust gene network predictions. Nature Communications 16: 3607. DOI: 10.1038/s41467-025-58699-1
Model Card Authors
Jérémie Kalfon (Institut Pasteur)
Primary Contact Email
Jérémie Kalfon (jkobject@gmail.com), Laura Cantini (laura.cantini@pasteur.fr).
To submit feature requests or report issues with the model, please open an issue on the GitHub repository.
System Requirements
- Minimal requirements: 1 CPU with 16 GB of RAM.
- Recommended requirements: 1 NVIDIA GPU with 8 GB of GPU memory.
Model Variants
| Model Variant Name | Description | Access URL |
|---|---|---|
| Large (main) | Largest model version (100M parameters) | https://huggingface.co/jkobject/scPRINT/blob/main/large.ckpt |
| Medium | Medium-sized model version (20M parameters) | https://huggingface.co/jkobject/scPRINT/blob/main/medium.ckpt |
| v2-medium | Medium-sized model version with more training data (20M parameters) | https://huggingface.co/jkobject/scPRINT/blob/main/v2-medium.ckpt |
| Small | Smallest model version (7M parameters) | https://huggingface.co/jkobject/scPRINT/blob/main/small.ckpt |
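As a usage hint, the checkpoints listed above can be fetched programmatically from the Hugging Face Hub. The snippet below is a minimal sketch that only assumes the `huggingface_hub` package; loading the downloaded checkpoint into a model is done with the scPRINT package itself, so refer to the scPRINT GitHub README for the exact loading call.

```python
from huggingface_hub import hf_hub_download

# Download one of the released checkpoints from the table above.
# repo_id and filename match the Access URLs (large.ckpt, medium.ckpt, v2-medium.ckpt, small.ckpt).
ckpt_path = hf_hub_download(repo_id="jkobject/scPRINT", filename="large.ckpt")
print(ckpt_path)  # local cache path of the downloaded checkpoint
```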
Intended Use
Primary Use Cases
- Expression denoising (zero-imputation and sequencing-depth increase)
- Gene-Network inference
- Cell and gene embedding
- Cell label prediction (cell type, tissue, disease, sex, ethnicity, sequencer)
Secondary Use Cases
- Novel cell expression profile generation using cell embedding arithmetic
- Gene knockout transcriptomic effect prediction
- Small molecule perturbation transcriptomic effect prediction
- Transcriptomic trajectory inference
- Cell-specific gene expression embedding
Out-of-Scope or Unauthorized Use Cases
Do not use the model for the following purposes:
- Use that violates applicable laws, regulations (including trade compliance laws), or third-party rights such as privacy or intellectual property rights.
- Any use that is prohibited by the MIT License.
- Any use that is prohibited by the Acceptable Use Policy.
Training Details
Training Data
scPRINT was trained on scRNAseq data from over 50 million cells available through CZ CELLxGENE. Available datasets were processed to remove all spatial omics datasets, cells with <200 expressed genes, and datasets with low coverage (datasets with <100 cells, <10,000 genes, or from which >95% of the cells were removed). The final training dataset included human and mouse primary cells from 548 datasets, representing 54,084,961 cells.
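The cell- and dataset-level filters described above can be illustrated with a small scanpy/AnnData sketch. The thresholds mirror the text; the function name and per-dataset bookkeeping are hypothetical, and the actual preprocessing is implemented in the scdataloader package.

```python
import scanpy as sc

def passes_dataset_filters(adata) -> bool:
    """Dataset-level filters from the card: >=100 cells, >=10,000 genes,
    and no more than 95% of the cells removed by the per-cell filter."""
    n_cells_before = adata.n_obs
    sc.pp.filter_cells(adata, min_genes=200)  # drop cells with <200 expressed genes
    kept_fraction = adata.n_obs / max(n_cells_before, 1)
    return adata.n_obs >= 100 and adata.n_vars >= 10_000 and kept_fraction >= 0.05
```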
Training Procedure
Pretraining data was downloaded and preprocessed using the procedure detailed in scdataloader. The command used to preprocess datasets from CZ CELLxGENE can be found on GitHub. Gene locations and tokenization were generated following the Generate_gene_embeddings notebook. Ontologies and gene names were downloaded using LaminDB's Bionty framework, as described in the scdataloader README.
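For intuition only, the sketch below shows one way to obtain a protein-language-model embedding for a gene product using the Hugging Face `transformers` ESM2 checkpoints. The exact embedding pipeline used for scPRINT is the one in the Generate_gene_embeddings notebook; the checkpoint name, pooling choice, and toy sequence here are assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Small public ESM2 checkpoint, used purely for illustration.
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = AutoModel.from_pretrained("facebook/esm2_t6_8M_UR50D")

def embed_protein(sequence: str) -> torch.Tensor:
    """Mean-pool the last hidden state over residues to get one vector per protein."""
    inputs = tokenizer(sequence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_dim)
    return hidden.mean(dim=1).squeeze(0)

vec = embed_protein("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")  # toy sequence
print(vec.shape)
```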
Training Code
The training dataset can be downloaded through LaminDB, following the scripts in scdataloader (see Training Procedure above). Pretraining scripts can be run using the command-line interface (`scprint fit ...`). More information is available in the scPRINT GitHub README file.
Speeds, Sizes, Times
The medium scPRINT model can be trained in three days on an A40 GPU; inference runs at around 640 samples per second with a context size of 3,000 genes. Model checkpoints range from 100 MB to 1 GB.
Training Hyperparameters
- Optimizer: fused AdamW (see the configuration sketch after this list)
- Weight decay: 0.01
- Stochastic weight averaging learning rate: 0.03
- Dropout: 0.1
- Learning rate (LR): 1e-4 (during pre-training)
- Precision: 16-mixed with residuals in fp32
- Gradient clip value: 100
- Training batches per sub-epoch: 7000
- Validation batches per sub-epoch: 2000
- Warmup duration: 500 steps
- Linear LR decrease factor: 0.6 (across epochs)
- LR decrease patience: 1 (epochs)
- Early stopping patience: 3 (consecutive increases in validation loss)
- Class decoder weight initialization: Normal distribution around 1
- Class decoder bias initialization: 0 and -0.12
- Batch size: 64
- Transformer pre-norm strategy: Yes
- Stochastic depth dropout rate: Linearly increasing, 0.02 per layer
- Noise parameter: 60%
- Train/validation split: 96% train, 2% validation
- Test split: 2%
- Weighted random sampling factor: 50
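To make the optimization settings above concrete, here is a minimal PyTorch sketch wiring together several of the listed choices (fused AdamW, weight decay 0.01, learning rate 1e-4, 500 warmup steps, plateau-based LR decrease by a factor of 0.6 with patience 1, and gradient clipping at 100). It is an illustrative reconstruction rather than the project's training loop, and details such as the warmup schedule shape are assumptions; hyperparameters not shown here (e.g., stochastic weight averaging, stochastic depth) are omitted.

```python
import torch

model = torch.nn.Linear(512, 512)  # placeholder for the actual scPRINT model

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,
    weight_decay=0.01,
    fused=torch.cuda.is_available(),  # fused kernels require a CUDA device
)

# Warmup over the first 500 optimization steps (assumed linear here).
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1e-3, total_iters=500)

# Reduce the LR by a factor of 0.6 when validation loss plateaus (patience of 1 epoch).
plateau = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.6, patience=1)

def training_step(batch_loss: torch.Tensor) -> None:
    optimizer.zero_grad()
    batch_loss.backward()
    # Gradient clip value of 100, as listed above.
    torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=100)
    optimizer.step()
    warmup.step()

def on_validation_epoch_end(val_loss: float) -> None:
    plateau.step(val_loss)
```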
Data Sources
The training data was downloaded from CZ CELLxGENE (LTS 2023-12-15 release). The mouse embryonic development dataset by Qiu et al., 2024 was also included.
Performance Metrics
Metrics
scPRINT was evaluated against GENIE3, scGPT, scFoundation, Geneformer v2, DeepSEM, CellTypist, and other models from the OpenProblems v1 platform using a range of benchmarks. Key metrics include:
- Early Precision Ratio (EPR) and Area Under the Precision-Recall Curve (AUPRC) across a range of datasets and ground truths for Gene Network Inference.
- AUPRC on cell-type prediction from the OpenProblems v1 benchmarks.
- scIB scores for batch-effect correction from the OpenProblems v1 benchmarks.
- Percentage improvement in Pearson correlation for expression denoising from three test datasets from human tissues, including ocular anterior segment, retina, and small intestine and colon epithelium.
- Gene networks were benchmarked mostly using the BenGRN package (a simplified sketch of the EPR and AUPRC computations follows this list).
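For readers unfamiliar with these metrics, the sketch below gives one common way to compute AUPRC and the Early Precision Ratio from flattened edge scores and a binary ground-truth network. The exact definitions used in the paper are those of the BenGRN package, so treat this as an assumed, simplified version (EPR is taken here as top-k precision divided by the ground truth's positive rate, with k equal to the number of true edges).

```python
import numpy as np
from sklearn.metrics import average_precision_score

def auprc_and_epr(scores, truth, k=None):
    """scores, truth: flattened edge scores and 0/1 ground-truth labels."""
    auprc = average_precision_score(truth, scores)
    k = int(truth.sum()) if k is None else k
    top_k = np.argsort(scores)[::-1][:k]          # highest-scoring edges
    early_precision = truth[top_k].mean()          # precision among top-k edges
    random_precision = truth.mean()                # precision of a random predictor
    return auprc, early_precision / random_precision

# Toy example: 6 candidate edges, 2 of them in the ground truth.
scores = np.array([0.9, 0.8, 0.1, 0.7, 0.2, 0.05])
truth = np.array([1, 0, 0, 1, 0, 0])
print(auprc_and_epr(scores, truth))
```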
Evaluation Datasets
Test datasets were selected from the pre-training set via scdataloader. All evaluation datasets can be downloaded from CZ CELLxGENE, including scRNAseq data obtained from:
Evaluation Results
I. Evaluation of gene networks generated by scPRINT and other state-of-the-art models

- A. Gene networks (GNs) analysis workflow. Cell-type-specific gene networks were extracted for each cell type in the dataset (n = 26 cell types across three datasets). Gene Set Enrichment Analysis (GSEA) was performed on the network's nodes (n = 4000 genes). The ability of the edges to recover the OmniPath ground truth's connections was calculated.
- B. Violin plot of the ten Area Under the Precision-Recall Curve (AUPRC) and Early Precision Ratio (EPR) values obtained when comparing the inferred cell-type-specific networks with the OmniPath network for scPRINT model variants, scGPT, DeepSEM, Geneformer v2, and GENIE3, considering either only Transcription Factor (TF)-gene connections or all gene-gene connections. scPRINT model variants include "scPRINT" (average of all attention heads), "scPRINT-genome" (same scPRINT version but computing a genome-wide gene network), and "scPRINT-omnipath's heads" (same scPRINT version but with attention heads selected using a subset of OmniPath).
- C. Violin plot of the average number of TFs with enrichment for their ENCODE targets in each cell-type-specific network.
- D. Number of GNs with a significant enrichment of TFs and their cell type's marker genes.
II. Gene network inference performance on cell-type specific ground truths

- A. Workflow for generating cell-type specific ground truths. The ground truths were generated via orthogonal sequencing assays on the same cell type. ChIP-seq and perturb-seq were intersected for the MCalla et al. 2022 dataset on human (hESCs) and mouse (mESCs) Embryonic Stem Cells, whereas perturb-seq on the K562 cell line was only used for the genome-wide perturb-seq ground truth.
- B. Performance of scPRINT model variants compared to GENIE3, DeepSEM, Geneformer v2, and scGPT on the MCalla et al. ground truth using the AUPRC and EPR on two human and two mouse ESC datasets. scPRINT model variants include “scPRINT” (average of all attention heads), "scPRINT-omnipath's heads" (same scPRINT version but with attention heads selected using a subset of OmniPath), and "scPRINT-Han et al.'s heads" (same scPRINT version but with attention heads selected using a subset of the Han et al.'s ground truth dataset).
- C. Performance of scPRINT model variants compared to GENIE3, DeepSEM, Geneformer v2, and scGPT on the genome-wide perturb-seq dataset using Early Precision Ratio (EPR) and Area Under the Precision-Recall Curve (AUPRC) for the K562 cell line. scPRINT model variants include "scPRINT" (average of all attention heads), "scPRINT-omnipath's heads" (same scPRINT version but with attention heads selected using a subset of OmniPath), and "scPRINT-gwps' heads" (same scPRINT version but with attention heads selected using a subset of the genome-wide perturb-seq ground truth).
III. Evaluation of zero-shot tasks beyond gene network inference (denoising, cell type label prediction, batch correction)

- A. Performance for the denoising task compared to state-of-the-art methods, MAGIC and knnsmooth2, on three datasets (ciliary body, colon, and retina tissues) from CZ CELLxGENE. A noisy profile was generated by downsampling 70% of each cell's transcripts, and performance was measured as the increase in Spearman correlation between the denoised and the true profile relative to the correlation between the noisy and the true profile (see the sketch after this list).
- B. Performance on cell-type label prediction compared to state-of-the-art methods, including CellTypist, showing accuracy, F1, and macro-F1 scores for the OpenProblems v1 human pancreas dataset.
- C. Batch effect correction was assessed for scPRINT, scGPT, and Geneformer v2 using the scIB aggregated score on the human pancreas and lung datasets from the OpenProblems v1 challenge. For comparison, results for unintegrated data (only PCA applied) are also shown.
- D. The scIB avgBIO score on the human pancreas and lung datasets from OpenProblems v1.
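Below is a minimal sketch of the downsampling-and-correlation evaluation described in panel III.A above. The 70% transcript downsampling (modeled here as binomial thinning of counts) and the Spearman-correlation comparison follow the card's description; the function names and toy data are purely illustrative, and the actual denoiser would be MAGIC, knnsmooth2, or scPRINT.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

def downsample(counts, drop_fraction=0.7):
    """Binomially thin counts so that ~drop_fraction of transcripts are removed."""
    return rng.binomial(counts, 1.0 - drop_fraction)

def denoising_gain(true_profile, noisy, denoised):
    """Increase in Spearman correlation with the true profile, denoised vs. noisy."""
    return spearmanr(denoised, true_profile)[0] - spearmanr(noisy, true_profile)[0]

true_profile = rng.poisson(5.0, size=2000)  # toy "true" expression profile
noisy = downsample(true_profile)            # ~70% of transcripts dropped
# In practice `denoised` would be the output of a denoising method;
# here we simply reuse the noisy profile, so the expected gain is ~0.
denoised = noisy.astype(float)
print(denoising_gain(true_profile, noisy, denoised))
```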
Biases, Risks, and Limitations
Potential Biases
- The model may reflect biases present in the training data.
- While the model was trained with a re-weighting scheme, the datasets included in the training data do not represent all diseases, cell types, sequencers, demographic groups, and other conditions.
Risks
Areas of risk may include but are not limited to:
- Inaccurate outputs or hallucinations, especially in the context of low data availability during pre-training.
- Potential misuse for incorrect biological interpretations.
Limitations
- The model has only been tested on the use cases presented in the paper using human and mouse data.
- Running the model without GPUs might be very slow.
- For now, the model does not work on non-coding RNAs.
Caveats and Recommendations
- Review and validate outputs generated by the model.
- When using scPRINT for the first time, run it alongside a simpler approach as a benchmark to ensure that scPRINT's outputs make sense.
- We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when using the model.
Acknowledgements
The project leading to this manuscript received funding from the Inception program (Investissement d'Avenir grant ANR-16-CONV-0005, to L.C.) and the European Union (ERC StG MULTIview-CELL, 101115618, to L.C.). We acknowledge the help of the HPC Core Facility of the Institut Pasteur and Déborah Philipps for administrative support. The work of G. Peyré was supported by the French government under the management of the Agence Nationale de la Recherche as part of the "Investissements d'avenir" program, reference ANR19-P3IA-0001 (PRAIRIE 3IA Institute).
If you have recommendations for this model card please contact virtualcellmodels@chanzuckerberg.com.