scLDM
Version v1.0 released 06 Nov 2025
License
MIT
Repository
https://github.com/czi-ai/scldm
Developed By
Giovanni Palla, Sudarshan Babu, Payam Dibaeinia, Donghui Li, Aly A. Khan, Theofanis Karaletsos, Jakub M. Tomczak (Chan Zuckerberg Initiative)
scLDM is a scalable latent diffusion model for single-cell gene expression that respects the fundamental exchangeability property of gene measurements. Unlike existing approaches requiring artificial orderings or complex hierarchies, we propose a streamlined VAE using fixed-size latent variables with permutation-invariant and permutation-equivariant components.
Model Details
Model Architecture
scLDM is a latent diffusion model built on a novel, fully transformer-based VAE architecture for exchangeable data that uses a single set of fixed-size, permutation-invariant latent variables. The model introduces a Multi-head Cross-Attention Block (MCAB) that serves dual purposes: it acts as a permutation-invariant pooling operator in the encoder and as a permutation-equivariant unpooling operator in the decoder. This unified approach eliminates the need for separate architectural components to handle varying set sizes. Our latent diffusion model is trained with the flow matching loss and linear interpolants using the Scalable Interpolant Transformers (SiT) formulation (Ma et al., 2024), with a denoiser parameterized by Diffusion Transformers (DiT) (Peebles & Xie, 2023). This allows for better modeling of the complex distribution of cellular states and enables controlled generation through classifier-free guidance.
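To make the pooling/unpooling duality concrete, here is a minimal PyTorch sketch of a cross-attention block used in both roles. It is illustrative only: the module name, widths, and wiring are assumptions for exposition, not the repository's actual MCAB implementation.

```python
# Illustrative sketch of cross-attention pooling/unpooling; not the
# repository's actual MCAB implementation.
import torch
import torch.nn as nn

class MCAB(nn.Module):
    """Cross-attention block: a query set attends to a key/value set."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, queries, context):
        # queries: (B, M, D), context: (B, N, D) -> output: (B, M, D)
        h, _ = self.attn(queries, context, context)
        h = self.norm1(queries + h)
        return self.norm2(h + self.mlp(h))

B, N, M, D = 2, 1000, 16, 64        # batch, genes per cell, latents, width
gene_tokens = torch.randn(B, N, D)  # the gene axis (N) is exchangeable

# Encoder: learned latent queries pool the gene set.
# Permuting `gene_tokens` along N leaves z unchanged (permutation-INVARIANT).
latent_queries = nn.Parameter(torch.randn(1, M, D))
pool = MCAB(D)
z = pool(latent_queries.expand(B, -1, -1), gene_tokens)  # (B, M, D)

# Decoder: per-gene queries (e.g., gene-identity embeddings) read from z.
# Permuting the queries permutes the output the same way (permutation-EQUIVARIANT).
gene_queries = torch.randn(B, N, D)
unpool = MCAB(D)
x_hat = unpool(gene_queries, z)  # (B, N, D)
```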
Parameters
We provide multiple variants of the model, trained on various datasets. The VAE ranges from a few million parameters up to 270M, while the denoiser network ranges from a few million up to 60M parameters.
Citation
Palla et al. Scalable Single-Cell Gene Expression Generation with Latent Diffusion Models (2025) arXiv (coming soon).
Primary Contact Email
Giovanni Palla gpalla@chanzuckerberg.com
To submit feature requests or report issues with the model, please open an issue on the GitHub repository.
System Requirements
Compute Requirements: GPU
Model Variants
| Model Variant | Description |
|---|---|
| scLDM-20M | 20 million parameter VAE of the scLDM trained on Human Census Data. |
| scLDM-70M | 70 million parameter VAE of the scLDM trained on Human Census Data. |
| scLDM-270M | 270 million parameter VAE of the scLDM trained on Human Census Data. |
| scLDM-dentate-gyrus | an scLDM trained on Dentate Gyrus (observational data) |
| scLDM-tabula-muris | an scLDM trained on Tabula Muris (observational data) |
| scLDM-hlca | an scLDM trained on the Human Lung Cell Atlas (observational data) |
| scLDM-parse1M | an scLDM trained on Parse 1M (perturbational data) |
| scLDM-replogle | an scLDM trained on Replogle (perturbational data) |
Intended Use
Primary Use Cases
- Cell embedding, where a cell is represented by its gene expression
- Unconditional synthetic cell generation (i.e., generation of gene expression profiles)
- Conditional synthetic cell generation (i.e., generation of gene expression profiles given conditioning information such as perturbations); see the sampling sketch after this list
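As a rough illustration of the conditional use case, the sketch below performs classifier-free-guidance sampling with Euler steps along the linear interpolant. The `denoiser`, `decoder`, and conditioning objects are hypothetical placeholders, not the released scLDM API; consult the repository for the actual interface.

```python
# Hypothetical sketch of classifier-free-guidance (CFG) sampling in latent
# space. `denoiser` and `decoder` are placeholders, not the released API.
import torch

@torch.no_grad()
def sample_cfg(denoiser, decoder, cond, guidance=2.0, steps=50, shape=(1, 16, 64)):
    z = torch.randn(shape)                 # start from Gaussian noise (t = 0)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt)
        v_cond = denoiser(z, t, cond)      # conditional velocity
        v_uncond = denoiser(z, t, None)    # unconditional (null-condition) velocity
        v = v_uncond + guidance * (v_cond - v_uncond)
        z = z + dt * v                     # Euler step along the linear interpolant
    return decoder(z)                      # frozen VAE decoder -> expression parameters
```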
Out-of-Scope or Unauthorized Use Cases
Do not use the model for the following purposes:
- Use that violates applicable laws, regulations (including trade compliance laws), or third-party rights such as privacy or intellectual property rights.
- Any use that is prohibited by the license.
- Any use that is prohibited by the Acceptable Use Policy.
- Making clinical diagnoses or providing treatment recommendations. The model is intended for research and informational purposes only.
Training Details
Training Data
In three experiments, we trained 8 variants of our approach:
- (i) three models on benchmark observational datasets: Dentate Gyrus (18k cells, 17k genes), Tabula Muris (245k cells, 20k genes), and HLCA (585k cells, 28k genes),
- (ii) two models on perturbational datasets: Parse1M (1.27M cells, 2k highly variable genes) and Replogle (624k cells, 2k highly variable genes),
- (iii) three models on Human Census Data (around 60M cells).
Training Procedure
All models were trained in two stages using the Adam optimizer: (i) first, the VAE component was trained; (ii) then, the latent diffusion component was trained in the latent space while the VAE was kept frozen.
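The sketch below illustrates the two stages under common conventions: a Negative Binomial reconstruction term for the VAE, then a flow matching loss on linear interpolants in the frozen latent space. The `vae`, `denoiser`, and data loaders are placeholders standing in for the repository's actual components.

```python
# Illustrative two-stage training loop (placeholder models and loaders;
# not the repository's code).
import torch
import torch.nn.functional as F

# Stage 1: train the VAE with a Negative Binomial reconstruction term.
vae_opt = torch.optim.Adam(vae.parameters(), lr=1e-4)
for counts in loader:                      # raw count matrix, (B, genes)
    z, kl = vae.encode(counts)             # latents plus KL to the prior
    mu, theta = vae.decode(z)              # NB mean and inverse dispersion
    nb_nll = -(torch.lgamma(counts + theta) - torch.lgamma(theta)
               - torch.lgamma(counts + 1)
               + theta * torch.log(theta / (theta + mu))
               + counts * torch.log(mu / (theta + mu))).sum(-1)
    loss = (nb_nll + kl).mean()
    vae_opt.zero_grad(); loss.backward(); vae_opt.step()

# Stage 2: freeze the VAE; train the DiT denoiser with the flow matching
# loss on linear interpolants z_t = (1 - t) * noise + t * z_data (SiT-style).
vae.requires_grad_(False)
dit_opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)
for counts, cond in cond_loader:
    with torch.no_grad():
        z1, _ = vae.encode(counts)         # data endpoint in latent space
    z0 = torch.randn_like(z1)              # noise endpoint
    t = torch.rand(z1.shape[0], 1, 1)
    zt = (1 - t) * z0 + t * z1             # linear interpolant
    target_v = z1 - z0                     # constant velocity of the linear path
    loss = F.mse_loss(denoiser(zt, t.squeeze(), cond), target_v)
    dit_opt.zero_grad(); loss.backward(); dit_opt.step()
```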
Training Code
https://github.com/czi-ai/scldm
Training Hyperparameters
All hyperparameters are presented in the appendix of: Palla et al. Scalable Single-Cell Gene Expression Generation with Latent Diffusion Models (2025) arXiv (coming soon).
Data Sources
The following datasets were used for training and evaluation: Dentate Gyrus, Tabula Muris, the Human Lung Cell Atlas (HLCA), Parse 1M, Replogle, and Human Census Data (see Training Data above).
Performance Metrics
Metrics
In our experiments, we used two sets of metrics:
- Reconstruction Metrics: We use the reconstruction error under the Negative Binomial distribution, the Pearson Correlation Coefficient (PCC), and the Mean Squared Error (MSE).
- Generation Metrics: To evaluate the generation capabilities of the models, we use the Maximum Mean Discrepancy (MMD) with an RBF kernel, the Wasserstein Distance, and the Fréchet Distance, all calculated on 30 principal components. We compute the PCA on the true data and project the generated data using the same loadings. All evaluations were run using three seeds; a sketch of this procedure follows below.
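For concreteness, the following sketch computes two of these metrics with their standard formulas, fitting the PCA on real data only and projecting the generated cells with the same loadings. The arrays `x_true` and `x_gen` are placeholders, and this is not the authors' exact evaluation code.

```python
# Sketch of generation metrics on 30 PCs (standard formulas; not the
# authors' exact evaluation code). `x_true`, `x_gen` are placeholders.
import numpy as np
from scipy import linalg
from sklearn.decomposition import PCA

def frechet_distance(a, b):
    """Fréchet distance between Gaussians fit to two point sets."""
    mu_a, mu_b = a.mean(0), b.mean(0)
    cov_a, cov_b = np.cov(a, rowvar=False), np.cov(b, rowvar=False)
    covmean = linalg.sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real               # drop numerical imaginary parts
    return ((mu_a - mu_b) ** 2).sum() + np.trace(cov_a + cov_b - 2 * covmean)

def mmd_rbf(a, b, gamma=1.0):
    """Biased MMD^2 estimate with an RBF kernel."""
    def k(x, y):
        d = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d)
    return k(a, a).mean() + k(b, b).mean() - 2 * k(a, b).mean()

# Fit the PCA on the true data only, then project the generated cells
# using the same loadings before scoring.
pca = PCA(n_components=30).fit(x_true)       # x_true: (cells, genes)
t, g = pca.transform(x_true), pca.transform(x_gen)
print(frechet_distance(t, g), mmd_rbf(t, g))
```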
Evaluation Datasets
In all experiments, we used separate test subsets for each dataset used for training.
Evaluation Baselines
In our experiments, we used the following baselines:
- Single-cell Variational Inference (scVI) (Lopez et al., 2018) is a VAE-based generative model designed for single-cell discrete data.
- scDiffusion (Luo et al., 2024) is a latent diffusion model for single-cell gene expression data.
- CFGen is a current state-of-the-art latent diffusion model that builds upon scVI, training a latent flow matching model in the VAE's latent space (Palma et al., 2025).
- Compositional Perturbation Autoencoder (CPA) (Lotfollahi et al., 2023) is a deep generative model developed to predict gene expression changes under perturbations and their combinations.
Evaluation Results
- We achieved state-of-the-art performance on five cell generation benchmarks, on both observational and perturbational data, as well as on two classification downstream tasks.
- In our experiments, we demonstrated that enforcing the inductive bias of exchangeability is critical for generative modeling of single-cell data.
Biases, Risks, and Limitations
Potential Biases
- The model may reflect biases present in the training data.
- Certain demographic groups may be underrepresented.
Risks
- Inaccurate outputs: Like all generative models, this model can produce outputs that appear plausible but are incorrect.
- Potential misuse for incorrect biological interpretations: The model's outputs could be misinterpreted as confirmed facts, leading to flawed experimental designs or incorrect scientific conclusions if not properly validated.
Limitations
- For now, the model is limited to a single species.
- The model has not been thoroughly tested for generalization across cell types, tissues, and other contexts.
- The model has not been thoroughly tested on perturbational data (e.g., chemical perturbations).
Caveats and Recommendations
- Always review and validate outputs generated by the model.
- Treat model outputs as machine-generated hypotheses that require further experimental validation, not as established biological facts.
- We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when using the model.
Should you have any security or privacy issues or questions related to this model, please reach out to our team at security@chanzuckerberg.com or privacy@chanzuckerberg.com respectively.
Acknowledgements
The authors would like to thank Isaac Virshup and Lakshmi Krishnan for insightful discussions. We also thank the AI Infrastructure team for compute resources. Finally, we thank Steve Herrin for model packaging and release.