scLDM
Version v1.0 released 06 Nov 2025
License
MIT
Repository
https://github.com/czi-ai/scldm
Developed By
Giovanni Palla, Sudarshan Babu, Payam Dibaeinia, Donghui Li, Aly A. Khan, Theofanis Karaletsos, Jakub M. Tomczak (Chan Zuckerberg Initiative)
scLDM is a scalable latent diffusion model for single-cell gene expression that respects the fundamental exchangeability property of gene measurements. Unlike existing approaches requiring artificial orderings or complex hierarchies, we propose a streamlined VAE using fixed-size latent variables with permutation-invariant and permutation-equivariant components.
Model Details
Model Architecture
scLDM is a latent diffusion model built on a novel, fully transformer-based VAE architecture for exchangeable data that uses a single set of fixed-size, permutation-invariant latent variables. The model introduces a Multi-head Cross-Attention Block (MCAB) that serves dual purposes: it acts as a permutation-invariant pooling operator in the encoder and as a permutation-equivariant unpooling operator in the decoder. This unified approach eliminates the need for separate architectural components to handle varying set sizes. Our latent diffusion model is trained with the flow matching loss and linear interpolants using the Scalable Interpolant Transformers (SiT) formulation (Ma et al., 2024), with a denoiser parameterized by Diffusion Transformers (DiT) (Peebles & Xie, 2023). This allows for better modeling of the complex distribution of cellular states and enables controlled generation through classifier-free guidance.
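To make the pooling/unpooling duality concrete, here is a minimal PyTorch sketch of a cross-attention block used in both roles. It is illustrative only: the module name, widths, and wiring are assumptions for exposition, not the repository's actual MCAB implementation.

```python
# Illustrative sketch of cross-attention pooling/unpooling; not the
# repository's actual MCAB implementation.
import torch
import torch.nn as nn

class MCAB(nn.Module):
    """Cross-attention block: a query set attends to a key/value set."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, queries, context):
        # queries: (B, M, D), context: (B, N, D) -> output: (B, M, D)
        h, _ = self.attn(queries, context, context)
        h = self.norm1(queries + h)
        return self.norm2(h + self.mlp(h))

B, N, M, D = 2, 1000, 16, 64        # batch, genes per cell, latents, width
gene_tokens = torch.randn(B, N, D)  # the gene axis (N) is exchangeable

# Encoder: learned latent queries pool the gene set.
# Permuting `gene_tokens` along N leaves z unchanged (permutation-INVARIANT).
latent_queries = nn.Parameter(torch.randn(1, M, D))
pool = MCAB(D)
z = pool(latent_queries.expand(B, -1, -1), gene_tokens)  # (B, M, D)

# Decoder: per-gene queries (e.g., gene-identity embeddings) read from z.
# Permuting the queries permutes the output the same way (permutation-EQUIVARIANT).
gene_queries = torch.randn(B, N, D)
unpool = MCAB(D)
x_hat = unpool(gene_queries, z)  # (B, N, D)
```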
Parameters
We provide multiple variants of the model, trained on various datasets. The VAE ranges from a few million parameters up to 270M, while the denoiser network ranges from a few million up to 60M parameters.
Citation
Palla et al. Scalable Single-Cell Gene Expression Generation with Latent Diffusion Models (2025) arXiv (coming soon).
Primary Contact Email
Giovanni Palla gpalla@chanzuckerberg.com
To submit feature requests or report issues with the model, please open an issue on the GitHub repository.
System Requirements
Compute Requirements: GPU
Model Variants
| Model Variant | Description |
|---|---|
| scLDM-20M | 20 million parameter VAE of the scLDM trained on Human Census Data. |
| scLDM-70M | 70 million parameter VAE of the scLDM trained on Human Census Data. |
| scLDM-270M | 270 million parameter VAE of the scLDM trained on Human Census Data. |
| scLDM-dentate-gyrus | an scLDM trained on Dentate Gyrus (observational data) |
| scLDM-tabula-muris | an scLDM trained on Tabula Muris (observational data) |
| scLDM-hlca | an scLDM trained on the Human Lung Cell Atlas (observational data) |
| scLDM-parse1M | an scLDM trained on Parse 1M (perturbational data) |
| scLDM-replogle | an scLDM trained on Replogle (perturbational data) |
Intended Use
Primary Use Cases
- Cell embedding, where a cell is represented by its gene expression
- Unconditional synthetic cell generation (i.e., generation of gene expression profiles)
- Conditional synthetic cell generation (i.e., generation of gene expression profiles given conditioning information such as perturbations); see the sampling sketch after this list
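As a rough illustration of the conditional use case, the sketch below performs classifier-free-guidance sampling with Euler steps along the linear interpolant. The `denoiser`, `decoder`, and conditioning objects are hypothetical placeholders, not the released scLDM API; consult the repository for the actual interface.

```python
# Hypothetical sketch of classifier-free-guidance (CFG) sampling in latent
# space. `denoiser` and `decoder` are placeholders, not the released API.
import torch

@torch.no_grad()
def sample_cfg(denoiser, decoder, cond, guidance=2.0, steps=50, shape=(1, 16, 64)):
    z = torch.randn(shape)                 # start from Gaussian noise (t = 0)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt)
        v_cond = denoiser(z, t, cond)      # conditional velocity
        v_uncond = denoiser(z, t, None)    # unconditional (null-condition) velocity
        v = v_uncond + guidance * (v_cond - v_uncond)
        z = z + dt * v                     # Euler step along the linear interpolant
    return decoder(z)                      # frozen VAE decoder -> expression parameters
```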
Out-of-Scope or Unauthorized Use Cases
Do not use the model for the following purposes:
- Use that violates applicable laws, regulations (including trade compliance laws), or third-party rights such as privacy or intellectual property rights.
- Any use that is prohibited by the license.
- Any use that is prohibited by the Acceptable Use Policy.
- Making clinical diagnoses or providing treatment recommendations. The model is intended for research and informational purposes only.
Training Details
Training Data
In three experiments, we trained 8 variants of our approach:
- (i) three models on benchmark observational datasets: Dentate Gyrus (18k cells, 17k genes), Tabula Muris (245k cells, 20k genes), and HLCA (585k cells, 28k genes),
- (ii) two models on perturbational datasets: Parse1M (1.27M cells, 2k highly variable genes) and Replogle (624k cells, 2k highly variable genes),
- (iii) three models on Human Census Data (around 60M cells).
Training Procedure
All models were trained in two stages using the Adam optimizer: (i) first, the VAE component was trained; (ii) then, the latent diffusion component was trained in the latent space while the VAE was kept frozen.
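The sketch below illustrates the two stages under common conventions: a Negative Binomial reconstruction term for the VAE, then a flow matching loss on linear interpolants in the frozen latent space. The `vae`, `denoiser`, and data loaders are placeholders standing in for the repository's actual components.

```python
# Illustrative two-stage training loop (placeholder models and loaders;
# not the repository's code).
import torch
import torch.nn.functional as F

# Stage 1: train the VAE with a Negative Binomial reconstruction term.
vae_opt = torch.optim.Adam(vae.parameters(), lr=1e-4)
for counts in loader:                      # raw count matrix, (B, genes)
    z, kl = vae.encode(counts)             # latents plus KL to the prior
    mu, theta = vae.decode(z)              # NB mean and inverse dispersion
    nb_nll = -(torch.lgamma(counts + theta) - torch.lgamma(theta)
               - torch.lgamma(counts + 1)
               + theta * torch.log(theta / (theta + mu))
               + counts * torch.log(mu / (theta + mu))).sum(-1)
    loss = (nb_nll + kl).mean()
    vae_opt.zero_grad(); loss.backward(); vae_opt.step()

# Stage 2: freeze the VAE; train the DiT denoiser with the flow matching
# loss on linear interpolants z_t = (1 - t) * noise + t * z_data (SiT-style).
vae.requires_grad_(False)
dit_opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)
for counts, cond in cond_loader:
    with torch.no_grad():
        z1, _ = vae.encode(counts)         # data endpoint in latent space
    z0 = torch.randn_like(z1)              # noise endpoint
    t = torch.rand(z1.shape[0], 1, 1)
    zt = (1 - t) * z0 + t * z1             # linear interpolant
    target_v = z1 - z0                     # constant velocity of the linear path
    loss = F.mse_loss(denoiser(zt, t.squeeze(), cond), target_v)
    dit_opt.zero_grad(); loss.backward(); dit_opt.step()
```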
Training Code
https://github.com/czi-ai/scldm
Training Hyperparameters
All hyperparameters are presented in the appendix of: Palla et al. Scalable Single-Cell Gene Expression Generation with Latent Diffusion Models (2025) arXiv (coming soon).
Data Sources
The following datasets were used for training and evaluation: Dentate Gyrus, Tabula Muris, the Human Lung Cell Atlas (HLCA), Parse 1M, Replogle, and Human Census Data (see Training Data above).
Performance Metrics
Metrics
In our experiments, we used two sets of metrics:
- Reconstruction Metrics: We use the reconstruction error under the Negative Binomial distribution, the Pearson Correlation Coefficient (PCC), and the Mean Squared Error (MSE).
- Generation Metrics: To evaluate the generation capabilities of the models, we use the Maximum Mean Discrepancy (MMD) with an RBF kernel, the Wasserstein Distance, and the Fréchet Distance, all calculated on 30 principal components. We compute the PCA on the true data and project the generated data using the same loadings. All evaluations were run using three seeds; a sketch of this procedure follows below.
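For concreteness, the following sketch computes two of these metrics with their standard formulas, fitting the PCA on real data only and projecting the generated cells with the same loadings. The arrays `x_true` and `x_gen` are placeholders, and this is not the authors' exact evaluation code.

```python
# Sketch of generation metrics on 30 PCs (standard formulas; not the
# authors' exact evaluation code). `x_true`, `x_gen` are placeholders.
import numpy as np
from scipy import linalg
from sklearn.decomposition import PCA

def frechet_distance(a, b):
    """Fréchet distance between Gaussians fit to two point sets."""
    mu_a, mu_b = a.mean(0), b.mean(0)
    cov_a, cov_b = np.cov(a, rowvar=False), np.cov(b, rowvar=False)
    covmean = linalg.sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real               # drop numerical imaginary parts
    return ((mu_a - mu_b) ** 2).sum() + np.trace(cov_a + cov_b - 2 * covmean)

def mmd_rbf(a, b, gamma=1.0):
    """Biased MMD^2 estimate with an RBF kernel."""
    def k(x, y):
        d = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d)
    return k(a, a).mean() + k(b, b).mean() - 2 * k(a, b).mean()

# Fit the PCA on the true data only, then project the generated cells
# using the same loadings before scoring.
pca = PCA(n_components=30).fit(x_true)       # x_true: (cells, genes)
t, g = pca.transform(x_true), pca.transform(x_gen)
print(frechet_distance(t, g), mmd_rbf(t, g))
```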
Evaluation Datasets
In all experiments, we used separate test subsets for each dataset used for training.
Evaluation Baselines
In our experiments, we used the following baselines:
- Single-cell Variational Inference (scVI) (Lopez et al., 2018) is a VAE-based generative model designed for single-cell discrete data.
- scDiffusion (Luo et al., 2024) is a latent diffusion model for single-cell gene expression data.
- CFGen is a current state-of-the-art latent diffusion model that builds upon scVI, training a latent flow matching model in the VAE's latent space (Palma et al., 2025).
- Compositional Perturbation Autoencoder (CPA) (Lotfollahi et al., 2023) is a deep generative model developed to predict gene expression changes under perturbations and their combinations.
Evaluation Results
- We achieved state-of-the-art performance on five cell generation benchmarks, on both observational and perturbational data, as well as on two classification downstream tasks.
- In our experiments, we demonstrated that enforcing the inductive bias of exchangeability is critical for generative modeling of single-cell data.
Biases, Risks, and Limitations
Potential Biases
- The model may reflect biases present in the training data.
- Certain demographic groups may be underrepresented.
Risks
- Inaccurate outputs: Like all generative models, this model can produce outputs that appear plausible but are incorrect.
- Potential misuse for incorrect biological interpretations: The model's outputs could be misinterpreted as confirmed facts, leading to flawed experimental designs or incorrect scientific conclusions if not properly validated.
Limitations
- For now, the model is limited to a single species.
- The model has not been thoroughly tested for generalization across cell types, tissues, and other contexts.
- The model has not been thoroughly tested on perturbational data (e.g., chemical perturbations).
Caveats and Recommendations
- Always review and validate outputs generated by the model.
- Treat model outputs as machine-generated hypotheses that require further experimental validation, not as established biological facts.
- We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when using the model.
Should you have any security or privacy issues or questions related to this model, please reach out to our team at security@chanzuckerberg.com or privacy@chanzuckerberg.com respectively.
Acknowledgements
The authors would like to thank Isaac Virshup and Lakshmi Krishnan for insightful discussions. We also thank the AI Infrastructure team for compute resources. Finally, we thank Steve Herrin for model packaging and release.