scGenePT

Version 1.0, released 18 Oct 2024

Developed By
  • Ana-Maria Istrate (Chan Zuckerberg Initiative)

scGenePT is a collection of single-cell models for perturbation prediction. It leverages the scGPT foundation model for scRNA-seq data by injecting language embeddings at the gene level into the model architecture. These gene-level language embeddings are obtained by encoding gene-level information from different knowledge sources with LLMs.

Model Details

Model Architecture

Transformer-based architecture, inherited from the original scGPT model and modified to incorporate language embeddings as an additional layer in the gene representation.
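As an illustration of this design, the sketch below shows one way such an injection could look: a language embedding is added to the gene token and expression value embeddings before the transformer layers. This is a simplified assumption, not the exact scGenePT implementation; module and parameter names (GeneRepresentation, language_proj, etc.) are hypothetical.

import torch
import torch.nn as nn

class GeneRepresentation(nn.Module):
    """Simplified sketch (not the exact scGenePT implementation): the gene
    representation is the sum of learned token, expression value, and
    LLM-derived language embeddings."""

    def __init__(self, n_genes: int, d_model: int, language_emb: torch.Tensor):
        super().__init__()
        self.token_emb = nn.Embedding(n_genes, d_model)   # learned gene tokens (as in scGPT)
        self.value_proj = nn.Linear(1, d_model)           # expression value embedding
        # frozen LLM-derived gene embeddings, projected into the model dimension
        self.language_emb = nn.Embedding.from_pretrained(language_emb, freeze=True)
        self.language_proj = nn.Linear(language_emb.shape[1], d_model)

    def forward(self, gene_ids: torch.Tensor, expr_values: torch.Tensor) -> torch.Tensor:
        return (
            self.token_emb(gene_ids)
            + self.value_proj(expr_values.unsqueeze(-1))
            + self.language_proj(self.language_emb(gene_ids))
        )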

Parameters

51.3M parameters

Fine Tuned From Model

Fine-tuned from scGPT whole-human: scGPT GitHub Repo, scGPT whole-human model weights

Model URI

aws s3 ls --no-sign-request s3://czi-scgenept-public/models/finetuned/
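For programmatic access, a minimal sketch using boto3 with anonymous (unsigned) requests is shown below; it mirrors the CLI listing above and assumes only that the bucket and prefix are as given in the URI.

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) S3 client; bucket and prefix are taken from the URI above
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
response = s3.list_objects_v2(Bucket="czi-scgenept-public", Prefix="models/finetuned/")
for obj in response.get("Contents", []):
    print(obj["Key"])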

Model Variations

The models and corresponding knowledge sources for gene representations are:

  • scGenePT_NCBI (scGPT + NCBI Gene Card Summaries)
  • scGenePT_NCBI+UniProt (scGPT + NCBI Gene Card Summaries + UniProt protein summaries)
  • scGenePT_GO_F (scGPT + GO Gene Ontology Annotations - Gene Molecular Function)
  • scGenePT_GO_P (scGPT + GO Gene Ontology Annotations - Gene Biological Process)
  • scGenePT_GO_C (scGPT + GO Gene Ontology Annotations - Gene Cellular Component)
  • scGenePT_GO_all (scGPT + GO-F + GO-P + GO-C)

Citation

https://www.biorxiv.org/content/10.1101/2024.10.23.619972v1

Model Card Author

Ana-Maria Istrate (Chan Zuckerberg Initiative)

Model Card Contact

virtualcellmodels@chanzuckerberg.com

Intended Use

Primary Use Cases

  • Single and two-gene perturbation prediction
    • Models trained on Adamson: single-gene
    • Models trained on Norman: single-gene, two-gene

Out-of-Scope or Unauthorized Use Cases

Do not use the model for the following purposes:

  • Single-cell tasks outside perturbation prediction, such as:
    • Cell-type annotation
    • Batch integration
    • Metadata prediction
  • Perturbation prediction in settings with more than two perturbed genes. The models have not been tested in these cases, so behavior is unknown
  • Any use that is prohibited by the Acceptable Use Policy or MIT License

Training Details

Training Data

Models have been trained on single and two-gene perturbation datasets:

  • Norman Dataset [4]: 91,205 observations; 105 unique single-gene perturbations (48,407 observations) and 131 unique two-gene perturbations (35,445 observations)
  • Adamson Dataset [5]: 68,603 observations; 86 unique single-gene perturbations

Training Procedure

Models are fine-tuned on the train splits of the datasets above. Train/val/test splits are obtained from the GEARS package v0.0.2 (see Preprocessing below), separately for each dataset.

Preprocessing

Dataloaders containing dataset preprocessing, as well as train/val/test splits for both the Adamson and Norman datasets, have been obtained from GEARS v0.0.2. The data in both datasets is log-normalized and filtered to the top 5,000 highly variable genes. Code to retrieve the dataset splits from GEARS [3]:

from gears import PertData, GEARS

pert_data = PertData('./data')                                 # data download / cache directory
pert_data.load(data_name='norman')                             # or data_name='adamson'
pert_data.prepare_split(split='simulation', seed=1)            # GEARS train/val/test split
pert_data.get_dataloader(batch_size=32, test_batch_size=128)   # build train/val/test dataloaders

Speeds, Sizes, Times

  • ~1 hr training time on a single NVIDIA H100 GPU to fine-tune on one dataset for 20 epochs

Training Hyperparameters

  • Adam optimizer, learning rate=1e-4, StepLR (gamma = 0.9), batch_size = 64, dropout=0.2, num_epochs=20
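As a reference, a minimal sketch of a fine-tuning loop wired up with these hyperparameters is shown below; the model and dataloader are hypothetical placeholders, and stepping the scheduler once per epoch (step_size=1) is an assumption.

import torch

def finetune(model: torch.nn.Module, train_loader, num_epochs: int = 20):
    """Sketch of a fine-tuning loop using the hyperparameters listed above;
    `model` and `train_loader` are hypothetical placeholders."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    # step_size=1 (decay once per epoch) is an assumption; only gamma=0.9 is stated above
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)
    criterion = torch.nn.MSELoss()

    for epoch in range(num_epochs):
        model.train()
        for inputs, targets in train_loader:     # batches of size 64
            optimizer.zero_grad()
            pred = model(inputs)                 # predicted post-perturbation expression
            loss = criterion(pred, targets)      # MSE against ground truth
            loss.backward()
            optimizer.step()
        scheduler.step()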

Performance Metrics

Evaluation Protocols

Models are evaluated during training on the validation splits; the checkpoint with the lowest validation MSE is kept as the best model. Final metrics are reported on the test splits, which are not seen during training.

Evaluation Metrics

Models were evaluated using a range of metrics to measure performance. Key metrics include:

  • MSE = Mean Squared Error between predicted and ground-truth post-perturbation expression
  • MSETop20 = MSE on the Top 20 Differentially Expressed genes
  • Pearson Correlation Score Delta = Pearson correlation between ground-truth and predicted post-perturbation expression changes, relative to control
  • Pearson Correlation Score DeltaTop20 = Pearson Correlation Score Delta on the Top 20 Differentially Expressed genes

For full details on metrics computation, please refer to the preprint.
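A rough sketch of how these metrics can be computed is shown below, assuming numpy arrays of mean post-perturbation expression for prediction, ground truth, and control; array names and the top-20 gene index set (top20_idx) are assumptions, and the exact computation follows the preprint.

import numpy as np
from scipy.stats import pearsonr

def mse(pred: np.ndarray, truth: np.ndarray) -> float:
    """Mean squared error between predicted and ground-truth expression."""
    return float(np.mean((pred - truth) ** 2))

def pearson_delta(pred: np.ndarray, truth: np.ndarray, ctrl: np.ndarray) -> float:
    """Pearson correlation between predicted and ground-truth changes relative to control."""
    return float(pearsonr(pred - ctrl, truth - ctrl)[0])

# The Top20 variants restrict the computation to the indices of the top 20
# differentially expressed genes (top20_idx is a hypothetical index array), e.g.:
#   mse(pred[top20_idx], truth[top20_idx])
#   pearson_delta(pred[top20_idx], truth[top20_idx], ctrl[top20_idx])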

Evaluation Datasets

  • Models trained on Adamson train split have been evaluated on the Adamson test split
  • Models trained on Norman train split have been evaluated on the Norman test split
  • The train/val/test splits have been obtained from GEARS v0.0.2, as described above under Preprocessing

Evaluation Results

| Model | Dataset | Pearson Correlation Score Delta | Pearson Correlation Score DeltaTop20 | MSE | MSETop20 |
|---|---|---|---|---|---|
| scGPT | Norman | 0.534±0.02 | 0.665±0.01 | 0.00421±0.00 | 0.223±0.01 |
| scGenePT_NCBI | Norman | 0.548±0.03 | 0.685±0.03 | 0.00415±0.00 | 0.223±0.03 |
| scGenePT_NCBI+UniProt | Norman | 0.557±0.02 | 0.696±0.01 | 0.00403±0.00 | 0.205±0.02 |
| scGenePT_GO_F | Norman | 0.554±0.02 | 0.686±0.03 | 0.00405±0.00 | 0.216±0.02 |
| scGenePT_GO_C | Norman | 0.550±0.02 | 0.687±0.02 | 0.00405±0.00 | 0.219±0.01 |
| scGenePT_GO_P | Norman | 0.543±0.02 | 0.682±0.02 | 0.00412±0.00 | 0.220±0.02 |
| scGenePT_GO_all | Norman | 0.554±0.02 | 0.698±0.02 | 0.00400±0.00 | 0.209±0.02 |

| Model | Dataset | Pearson Correlation Score Delta | Pearson Correlation Score DeltaTop20 | MSE | MSETop20 |
|---|---|---|---|---|---|
| scGPT | Adamson | 0.589±0.03 | 0.782±0.02 | 0.00672±0.00 | 0.135±0.01 |
| scGenePT_NCBI | Adamson | 0.606±0.03 | 0.779±0.02 | 0.00654±0.00 | 0.133±0.00 |
| scGenePT_NCBI+UniProt | Adamson | 0.617±0.02 | 0.784±0.02 | 0.00620±0.00 | 0.129±0.00 |
| scGenePT_GO_F | Adamson | 0.611±0.03 | 0.785±0.02 | 0.00640±0.00 | 0.128±0.00 |
| scGenePT_GO_C | Adamson | 0.623±0.02 | 0.791±0.01 | 0.00622±0.00 | 0.125±0.00 |
| scGenePT_GO_P | Adamson | 0.609±0.03 | 0.789±0.01 | 0.00645±0.00 | 0.127±0.00 |
| scGenePT_GO_all | Adamson | 0.605±0.03 | 0.787±0.02 | 0.00641±0.00 | 0.127±0.01 |

Bias, Risks, and Limitations

Potential Biases

  • Potential dataset issues in the Norman train/val/test split
    • During our analyses, we noticed that the "0 seen of 2" test split of the Norman dataset (two-gene perturbations where neither gene was seen during training) leads to high performance across a number of different baselines, including predicting the effect of a random perturbation and the non-ctrl-mean baseline. We believe it is possible that there is a dataset issue, although further investigation is needed. We offer a detailed discussion of this in the manuscript.

Limitations

  • Models may not work well on perturbations involving more than two genes; prediction in these settings has not been evaluated
  • The evaluation metrics are commonly used in the field of perturbation prediction, but they have known limitations, as acknowledged in the field. As perturbation benchmarking evolves, we will continue to re-evaluate the models.

Caveats and Recommendations

  • We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when engaging with the models.
  • Models should be used to generate hypotheses about perturbation responses in single- and two-gene perturbation settings. These hypotheses should be verified further before making any decisions that might affect patient outcomes.


References

Responsible Use

We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when engaging with our services.

Should you have any security or privacy issues or questions related to the services, please reach out to our team at security@chanzuckerberg.com or privacy@chanzuckerberg.com respectively.