scGenePT
Version v1.0 released 18 Oct 2024
- Ana-Maria Istrate (Chan Zuckerberg Initiative)
scGenePT is a collection of single-cell models for perturbation prediction. It builds on the scGPT foundation model for scRNA-seq data by injecting gene-level language embeddings into the model architecture. These language embeddings are obtained by using LLMs to embed gene-level information from different knowledge sources.
Model Details
Model Architecture
Transformer-based architecture. The model inherits the original scGPT architecture and modifies it by incorporating language embeddings as an additional layer for gene representation.
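As a rough illustration of what an additional language-embedding layer for gene representation can look like, the pure-Python sketch below projects a precomputed language embedding for one gene to the model width and adds it to that gene's learned token embedding. The element-wise sum, the toy dimensions, and the names `project` and `gene_representation` are assumptions made for illustration; they are not taken from the scGenePT code.

```python
import random

def project(vec, weight):
    """Linear projection: weight has shape (d_out, d_in), vec has length d_in."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def gene_representation(token_emb, lang_emb, weight):
    """Combine a learned gene token embedding with a projected language embedding.

    Assumption: the two vectors are summed element-wise after projection.
    """
    projected = project(lang_emb, weight)
    return [t + p for t, p in zip(token_emb, projected)]

random.seed(0)
d_model, d_lang = 4, 6  # toy sizes; real models are much wider
token_emb = [random.gauss(0, 1) for _ in range(d_model)]
lang_emb = [random.gauss(0, 1) for _ in range(d_lang)]
weight = [[random.gauss(0, 0.1) for _ in range(d_lang)] for _ in range(d_model)]

rep = gene_representation(token_emb, lang_emb, weight)
print(len(rep))  # one value per model dimension
```

In a trained model the projection weights would be learned during fine-tuning, while the language embeddings themselves would be computed once per gene from the chosen knowledge source.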
Parameters
51.3M parameters
Fine Tuned From Model
Fine-tuned from the scGPT whole-human model: scGPT GitHub Repo, scGPT whole-human model weights
Model URI
aws s3 ls --no-sign-request s3://czi-scgenept-public/models/finetuned/
Model Variations
The models and corresponding knowledge sources for gene representations are:
- scGenePT_NCBI (scGPT + NCBI Gene Card Summaries)
- scGenePT_NCBI+UniProt (scGPT + NCBI Gene Card Summaries + UniProt protein summaries)
- scGenePT_GO_F (scGPT + GO Gene Ontology Annotations - Gene Molecular Function)
- scGenePT_GO_P (scGPT + GO Gene Ontology Annotations - Gene Biological Process)
- scGenePT_GO_C (scGPT + GO Gene Ontology Annotations - Gene Cellular Component)
- scGenePT_GO_all (scGPT + GO-F + GO-P + GO-C)
Citation
https://www.biorxiv.org/content/10.1101/2024.10.23.619972v1
Model Card Author
Ana-Maria Istrate (Chan Zuckerberg Initiative)
Model Card Contact
virtualcellmodels@chanzuckerberg.com
Intended Use
Primary Use Cases
- Single-gene and two-gene perturbation prediction
  - Models trained on Adamson: single-gene
  - Models trained on Norman: single-gene, two-gene
Out-of-Scope or Unauthorized Use Cases
Do not use the model for the following purposes:
- Single-cell tasks outside perturbation prediction, such as:
  - Cell-type annotation
  - Batch integration
  - Metadata prediction
  - Any other single-cell task outside of perturbation prediction
- Perturbation prediction for multiple gene settings. The models have not been tested in these cases, so behavior is unknown
- Any use that is prohibited by the Acceptable Use Policy or MIT License
Training Details
Training Data
Models have been trained on single and two-gene perturbation datasets:
- Norman Dataset [4]: 91205 observations, 105 unique single-gene perturbations (48407 observations) and 131 unique two-gene perturbations (35445 observations)
- Adamson Dataset [5]: 68603 observations, 86 unique single-gene perturbations
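The Norman counts above do not sum to the stated total, so a quick arithmetic check helps when validating a downloaded copy. The sketch below only subtracts the numbers given in this card; reading the remainder as unperturbed control cells is an assumption, since the card does not say what those observations are.

```python
# Counts copied from the Norman entry in this model card.
norman_total = 91205
norman_single_gene = 48407
norman_two_gene = 35445

# Observations not accounted for by single- or two-gene perturbations.
# Assumption: these are unperturbed control cells.
norman_remaining = norman_total - norman_single_gene - norman_two_gene

print(f"perturbed observations: {norman_single_gene + norman_two_gene}")
print(f"remaining observations: {norman_remaining}")
```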
Training Procedure
scGenePT is fine-tuned on the train splits of the datasets above. Train/val/test splits are obtained separately for each dataset from the GEARS package, v0.0.2 (see Preprocessing below).
Preprocessing
Dataloaders, which include the dataset preprocessing and the train/val/test splits for both the Adamson and Norman datasets, are obtained from GEARS v0.0.2. The data in both datasets is log-normalized and filtered to the top 5000 highly variable genes. Code to retrieve the dataset splits from GEARS [3]:
from gears import PertData, GEARS
pert_data = PertData('./data')
pert_data.load(data_name = 'norman') # or adamson
pert_data.prepare_split(split = 'simulation', seed = 1)
pert_data.get_dataloader(batch_size = 32, test_batch_size = 128)
Speeds, Sizes, Times
- Approximately 1 hour to fine-tune on one dataset for 20 epochs on a single NVIDIA H100 GPU
Training Hyperparameters
- Adam optimizer, learning rate=1e-4, StepLR (gamma = 0.9), batch_size = 64, dropout=0.2, num_epochs=20
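To make the effect of the step decay concrete, the sketch below computes the learning rate used in each of the 20 epochs from the base rate and gamma listed above. The card does not state the StepLR step size, so `step_size=1` (decay after every epoch) is an assumption, and `step_lr_schedule` is a hypothetical helper rather than part of any library.

```python
def step_lr_schedule(base_lr, gamma, num_epochs, step_size=1):
    """Learning rate per epoch under step decay.

    The rate is multiplied by `gamma` once every `step_size` epochs.
    Assumption: step_size=1, which the model card does not specify.
    """
    return [base_lr * gamma ** (epoch // step_size) for epoch in range(num_epochs)]

lrs = step_lr_schedule(base_lr=1e-4, gamma=0.9, num_epochs=20)
print(f"first epoch: {lrs[0]:.2e}, last epoch: {lrs[-1]:.2e}")
```

Under this assumption the rate in the final epoch is the base rate multiplied by 0.9 raised to the 19th power, so a larger step size would leave the rate noticeably higher at the end of training.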
Performance Metrics
Evaluation Protocols
Models are evaluated on the validation splits during training, and the checkpoint with the lowest validation MSE is kept as the best model. Final metrics are reported on the test splits, which are not seen during training.
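The selection rule described above can be sketched in a few lines. The validation losses below are made-up placeholders rather than real results, and `select_best_epoch` is a hypothetical helper, not a function from the scGenePT code.

```python
def select_best_epoch(val_mse_per_epoch):
    """Return (epoch_index, mse) for the lowest validation MSE.

    Ties are resolved in favor of the earliest epoch.
    """
    best_epoch = min(range(len(val_mse_per_epoch)), key=val_mse_per_epoch.__getitem__)
    return best_epoch, val_mse_per_epoch[best_epoch]

# Hypothetical validation losses, one per epoch (not real results).
val_mse = [0.0052, 0.0047, 0.0044, 0.0045, 0.0043, 0.0046]
best_epoch, best_mse = select_best_epoch(val_mse)
print(f"keep checkpoint from epoch {best_epoch} (val MSE {best_mse})")
```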
Evaluation Metrics
Models were evaluated using the following key metrics:
- MSE = Mean Squared Error
- MSETop20 = MSE computed on the top 20 differentially expressed genes
- Pearson Correlation Score Delta = Pearson correlation between the predicted and ground-truth post-perturbation expression changes, each taken relative to control
- Pearson Correlation Score DeltaTop20 = Pearson Correlation Score Delta computed on the top 20 differentially expressed genes
For full details on metrics computation, please refer to the preprint.
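As a minimal sketch of how these four metrics relate to one another, the pure-Python example below computes them on a toy five-gene example. It assumes each input is a per-gene mean expression vector for a single perturbation and that the differentially expressed genes are supplied as a list of indices; the preprint remains the authoritative description, and all function names and values here are illustrative.

```python
import math

def mse(pred, truth):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred)

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def pearson_delta(pred, truth, ctrl):
    """Pearson correlation of expression changes relative to control."""
    pred_delta = [p - c for p, c in zip(pred, ctrl)]
    truth_delta = [t - c for t, c in zip(truth, ctrl)]
    return pearson(pred_delta, truth_delta)

def take(values, idx):
    """Select the entries of `values` at the given gene indices."""
    return [values[i] for i in idx]

# Toy per-gene mean expression values (illustrative only).
ctrl = [1.0, 2.0, 0.5, 3.0, 1.5]
truth = [1.2, 1.0, 0.5, 3.5, 1.4]
pred = [1.1, 1.3, 0.6, 3.3, 1.5]
top_de = [1, 3]  # hypothetical indices of the top differentially expressed genes

print("MSE:", mse(pred, truth))
print("MSE on DE genes:", mse(take(pred, top_de), take(truth, top_de)))
print("Pearson delta:", pearson_delta(pred, truth, ctrl))
print("Pearson delta on DE genes:",
      pearson_delta(take(pred, top_de), take(truth, top_de), take(ctrl, top_de)))
```

Subtracting control before correlating matters because raw expression profiles are dominated by baseline expression levels, which would make almost any prediction look well correlated with the ground truth.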
Evaluation Datasets
- Models trained on Adamson train split have been evaluated on the Adamson test split
- Models trained on Norman train split have been evaluated on the Norman test split
- The train/val/test splits were obtained from GEARS v0.0.2, as described above under Preprocessing
Evaluation Results
Model | Dataset | Pearson Correlation Score Delta | Pearson Correlation Score DeltaTop20 | MSE | MSETop20 |
---|---|---|---|---|---|
scGPT | Norman | 0.534±0.02 | 0.665±0.01 | 0.00421±0.00 | 0.223±0.01 |
scGenePT_NCBI | Norman | 0.548±0.03 | 0.685±0.03 | 0.00415±0.00 | 0.223±0.03 |
scGenePT_NCBI+UniProt | Norman | 0.557±0.02 | 0.696±0.01 | 0.00403±0.00 | 0.205±0.02 |
scGenePT_GO_F | Norman | 0.554±0.02 | 0.686±0.03 | 0.00405±0.00 | 0.216±0.02 |
scGenePT_GO_C | Norman | 0.550±0.02 | 0.687±0.02 | 0.00405±0.00 | 0.219±0.01 |
scGenePT_GO_P | Norman | 0.543±0.02 | 0.682±0.02 | 0.00412±0.00 | 0.220±0.02 |
scGenePT_GO_all | Norman | 0.554±0.02 | 0.698±0.02 | 0.00400±0.00 | 0.209±0.02 |
Model | Dataset | Pearson Correlation Score Delta | Pearson Correlation Score DeltaTop20 | MSE | MSETop20 |
---|---|---|---|---|---|
scGPT | Adamson | 0.589±0.03 | 0.782±0.02 | 0.00672±0.00 | 0.135±0.01 |
scGenePT_NCBI | Adamson | 0.606±0.03 | 0.779±0.02 | 0.00654±0.00 | 0.133±0.00 |
scGenePT_NCBI+UniProt | Adamson | 0.617±0.02 | 0.784±0.02 | 0.00620±0.00 | 0.129±0.00 |
scGenePT_GO_F | Adamson | 0.611±0.03 | 0.785±0.02 | 0.00640±0.00 | 0.128±0.00 |
scGenePT_GO_C | Adamson | 0.623±0.02 | 0.791±0.01 | 0.00622±0.00 | 0.125±0.00 |
scGenePT_GO_P | Adamson | 0.609±0.03 | 0.789±0.01 | 0.00645±0.00 | 0.127±0.00 |
scGenePT_GO_all | Adamson | 0.605±0.03 | 0.787±0.02 | 0.00641±0.00 | 0.127±0.01 |
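When comparing rows, it can help to express a difference as a relative change against the scGPT baseline. The sketch below does this for two metrics using mean values copied from the Norman table; it ignores the reported ± spread, so small differences should be read with caution. `relative_change` is a hypothetical helper.

```python
def relative_change(baseline, value):
    """Percentage change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100.0

# Mean values copied from the Norman results table (spread not included).
scgpt = {"pearson_delta": 0.534, "mse_top20": 0.223}
ncbi_uniprot = {"pearson_delta": 0.557, "mse_top20": 0.205}

for metric in scgpt:
    change = relative_change(scgpt[metric], ncbi_uniprot[metric])
    print(f"{metric}: {change:+.2f}% vs scGPT")
```

A positive change is an improvement for the Pearson metrics, while a negative change is an improvement for the MSE metrics.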
Evaluation Metrics URL
- Preprint (https://www.biorxiv.org/content/10.1101/2024.10.23.619972v1.full), Section 4: Evaluation
Bias, Risks, and Limitations
Potential Biases
- Potential dataset issues in the Norman train/val/test split
  - During our analyses, we noticed that on the "0 seen of 2" test split of the Norman dataset, a number of different baselines reach high performance, including predicting the effect of a random perturbation and the non-ctrl-mean baseline. We believe a dataset issue is possible, although further investigation is needed. The manuscript discusses this in detail.
Limitations
- Models may not work well for multi-gene perturbations beyond the single-gene and two-gene settings used in training. Prediction quality in that setting has not been evaluated.
- The evaluation metrics are commonly used in the field of perturbation prediction, but they have limitations that the field acknowledges. As perturbation benchmarking evolves, we will continue to re-evaluate the models.
Caveats and Recommendations
- We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when engaging with the models.
- Models should be used to generate hypotheses of perturbation predictions for single and two-gene perturbation settings. These hypotheses should be further verified before making any decisions that might affect patient outcomes.
Should you have any security or privacy issues or questions related to this model, please reach out to our team at security@chanzuckerberg.com or privacy@chanzuckerberg.com respectively.
References
- [1] Cui, Haotian, et al. "scGPT: toward building a foundation model for single-cell multi-omics using generative AI." Nature Methods (2024): 1-11.
- [2] Chen, Yiqun, and James Zou. "GenePT: a simple but effective foundation model for genes and cells built from ChatGPT." bioRxiv (2024): 2023-10.
- [3] Roohani, Yusuf, Kexin Huang, and Jure Leskovec. "Predicting transcriptional outcomes of novel multigene perturbations with GEARS." Nature Biotechnology 42.6 (2024): 927-935.
- [4] Thomas M Norman, Max A Horlbeck, Joseph M Replogle, Alex Y Ge, Albert Xu, Marco Jost, Luke A Gilbert, and Jonathan S Weissman. Exploring genetic interaction manifolds constructed from rich single-cell phenotypes. Science, 365(6455):786–793, 2019.
- [5] Britt Adamson, Thomas M Norman, Marco Jost, Min Y Cho, James K Nuñez, Yuwen Chen, Jacqueline E Villalta, Luke A Gilbert, Max A Horlbeck, Marco Y Hein, et al. A multiplexed single-cell CRISPR screening platform enables systematic dissection of the unfolded protein response. Cell, 167(7):1867–1882, 2016.