scGenePT

Version 1.0, released 18 Oct 2024

Developed By
  • Ana-Maria Istrate (Chan Zuckerberg Initiative)

scGenePT is a collection of single-cell models for perturbation prediction. It leverages the scGPT foundation model for scRNA-seq data by injecting language embeddings at the gene level into the model architecture. These gene-level language embeddings are obtained by encoding gene-level information from different knowledge sources with LLMs.

Model Details

Model Architecture

Transformer-based architecture, inherited from the original scGPT model and modified to incorporate language embeddings as an additional layer in the gene representation.
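As an illustration of this design, the sketch below shows one way such an injection could look: a language embedding is added to the gene token and expression value embeddings before the transformer layers. This is a simplified assumption, not the exact scGenePT implementation; module and parameter names (GeneRepresentation, language_proj, etc.) are hypothetical.

import torch
import torch.nn as nn

class GeneRepresentation(nn.Module):
    """Simplified sketch (not the exact scGenePT implementation): the gene
    representation is the sum of learned token, expression value, and
    LLM-derived language embeddings."""

    def __init__(self, n_genes: int, d_model: int, language_emb: torch.Tensor):
        super().__init__()
        self.token_emb = nn.Embedding(n_genes, d_model)   # learned gene tokens (as in scGPT)
        self.value_proj = nn.Linear(1, d_model)           # expression value embedding
        # frozen LLM-derived gene embeddings, projected into the model dimension
        self.language_emb = nn.Embedding.from_pretrained(language_emb, freeze=True)
        self.language_proj = nn.Linear(language_emb.shape[1], d_model)

    def forward(self, gene_ids: torch.Tensor, expr_values: torch.Tensor) -> torch.Tensor:
        return (
            self.token_emb(gene_ids)
            + self.value_proj(expr_values.unsqueeze(-1))
            + self.language_proj(self.language_emb(gene_ids))
        )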

Parameters

51.3M parameters

Fine Tuned From Model

Fine-tuned from scGPT whole-human: scGPT GitHub Repo, scGPT whole-human model weights

Model URI

aws s3 ls --no-sign-request s3://czi-scgenept-public/models/finetuned/
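For programmatic access, a minimal sketch using boto3 with anonymous (unsigned) requests is shown below; it mirrors the CLI listing above and assumes only that the bucket and prefix are as given in the URI.

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) S3 client; bucket and prefix are taken from the URI above
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
response = s3.list_objects_v2(Bucket="czi-scgenept-public", Prefix="models/finetuned/")
for obj in response.get("Contents", []):
    print(obj["Key"])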

Model Variations

The models and corresponding knowledge sources for gene representations are:

  • scGenePT_NCBI (scGPT + NCBI Gene Card Summaries)
  • scGenePT_NCBI+UniProt (scGPT + NCBI Gene Card Summaries + UniProt protein summaries)
  • scGenePT_GO_F (scGPT + GO Gene Ontology Annotations - Gene Molecular Function)
  • scGenePT_GO_P (scGPT + GO Gene Ontology Annotations - Gene Biological Process)
  • scGenePT_GO_C (scGPT + GO Gene Ontology Annotations - Gene Cellular Component)
  • scGenePT_GO_all (scGPT + GO-F + GO-P + GO-C)

Citation

https://www.biorxiv.org/content/10.1101/2024.10.23.619972v1

Model Card Author

Ana-Maria Istrate (Chan Zuckerberg Initiative)

Model Card Contact

virtualcellmodels@chanzuckerberg.com

Intended Use

Primary Use Cases

  • Single and two-gene perturbation prediction
    • Models trained on Adamson: single-gene
    • Models trained on Norman: single-gene, two-gene

Out-of-Scope or Unauthorized Use Cases

Do not use the model for the following purposes:

  • Single-cell tasks outside perturbation prediction, such as:
    • Cell-type annotation
    • Batch integration
    • Metadata prediction
  • Perturbation prediction in settings with more than two perturbed genes. The models have not been tested in these cases, so behavior is unknown
  • Any use that is prohibited by the Acceptable Use Policy or MIT License

Training Details

Training Data

Models have been trained on single and two-gene perturbation datasets:

  • Norman Dataset [4]: 91,205 observations; 105 unique single-gene perturbations (48,407 observations) and 131 unique two-gene perturbations (35,445 observations)
  • Adamson Dataset [5]: 68,603 observations; 86 unique single-gene perturbations

Training Procedure

Models are fine-tuned on the train splits of the datasets above. Train/val/test splits are obtained from the GEARS package v0.0.2 (see Preprocessing below), separately for each dataset.

Preprocessing

Dataloaders containing dataset preprocessing, as well as train/val/test splits for both the Adamson and Norman datasets, have been obtained from GEARS v0.0.2. The data in both datasets is log-normalized and filtered to the top 5,000 highly variable genes. Code to retrieve the dataset splits from GEARS [3]:

from gears import PertData, GEARS

pert_data = PertData('./data')                                 # data download / cache directory
pert_data.load(data_name='norman')                             # or data_name='adamson'
pert_data.prepare_split(split='simulation', seed=1)            # GEARS train/val/test split
pert_data.get_dataloader(batch_size=32, test_batch_size=128)   # build train/val/test dataloaders

Speeds, Sizes, Times

  • ~1 hr training time on a single NVIDIA H100 GPU to fine-tune on one dataset for 20 epochs

Training Hyperparameters

  • Adam optimizer, learning rate=1e-4, StepLR (gamma = 0.9), batch_size = 64, dropout=0.2, num_epochs=20
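As a reference, a minimal sketch of a fine-tuning loop wired up with these hyperparameters is shown below; the model and dataloader are hypothetical placeholders, and stepping the scheduler once per epoch (step_size=1) is an assumption.

import torch

def finetune(model: torch.nn.Module, train_loader, num_epochs: int = 20):
    """Sketch of a fine-tuning loop using the hyperparameters listed above;
    `model` and `train_loader` are hypothetical placeholders."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    # step_size=1 (decay once per epoch) is an assumption; only gamma=0.9 is stated above
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)
    criterion = torch.nn.MSELoss()

    for epoch in range(num_epochs):
        model.train()
        for inputs, targets in train_loader:     # batches of size 64
            optimizer.zero_grad()
            pred = model(inputs)                 # predicted post-perturbation expression
            loss = criterion(pred, targets)      # MSE against ground truth
            loss.backward()
            optimizer.step()
        scheduler.step()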

Performance Metrics

Evaluation Protocols

Models are evaluated during training on the validation splits; the checkpoint with the lowest validation MSE is kept as the best model. Final metrics are reported on the test splits, which are not seen during training.

Evaluation Metrics

Models were evaluated using a range of metrics to measure performance. Key metrics include:

  • MSE = Mean Squared Error between predicted and ground-truth post-perturbation expression
  • MSETop20 = MSE on the Top 20 Differentially Expressed genes
  • Pearson Correlation Score Delta = Pearson correlation between ground-truth and predicted post-perturbation expression changes, relative to control
  • Pearson Correlation Score DeltaTop20 = Pearson Correlation Score Delta on the Top 20 Differentially Expressed genes

For full details on metrics computation, please refer to the preprint.
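A rough sketch of how these metrics can be computed is shown below, assuming numpy arrays of mean post-perturbation expression for prediction, ground truth, and control; array names and the top-20 gene index set (top20_idx) are assumptions, and the exact computation follows the preprint.

import numpy as np
from scipy.stats import pearsonr

def mse(pred: np.ndarray, truth: np.ndarray) -> float:
    """Mean squared error between predicted and ground-truth expression."""
    return float(np.mean((pred - truth) ** 2))

def pearson_delta(pred: np.ndarray, truth: np.ndarray, ctrl: np.ndarray) -> float:
    """Pearson correlation between predicted and ground-truth changes relative to control."""
    return float(pearsonr(pred - ctrl, truth - ctrl)[0])

# The Top20 variants restrict the computation to the indices of the top 20
# differentially expressed genes (top20_idx is a hypothetical index array), e.g.:
#   mse(pred[top20_idx], truth[top20_idx])
#   pearson_delta(pred[top20_idx], truth[top20_idx], ctrl[top20_idx])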

Evaluation Datasets

  • Models trained on Adamson train split have been evaluated on the Adamson test split
  • Models trained on Norman train split have been evaluated on the Norman test split
  • The train/val/test splits have been obtained from GEARS v0.0.2, as described above under Preprocessing

Evaluation Results

| Model | Dataset | Pearson Correlation Score Delta | Pearson Correlation Score DeltaTop20 | MSE | MSETop20 |
|---|---|---|---|---|---|
| scGPT | Norman | 0.534±0.02 | 0.665±0.01 | 0.00421±0.00 | 0.223±0.01 |
| scGenePT_NCBI | Norman | 0.548±0.03 | 0.685±0.03 | 0.00415±0.00 | 0.223±0.03 |
| scGenePT_NCBI+UniProt | Norman | 0.557±0.02 | 0.696±0.01 | 0.00403±0.00 | 0.205±0.02 |
| scGenePT_GO_F | Norman | 0.554±0.02 | 0.686±0.03 | 0.00405±0.00 | 0.216±0.02 |
| scGenePT_GO_C | Norman | 0.550±0.02 | 0.687±0.02 | 0.00405±0.00 | 0.219±0.01 |
| scGenePT_GO_P | Norman | 0.543±0.02 | 0.682±0.02 | 0.00412±0.00 | 0.220±0.02 |
| scGenePT_GO_all | Norman | 0.554±0.02 | 0.698±0.02 | 0.00400±0.00 | 0.209±0.02 |

| Model | Dataset | Pearson Correlation Score Delta | Pearson Correlation Score DeltaTop20 | MSE | MSETop20 |
|---|---|---|---|---|---|
| scGPT | Adamson | 0.589±0.03 | 0.782±0.02 | 0.00672±0.00 | 0.135±0.01 |
| scGenePT_NCBI | Adamson | 0.606±0.03 | 0.779±0.02 | 0.00654±0.00 | 0.133±0.00 |
| scGenePT_NCBI+UniProt | Adamson | 0.617±0.02 | 0.784±0.02 | 0.00620±0.00 | 0.129±0.00 |
| scGenePT_GO_F | Adamson | 0.611±0.03 | 0.785±0.02 | 0.00640±0.00 | 0.128±0.00 |
| scGenePT_GO_C | Adamson | 0.623±0.02 | 0.791±0.01 | 0.00622±0.00 | 0.125±0.00 |
| scGenePT_GO_P | Adamson | 0.609±0.03 | 0.789±0.01 | 0.00645±0.00 | 0.127±0.00 |
| scGenePT_GO_all | Adamson | 0.605±0.03 | 0.787±0.02 | 0.00641±0.00 | 0.127±0.01 |

Bias, Risks, and Limitations

Potential Biases

  • Potential dataset issues in the Norman train/val/test split
    • During our analyses, we noticed that the "0 seen of 2" test split of the Norman dataset (two-gene perturbations where neither gene was seen during training) leads to high performance across a number of different baselines, including predicting the effect of a random perturbation and the non-ctrl-mean baseline. We believe it is possible that there is a dataset issue, although further investigation is needed. We offer a detailed discussion of this in the manuscript.

Limitations

  • Models may not work well on perturbations involving more than two genes; prediction in these settings has not been evaluated
  • The evaluation metrics are commonly used in the field of perturbation prediction, but they have known limitations, as acknowledged in the field. As perturbation benchmarking evolves, we will continue to re-evaluate the models.

Caveats and Recommendations

  • We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when engaging with the models.
  • Models should be used to generate hypotheses about perturbation responses in single- and two-gene perturbation settings. These hypotheses should be verified further before making any decisions that might affect patient outcomes.


References

Responsible Use

We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when engaging with our services.

Should you have any security or privacy issues or questions related to the services, please reach out to our team at security@chanzuckerberg.com or privacy@chanzuckerberg.com respectively.