GREmLN

Version v0.0.1 released 09 Jul 2025

Developed By
  • Mingxuan Zhang (Columbia University)

GREmLN is a graph-aware model for single-cell RNA-seq that integrates gene-regulatory network structure into its attention mechanism to address the missing positional order of gene tokens. By inducing a graph-based ordering, the model learns biologically informed gene embeddings that linearly reconstruct expression profiles and set new benchmarks in predicting unseen cell types and regulatory structures. The approach scales to arbitrary molecular-interaction graphs, gains further accuracy from validated edges and diffusion kernels, and opens avenues for fine-tuning embeddings to model combinatorial perturbations and reveal core regulatory modules through attention-based interpretability.

Model Details

Model Architecture

  • Transformer model dimension: 512
  • Transformer blocks: 3
  • Attention heads per block: 8
  • Activation: GELU
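For orientation, these settings describe a compact transformer backbone. The sketch below is a minimal PyTorch illustration at the same scale only: it uses standard self-attention and an assumed feed-forward width, whereas GREmLN's actual attention mechanism is graph-aware, so this is not the model's implementation.

```python
import torch
import torch.nn as nn

# Illustrative backbone matching the dimensions listed above.
# GREmLN replaces standard self-attention with a graph-aware variant,
# so this sketch only conveys the scale of the model.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=512,           # transformer model dimension
    nhead=8,               # attention heads per block
    dim_feedforward=2048,  # assumed; not stated in this model card
    activation="gelu",
    batch_first=True,
)
backbone = nn.TransformerEncoder(encoder_layer, num_layers=3)  # 3 blocks

# Example forward pass: a batch of 4 cells, each with 1,024 gene tokens.
tokens = torch.randn(4, 1024, 512)
embeddings = backbone(tokens)  # shape: (4, 1024, 512)
```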

Parameters

10.3 million

Citation

Zhang, M., et al. (2025) GREmLN: A Cellular Regulatory Network-Aware Transcriptomics Foundation Model. bioRxiv 2025.07.03.663009; DOI: 10.1101/2025.07.03.663009.

Model Card Author

Mingxuan Zhang

Primary Contact Email

Mingxuan Zhang mz2934@columbia.edu

To submit feature requests or report issues with the model, please open an issue on the GitHub repository.

System Requirements

Inference can be run on a single T4 or L4 GPU. The model was pre-trained on 8 H100 80GB GPUs.

Intended Use

Primary Use Cases

  • Cell type classification
  • Perturbation prediction
  • Gene regulatory network refinement

Out-of-Scope or Unauthorized Use Cases

Do not use the model for the following purposes:

  • Use that violates applicable laws, regulations (including trade compliance laws), or third party rights such as privacy or intellectual property rights.
  • Any use that is prohibited by the MIT license or the Acceptable Use Policy.

Training Details

Training Data

The pre-training dataset consists of 11M scRNA-seq profiles spanning 19K genes from healthy human cells, sourced from the CZ CELLxGENE corpus. Our single-cell dataset covers 162 cell types from a variety of tissues (heart, blood, brain, lung, kidney, intestine, pancreas, and others).

Training Procedure

Our pre-processing pipeline consists of the following steps: (1) quality control by UMI count and log1p normalization of expression values; (2) quantization of expression values into 100 discrete bins for computational efficiency; (3) cell-type-specific gene regulatory network (GRN) generation using the ARACNe algorithm (any GRN inference algorithm can be used here); and (4) curation of single-cell-specific networks from the cell-type-specific GRNs as the subgraph over each cell's expressed genes, followed by splitting the dataset into unseen-graph validation, seen-graph validation, and training sets.
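As an illustration of steps (1) and (2), the snippet below applies log1p normalization and per-cell quantile binning into 100 discrete bins using scanpy and numpy. It is a minimal sketch under stated assumptions, not the project's pipeline code: the input file name, the UMI threshold, and the bin-edge handling are all assumptions.

```python
import numpy as np
import scanpy as sc

adata = sc.read_h5ad("cells.h5ad")  # hypothetical input file

# (1) Quality control by UMI count, then log1p-normalize expression values.
sc.pp.filter_cells(adata, min_counts=500)  # UMI threshold is an assumption
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# (2) Quantize each cell's expression values into 100 discrete bins.
def quantile_bin(row, n_bins=100):
    """Map nonzero expression values to bin indices 1..n_bins; zeros stay 0."""
    binned = np.zeros_like(row, dtype=np.int64)
    nonzero = row > 0
    if nonzero.any():
        edges = np.quantile(row[nonzero], np.linspace(0, 1, n_bins + 1)[1:-1])
        binned[nonzero] = np.digitize(row[nonzero], edges) + 1
    return binned

X = adata.X if isinstance(adata.X, np.ndarray) else adata.X.toarray()
adata.layers["binned"] = np.vstack([quantile_bin(x) for x in X])
```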

Hyperparameter searches are performed on both validation datasets. Tokenization is done by aggregating gene expression into quantile bins and embedding gene IDs. We apply a dynamic learning-rate scheme to ensure scalable learning behavior. At the start of pre-training, we apply a linear warmup of the learning rate with a start factor of 1e-5 for the first 10% of the total steps. After the peak learning rate is reached, we apply cosine annealing for the remaining training steps to promote generalization and avoid forgetting.
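The warmup-plus-cosine schedule described above can be reproduced with standard PyTorch schedulers. The sketch below only illustrates the shape of the schedule; the total step count, the optimizer choice, and the placeholder module are assumptions, not values from the GREmLN training setup.

```python
import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = torch.nn.Linear(512, 512)                           # placeholder module
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # peak learning rate

total_steps = 100_000                   # assumed; not stated in the card
warmup_steps = int(0.10 * total_steps)  # linear ramp for 10% of total steps

schedule = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=1e-5, total_iters=warmup_steps),
        CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps),
    ],
    milestones=[warmup_steps],
)

for step in range(total_steps):
    # ... forward pass, loss.backward(), and optimizer.step() would go here ...
    schedule.step()  # advance the learning rate once per training step
```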

Training Hyperparameters

  • Peak learning rate: 3e-4
  • Batch size: 16
  • Chebyshev truncation index (K): 64
  • Diffusion rate (β): 0.1
  • Number of quadrature points (N): 100
  • Graph integration false-edge threshold (α): 0.05
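To make the last three hyperparameters concrete, the sketch below shows one standard way a graph diffusion kernel exp(-βL) can be applied to a node signal using a Chebyshev expansion truncated at order K, with coefficients computed from N quadrature points. This is an illustrative reading of how such parameters typically interact in graph signal processing, not code from the GREmLN repository.

```python
import numpy as np

def chebyshev_diffusion(L, x, beta=0.1, K=64, N=100):
    """Approximate y = exp(-beta * L) @ x with a truncated Chebyshev series.

    L    : symmetric graph Laplacian (n x n)
    x    : signal on the nodes (n,)
    beta : diffusion rate
    K    : Chebyshev truncation index
    N    : number of quadrature points for the coefficients
    """
    # Rescale the spectrum of L from [0, lam_max] to [-1, 1].
    lam_max = np.linalg.eigvalsh(L).max()
    L_tilde = (2.0 / lam_max) * L - np.eye(L.shape[0])

    # Chebyshev-Gauss quadrature for the coefficients of
    # g(t) = exp(-beta * lam_max * (t + 1) / 2) on [-1, 1].
    theta = np.pi * (np.arange(N) + 0.5) / N
    g = np.exp(-beta * lam_max * (np.cos(theta) + 1.0) / 2.0)
    coeffs = np.array([(2.0 / N) * np.sum(g * np.cos(k * theta))
                       for k in range(K + 1)])

    # Three-term recurrence: T_{k+1}(L~) x = 2 L~ T_k(L~) x - T_{k-1}(L~) x.
    t_prev, t_curr = x, L_tilde @ x
    y = 0.5 * coeffs[0] * t_prev + coeffs[1] * t_curr
    for k in range(2, K + 1):
        t_next = 2.0 * (L_tilde @ t_curr) - t_prev
        y += coeffs[k] * t_next
        t_prev, t_curr = t_curr, t_next
    return y

# Tiny usage example: diffuse a unit impulse over a 5-node path graph.
A = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
L = np.diag(A.sum(axis=1)) - A
signal = np.zeros(5)
signal[0] = 1.0
diffused = chebyshev_diffusion(L, signal)
```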

Data Sources

All training data was sourced from CZ CELLxGENE (Census release: 2024-07-01). Data curation was performed following the steps outlined for scGPT (see the instructions in the scGPT GitHub repository).

Performance Metrics

Metrics

GREmLN was evaluated using a range of benchmarks to measure its performance. More specifically, we evaluated two key tasks: cell type annotation and gene regulatory network structure understanding. Cell type annotation was evaluated using precision, recall, macro-F1, and accuracy. The model's understanding of gene regulatory networks was assessed using the ROC curve, AUROC, the precision-recall curve, and average precision.
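For reference, these are standard metrics that can be computed with scikit-learn. The snippet below is a generic illustration with toy labels and scores, not the evaluation code used for the preprint.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_recall_fscore_support,
                             roc_auc_score, average_precision_score)

# Cell type annotation: compare predicted labels to reference annotations.
y_true = np.array([0, 1, 2, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0])
precision, recall, macro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
accuracy = accuracy_score(y_true, y_pred)

# GRN structure understanding: score predicted edges against a reference network.
edge_labels = np.array([1, 0, 1, 1, 0, 0])  # 1 = edge present in reference GRN
edge_scores = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.3])
auroc = roc_auc_score(edge_labels, edge_scores)
avg_precision = average_precision_score(edge_labels, edge_scores)
```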

Evaluation Datasets

GREmLN's performance was evaluated using the following datasets:

  • Human Immune Cells: This dataset includes cells from both human bone marrow and peripheral blood, covering 16 distinct immune cell types. We used the version provided by scGPT (download from https://figshare.com/ndownloader/files/25717328), with original cell type annotations from Luecken et al. 2022.
  • Cancer Infiltrating Myeloid: This dataset comprises tumor-infiltrating myeloid cells (TIMs) from nine different human cancer types. TIMs are key regulators of tumor progression, with properties and functions that vary across cancer types. The dataset is available at: https://drive.google.com/drive/folders/1VbpApQufZq8efFGakW3y8QDDpY9MBoDS.
  • Held Out Non-Immune Cells: This dataset consists of 10 non-immune cell types that our model did not see during pre-training and validation. The dataset was sourced from the CELLxGENE corpus (available at: https://cellxgene.cziscience.com/collections). More specifically, the selected cell types include retinal rod, epithelial, perivascular, amacrine, respiratory basal, smooth muscle, neural progenitor, secretory, myofibroblast, and megakaryocyte, which represent the cell types with the largest number of cells passing QC. Refer to the preprint for details.

Evaluation Results

See preprint for details.

Biases, Risks, and Limitations

Potential Biases

  • Embedding quality depends on gene regulatory network structure and quality

Risks

Areas of risk may include but are not limited to:

  • Inaccurate outputs or hallucinations
  • Potential misuse for incorrect biological interpretations

Limitations

  • We use the graph structure as an inductive bias to reduce the learning space; therefore, the graph structure must remain static during training and evaluation.
  • GREmLN's embedding quality is sensitive to network quality.
  • There are spectral constraints in the current framework because the graph adjacency matrix needs to be symmetric.
  • See preprint for more details about model limitations and solutions planned for future versions.

Caveats and Recommendations

  • GREmLN currently represents an early release of the pre-training architecture.
  • Review and validate outputs generated by the model.
  • We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when using the model.
  • Should you have any security or privacy issues or questions related to this model, please reach out to our team at security@chanzuckerberg.com or privacy@chanzuckerberg.com respectively.

Acknowledgements

The model is a collective effort by researchers at Columbia University, CZ Biohub NY, and CZI. We acknowledge the efforts of Vinay Swamy, Rowan Cassius, Léo Dupire, and Evan Paull, who are the main contributors to the code base and validation experiments. We are grateful for guidance from Drs. Theo Karaletsos and Andrea Califano, the senior authors of the model.

If you have recommendations for this model card please contact virtualcellmodels@chanzuckerberg.com.