Quick Start: AIDO.Cell
Estimated time to complete: under 10 minutes (A100 GPU system)
Google Colab Note: This notebook requires an A100 GPU, which is only available with the Google Colab Pro or Enterprise paid services. Alternatively, a "pay as you go" option is available for purchasing premium GPUs. See Colab Service Plans for details.
Learning Goals
- Install ModelGenerator, a plug-and-play framework for using AIDO.Cell models
- Download a single-cell RNA dataset from the Gene Expression Omnibus (GEO) repository
- Preprocess data
- Generate embeddings using the pre-trained AIDO.Cell-3M model
Pre-requisites
- A100 GPU or equivalent
- Python 3.10 or 3.11
Introduction
Model
The AIDO.Cell models are a family of scalable transformer-based models trained on 50 million cells spanning a diverse set of human tissues and organs. The models aim to learn accurate and general representations of the human cell's entire transcriptional context and can be used for various tasks including zero-shot clustering, cell type classification, and perturbation modeling. This quickstart uses AIDO.Cell-3M, the smallest variant of the AIDO.Cell family, to embed single-cell RNA data.
AIDO.Cell was designed for use with the ModelGenerator CLI, and it is strongly recommended to use ModelGenerator for running AIDO.Cell models. For more information, check out the ModelGenerator repository (https://github.com/genbio-ai/ModelGenerator).
Example Dataset
The GEO dataset used in this quickstart includes single-cell RNA data obtained from colon biopsies collected from patients with ulcerative colitis (UC) and Crohn's disease (CD). The dataset also includes samples from healthy controls (HC).
Setup
The steps below install the ModelGenerator package and its dependencies, and download the example dataset and model checkpoint. It may take a few minutes to download all the files.
Setup Google Colab
To run this quickstart using Google Colab, you will need to choose the 'A100' GPU runtime from the "Connect" dropdown menu in the upper-right corner of this notebook. Note that this runtime configuration is not available in the free Colab tier. To access premium GPUs, you will need to purchase additional compute units. The current quickstart was tested in Colab Enterprise using the following runtime configuration:
- Machine type: a2-highgpu-1g
- GPU type: NVIDIA_TESLA_A100 x 1
- Data disk type: 100 GB Standard Disk (pd-standard)
Setup Local Environment
ModelGenerator is an open-source, plug-and-play software stack for running AIDO.Cell models. It automatically interfaces with Hugging Face and allows easy one-command embedding and adaptation of the models for a wide variety of fine-tuning tasks. To run ModelGenerator, the GPU must be Ampere-generation or later to support flash attention (e.g., A100, H100). An optional GPU check is shown below.
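Before installing, you can optionally confirm that the attached GPU meets this requirement. This is a minimal sketch, assuming PyTorch is already available in the environment (as it is on Colab GPU runtimes); Ampere-generation and newer GPUs report a compute capability of 8.0 or higher.

import torch
# Optional sanity check: flash attention requires an Ampere-generation GPU or newer
# (compute capability >= 8.0, e.g., A100 or H100)
major, minor = torch.cuda.get_device_capability()
print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")
assert major >= 8, "AIDO.Cell with flash attention needs an Ampere-generation GPU or newer"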
Step 1: Install ModelGenerator and required dependencies
!git clone https://github.com/genbio-ai/ModelGenerator.git
%cd ModelGenerator
!pip install -e ".[flash_attn]"
!pip install -r experiments/AIDO.Cell/requirements.txt
%cd experiments/AIDO.Cell
Restart the session after installing, then navigate back to the AIDO.Cell directory:
%cd ModelGenerator/experiments/AIDO.Cell
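To confirm the editable install survived the session restart, a quick import check (optional, purely a sanity check) is:

# Verify that the modelgenerator package is importable after the restart
import modelgenerator
print(modelgenerator.__file__)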
Step 2: Download example dataset from GEO and load into anndata
%%bash
mkdir -p data
cd data
wget -nv -O GSE214695.tar 'http://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE214695&format=file'
tar -xvf GSE214695.tar
cd ..
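After extraction, the data directory should contain per-sample 10x files (barcodes, features, and matrix) whose names begin with a GSM accession prefix. The optional listing below assumes the file-naming pattern used in the next step; adjust the prefix if your files differ.

import os
# List the extracted files for the healthy control sample used below;
# scanpy's read_10x_mtx expects <prefix>barcodes.tsv.gz, <prefix>features.tsv.gz,
# and <prefix>matrix.mtx.gz in the same directory.
for fname in sorted(os.listdir('data')):
    if fname.startswith('GSM6614348_HC-1_'):
        print(fname)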
import anndata as ad
import scanpy as sc
# Load one sample (healthy control HC-1) from the extracted 10x files
adata = sc.read_10x_mtx('data', prefix='GSM6614348_HC-1_')
# Basic quality filtering: keep cells with >= 500 detected genes and genes seen in >= 3 cells
sc.pp.filter_cells(adata, min_genes=500)
sc.pp.filter_genes(adata, min_cells=3)
# No normalization needed; AIDO.Cell uses raw counts
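As a quick sanity check before preprocessing, you can print the filtered AnnData object and confirm that `adata.X` still holds raw (integer) counts, for example:

# Cells x genes after filtering, plus a quick look at the count values
print(adata)
print('Max count value:', adata.X.max())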
Step 3: Preprocess the anndata for AIDO.Cell
import cell_utils
aligned_adata, attention_mask = cell_utils.align_adata(adata)
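`align_adata` maps the dataset onto the fixed gene panel that AIDO.Cell was trained on and returns an attention mask over that panel. A minimal inspection, assuming `attention_mask` is a per-gene array (as it is used in Step 4), might look like:

import numpy as np
# aligned_adata has one column per gene in AIDO.Cell's fixed panel;
# attention_mask marks which of those genes are actually present in this dataset
print('Genes before alignment:', adata.n_vars)
print('Genes after alignment: ', aligned_adata.n_vars)
print('Genes kept by the mask:', int(np.asarray(attention_mask).sum()))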
Step 4: Generate AIDO.Cell embeddings
# Embed
import anndata as ad
import numpy as np
import torch
import sys
from modelgenerator.tasks import Embed
# The following is equivalent to the ModelGenerator CLI command:
# mgen predict --model Embed --model.backbone aido_cell_3m \
# --data CellClassificationDataModule --data.test_split_files <your_anndata>.h5ad
# If not using mgen, this should be configured manually.
device = 'cuda'
batch_size = 2
model = Embed.from_config({
"model.backbone": "aido_cell_3m",
"model.batch_size": batch_size
}).eval()
model = model.to(device).to(torch.float16)
# Input data must be cast to the same dtype as the model (float16 here)
batch_np = aligned_adata[:batch_size].X.toarray()
batch_tensor = torch.from_numpy(batch_np).to(torch.float16).to(device)
# Call transform and embed.
batch_transformed = model.transform({'sequences': batch_tensor})
embs = model(batch_transformed)
# Full Embeddings
print('FULL EMBEDDING')
print('(batch_size, genes, embedding_dim)')
print(embs.shape)
print(embs)
print('-------------------------------------')
# Non-Zero Genes Embeddings
print('NON-ZERO GENES EMBEDDING')
embs = embs[:, attention_mask.astype(bool), :]
print('(batch_size, genes, embedding_dim)')
print(embs.shape)
print(embs)
Output:
FULL EMBEDDING
(batch_size, genes, embedding_dim)
torch.Size([2, 19264, 128])
tensor([[[-2.0430, 0.4229, -1.6641, ..., -0.9346, 0.3691, 1.6074],
[-0.6450, -1.9004, -2.7969, ..., -1.5557, 0.9419, -0.5210],
[-1.0693, -1.5303, -0.9526, ..., -0.6470, 0.6484, 0.8975],
...,
[ 0.5708, -1.8574, -2.6406, ..., -0.3594, -0.2087, 0.9453],
[ 0.0121, 0.0419, 0.3096, ..., -0.4370, 1.3516, -0.4097],
[-1.1113, -1.5303, -1.0635, ..., -1.0801, 1.4648, -0.9688]],
[[-2.2988, 1.0430, -2.3164, ..., -0.2478, 0.5171, 0.1464],
[-0.8042, -1.9922, -2.7480, ..., -1.4678, 0.6299, -0.7510],
[-0.0687, -2.2207, -0.0922, ..., -1.4395, 0.0156, 0.8447],
...,
[ 0.0627, -1.3369, -2.4355, ..., -0.0134, 0.0335, 1.0449],
[ 0.1595, 0.0429, 0.3174, ..., -0.1583, 1.0918, -0.3188],
[-0.6709, -1.0010, -1.5508, ..., -1.0186, 0.9917, -0.7573]]],
device='cuda:0', dtype=torch.float16, grad_fn=<SliceBackward0>)
-------------------------------------
NON-ZERO GENES EMBEDDING
(batch_size, genes, embedding_dim)
torch.Size([2, 13427, 128])
tensor([[[-2.0430, 0.4229, -1.6641, ..., -0.9346, 0.3691, 1.6074],
[-0.6450, -1.9004, -2.7969, ..., -1.5557, 0.9419, -0.5210],
[-1.0693, -1.5303, -0.9526, ..., -0.6470, 0.6484, 0.8975],
...,
[ 0.0761, -0.1423, -1.4922, ..., -1.0195, 1.3799, 0.8159],
[ 0.5708, -1.8574, -2.6406, ..., -0.3594, -0.2087, 0.9453],
[ 0.0121, 0.0419, 0.3096, ..., -0.4370, 1.3516, -0.4097]],
[[-2.2988, 1.0430, -2.3164, ..., -0.2478, 0.5171, 0.1464],
[-0.8042, -1.9922, -2.7480, ..., -1.4678, 0.6299, -0.7510],
[-0.0687, -2.2207, -0.0922, ..., -1.4395, 0.0156, 0.8447],
...,
[-0.0815, -0.3008, -1.0361, ..., -0.9136, 1.6484, 0.5752],
[ 0.0627, -1.3369, -2.4355, ..., -0.0134, 0.0335, 1.0449],
[ 0.1595, 0.0429, 0.3174, ..., -0.1583, 1.0918, -0.3188]]],
device='cuda:0', dtype=torch.float16, grad_fn=<IndexBackward0>)
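The gene-level embeddings above can be reduced to one vector per cell for downstream analysis such as clustering or cell type classification. Mean pooling over the non-zero gene dimension is one simple option; the sketch below is just one pooling choice, not the only strategy.

# Mean-pool the non-zero gene embeddings into one vector per cell
cell_embs = embs.mean(dim=1)                      # (batch_size, embedding_dim)
cell_embs = cell_embs.float().detach().cpu().numpy()
print(cell_embs.shape)                            # e.g., (2, 128)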
Contacts and Acknowledgements
For issues with this tutorial please contact virtualcellmodels@chanzuckerberg.com or Caleb Ellington at caleb.ellington@genbio.ai.
Thanks to Caleb Ellington, all the AIDO.Cell model developers and the GenBio AI team for creating and supporting this resource.
Responsible Use
We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when engaging with our services.