Rbio

Version v1.0 released 20 Aug 2025

License

Developed By

Ana-Maria Istrate, Fausto Milletari, Fabrizio Castrotorres, Jakub Tomczak, Michaela Torkar, Donghui Li, Theofanis Karaletsos (Chan Zuckerberg Initiative)

Rbio is a conversational large language model designed to perform biological reasoning and make scientific knowledge more accessible. It is built upon the Qwen transformer architecture and is post-trained using reinforcement learning (GRPO) with a novel 'soft supervision' paradigm. This approach uses world models of biology (e.g., TranscriptFormer), knowledge sources such as the Gene Ontology, or experimental data as verifiers to ground the model's reasoning in established biology, allowing it to answer questions and explore hypotheses through natural language.

Model Details

Finetuned From Model

Built with Qwen2.5-3B-Instruct, which is licensed under the Qwen Research License.

Model Architecture

Rbio uses a standard decoder-only transformer architecture from its Qwen base model. The innovation lies in the training methodology, which employs reinforcement learning (specifically GRPO) and specialized verifiers to guide the rewards during RL training.

Parameters

The model is built on Qwen2.5-3B-Instruct, which has 3 billion parameters.

Citation

Istrate et al. (2025). rbio-1: training scientific reasoning LLMs with biological world models as soft verifiers. bioRxiv. DOI: https://doi.org/10.1101/2025.08.18.670981

Model Card Authors

Ana-Maria Istrate, Fausto Milletari, Fabrizio Castrotorres, Jakub Tomczak, Michaela Torkar, Donghui Li, Theofanis Karaletsos (Chan Zuckerberg Initiative)

Primary Contact Email

virtualcellmodels@chanzuckerberg.com

System requirements

The model may be deployable on consumer-grade GPUs. Hardware requirements for inference depend on the model variant and desired precision (quantization). The following are general estimates; a loading sketch follows the list.

Rbio (3B variant):

  • Full Precision (FP16/BF16): Requires a GPU with at least 10-12 GB of VRAM (e.g., NVIDIA RTX 3060 12GB, RTX 4070).
  • Quantized (4-bit/8-bit): Can run on GPUs with 6-8 GB of VRAM. CPU inference is also feasible with sufficient system RAM (16 GB+ recommended).
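As a minimal loading sketch, assuming a variant has already been downloaded locally in Hugging Face format (the local path below is illustrative; see the Model Variants section for download locations):

```python
# Minimal loading sketch for the 3B model; the local path is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_path = "./rbio1-GO"  # directory containing a downloaded variant

tokenizer = AutoTokenizer.from_pretrained(model_path)

# Full precision (BF16), roughly the 10-12 GB VRAM regime:
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# 4-bit alternative for 6-8 GB GPUs (requires the bitsandbytes package):
# quant_config = BitsAndBytesConfig(load_in_4bit=True,
#                                   bnb_4bit_compute_dtype=torch.bfloat16)
# model = AutoModelForCausalLM.from_pretrained(
#     model_path, quantization_config=quant_config, device_map="auto")
```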

Model Variants

Rbio includes several variants based on the type of data or model used as a verifier during reinforcement learning. A download sketch follows the list of variants.

  • Rbio1-EXP: Post-trained using direct experimental data as a "hard verifier" for maximum accuracy on related tasks. Download: s3://czi-rbio/rbio1-EXP/
  • Rbio1-MLP: Post-trained using a task-specific MLP as a "soft verifier", demonstrating knowledge transfer from a smaller world model. Download: s3://czi-rbio/rbio1-MLP/
  • Rbio1-TF: Post-trained using signals (e.g., PMI scores) from the TranscriptFormer foundation model as a "soft verifier". Download: s3://czi-rbio/rbio1-TF/
  • Rbio1-GO: Post-trained using the Gene Ontology (GO) knowledge base as a "soft verifier", guiding the model with established biological facts. Download: s3://czi-rbio/rbio1-GO/
  • Rbio1-GO-C: Post-trained using the Gene Ontology (GO) knowledge base as a "soft verifier", guiding the model with established biological facts via the Rouge-C metric. Download: s3://czi-rbio/rbio1-GO-C/
  • Rbio1-GO-F: Post-trained using the Gene Ontology (GO) knowledge base as a "soft verifier", guiding the model with established biological facts via the Rouge-F metric. Download: s3://czi-rbio/rbio1-GO-F/
  • Rbio1-GO+EXP: Post-trained using both experimental data as a "hard verifier" on the task at hand and the Gene Ontology (GO) knowledge base as a "soft verifier" for consistency with biological facts. Download: s3://czi-rbio/rbio1-GO+EXP/
  • Rbio1-TF+EXP: Post-trained using both experimental data as a "hard verifier" on the task at hand and the TranscriptFormer foundation model as a "soft verifier" via PMI scores. Download: s3://czi-rbio/rbio1-TF+EXP/
  • Rbio1-TF+GO+EXP: Post-trained using experimental data as a "hard verifier" on the task at hand, the TranscriptFormer foundation model as a "soft verifier" via PMI scores, and the Gene Ontology (GO) knowledge base as a "soft verifier" for consistency with biological facts. Download: s3://czi-rbio/rbio1-TF+GO+EXP/
  • Rbio1-TF+GO+MLP: Post-trained using an MLP as a "soft verifier" of world knowledge as seen through the lens of a smaller model, the TranscriptFormer foundation model as a "soft verifier" via PMI scores, and the Gene Ontology (GO) knowledge base as a "soft verifier" for consistency with biological facts. Download: s3://czi-rbio/rbio1-TF+GO+MLP/
  • Rbio1-TF+GO+MLP+EXP: Post-trained using experimental data as a "hard verifier" on the task at hand, the TranscriptFormer foundation model as a "soft verifier" via PMI scores, the Gene Ontology (GO) knowledge base as a "soft verifier" for consistency with biological facts, and an MLP as a "soft verifier" of world knowledge rendered via a smaller model. Download: s3://czi-rbio/rbio1-TF+GO+MLP+EXP/
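The variants are distributed via the S3 paths above. Below is a sketch of one way to fetch them with boto3; it assumes the czi-rbio bucket allows anonymous (unsigned) reads and that us-west-2 is its region, neither of which is confirmed here. The AWS CLI (aws s3 sync s3://czi-rbio/rbio1-GO/ ./rbio1-GO/) is an equivalent route.

```python
# Download sketch using boto3; bucket region and anonymous access are assumptions.
import os

import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", region_name="us-west-2", config=Config(signature_version=UNSIGNED))
bucket, prefix, dest = "czi-rbio", "rbio1-GO/", "./rbio1-GO"

# List every object under the variant's prefix and mirror it locally.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        relative = obj["Key"][len(prefix):]
        if not relative:  # skip the prefix placeholder itself
            continue
        local_path = os.path.join(dest, relative)
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        s3.download_file(bucket, obj["Key"], local_path)
```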

Intended Use

Primary Use Cases (an example prompt sketch follows this list):

  • Perturbation prediction (e.g., predicting the effect of one gene on another).
  • Hypothesis generation for scientific research.
  • Interactive exploration of biological knowledge through conversational dialogue.
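As an illustration of conversational use for perturbation prediction, reusing the tokenizer and model from the loading sketch above; the prompt wording and gene placeholders are hypothetical, not the evaluation prompts from the paper:

```python
# Hypothetical perturbation-prediction prompt; gene names are placeholders.
messages = [
    {
        "role": "user",
        "content": (
            "In K562 cells, would knocking down gene X be expected to change "
            "the expression of gene Y? Think step by step, then answer yes or no."
        ),
    },
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```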

Out-of-Scope or Unauthorized Use Cases:

Do not use the model for the following purposes:

  • Use that violates applicable laws, regulations (including trade compliance laws), or third-party rights such as privacy or intellectual property rights.
  • Any use that is prohibited by the license.
  • Any use that is prohibited by the Acceptable Use Policy.
  • Making clinical diagnoses or providing treatment recommendations. The model is intended for research and informational purposes only.

Training Details

Training Date

June - August, 2025

Training Data

The model was trained using a combination of data sources:

  • Hard Verification Data: Experimental data from multiple single-cell perturbation screening datasets, including cell lines K562, RPE1, HEPG2, and Jurkat from the PerturbQA benchmark (arXiv:2502.21290 [cs.AI]).
  • Soft Verification Sources: Knowledge was distilled from other biological models and databases, including pointwise mutual information (PMI) scores from TranscriptFormer, embeddings from ESM, and structured knowledge from the Gene Ontology (GO) knowledgebase. (A generic PMI sketch follows this list.)
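For orientation, PMI has the standard form PMI(x, y) = log[p(x, y) / (p(x) p(y))]; how TranscriptFormer derives the underlying gene-pair probabilities is model-specific and not reproduced here. A generic sketch:

```python
# Generic pointwise mutual information; the probabilities would come from a
# model such as TranscriptFormer (not shown here).
import math

def pmi(p_xy: float, p_x: float, p_y: float) -> float:
    """PMI = log(p(x, y) / (p(x) * p(y))): positive when x and y co-occur
    more often than independence would predict, negative when less often."""
    return math.log(p_xy / (p_x * p_y))

print(pmi(0.02, 0.1, 0.1))  # log(2) ~= 0.69: stronger-than-chance association
```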

Training Procedure

The model was post-trained from a pre-trained Qwen2.5-3B-Instruct LLM using reinforcement learning, specifically the GRPO (Group Relative Policy Optimization) algorithm. The core of the training is a novel verification mechanism that provides the reward signal. In "hard" verification, the reward is based directly on experimental outcomes. In "soft" verification, the reward is a score generated by another model (e.g., an MLP or TranscriptFormer) or knowledge base (e.g., GO) that evaluates the plausibility of the LLM's generated reasoning trace.
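Schematically, the reward for a rollout can be pictured as a weighted mix of hard and soft verifier scores. The function names and equal weighting below are hypothetical; the released training code (linked in the next section) is the authoritative reference.

```python
# Schematic reward shaping with hard and soft verifiers (hypothetical names).
from typing import Callable

def hard_reward(answer: str, experimental_label: str) -> float:
    # Hard verification: exact agreement with the experimental outcome.
    return 1.0 if answer.strip().lower() == experimental_label.strip().lower() else 0.0

def combined_reward(
    answer: str,
    reasoning_trace: str,
    experimental_label: str,
    soft_verifier: Callable[[str], float],  # e.g., MLP, TranscriptFormer PMI, or a GO check
    weight: float = 0.5,
) -> float:
    # Soft verification scores the plausibility of the reasoning trace in [0, 1].
    return (1 - weight) * hard_reward(answer, experimental_label) + weight * soft_verifier(reasoning_trace)
```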

Training Code

https://github.com/czi-ai/rbio

Speeds, Sizes, Times

Each model version took roughly 10 days to train on 8 H100 GPUs, with some variation between models.

Training Hyperparameters

Each model was trained for 100k steps on H100 GPUs with batch_size = 4, n_generation = 4, and a default learning rate of 5e-6.
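For orientation only, these hyperparameters could be expressed with the trl library's GRPOConfig as below; whether the released training code uses trl is an assumption, and the repository linked above is authoritative.

```python
# Hypothetical mapping of the reported hyperparameters onto trl's GRPOConfig.
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="rbio1-grpo",        # illustrative path
    max_steps=100_000,              # 100k training steps
    per_device_train_batch_size=4,  # batch_size = 4
    num_generations=4,              # n_generation = 4
    learning_rate=5e-6,             # default learning rate
    bf16=True,                      # typical on H100s (assumption)
)
```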

Data Sources

  • Cell line perturbation data for K562, RPE1, HEPG2, and Jurkat from the PerturbQA benchmark.
  • Gene Ontology (GO) knowledgebase
  • Knowledge distilled from foundation models such as TranscriptFormer and ESM.

Performance Metrics

Metrics

The model was evaluated on perturbation prediction tasks using a comprehensive suite of classification metrics to assess its reasoning capabilities (standard definitions are sketched after the list):

  • Accuracy, Precision, Recall (TPR), F1-Score: To measure overall correctness and the trade-offs between finding true positives and avoiding false positives.
  • Specificity (TNR): To measure the ability to correctly identify true negatives.
  • Baselines: Performance was compared against instruction-tuned baseline models, including Qwen2.5-3B and other state-of-the-art models for perturbation prediction, such as SUMMER and GEARS.
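For reference, the listed metrics follow their standard confusion-matrix definitions:

```python
# Standard binary-classification metrics from confusion-matrix counts.
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # true positive rate (TPR)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),  # true negative rate (TNR)
        "f1": 2 * precision * recall / (precision + recall),
    }
```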

Evaluation Datasets

Held-out test splits from the K562, RPE1, HEPG2, and Jurkat cell-line datasets.

Evaluation Results

  • rbio models trained with soft verification on tasks such as perturbation prediction generalize to out-of-distribution perturbation datasets, showing that task simulation can train models competitive with those trained on hard experimental data.
  • Virtual Cell Models teach rbio off-task biology that transfers to perturbation prediction.
  • Models trained on combinations of biological verifiers show improved generalization.
  • Chain-of-thought prompting at test time elevates rbio models to state-of-the-art performance on PerturbQA.

Further details are described in the preprint https://doi.org/10.1101/2025.08.18.670981.

Biases, Risks, and Limitations

Potential Biases

  • The model may reflect biases present in the training data. Performance is likely to be highest on tasks and cell types similar to those seen during training (K562, RPE1, HEPG2, Jurkat).
  • The model's knowledge is limited to the information contained in its training sources (e.g., Gene Ontology). It may not be aware of very recent discoveries or data from under-represented biological domains.
  • Certain demographic groups may be underrepresented in the genomic datasets used for training foundation models, which could translate to biases in downstream reasoning.

Risks

  • Inaccurate outputs or hallucinations: Like all LLMs, the model can generate outputs that sound plausible but are factually incorrect.
  • Potential misuse for incorrect biological interpretations: The model's outputs could be misinterpreted as confirmed facts, leading to flawed experimental designs or incorrect scientific conclusions if not properly validated.

Limitations

  • The model's performance on biological reasoning tasks outside of perturbation prediction has not been thoroughly evaluated.
  • The model's reasoning is grounded in its training data; it does not perform live experiments or access real-time information.
  • rbio-1 offers a powerful alternative to models trained on experimental data; however, model outputs should be viewed as predictions subject to experimental validation.
  • While the model might output the correct final answer, the reasoning traces might not reveal the full justification; that is, the reasoning traces will not always be aligned with the final answer.

Caveats and Recommendations

  • Always review and validate outputs generated by the model.
  • Treat model outputs as machine-generated hypotheses that require further experimental validation, not as established biological facts.
  • We are committed to advancing the responsible development and use of artificial intelligence. Please follow our Acceptable Use Policy when using the model.

Should you have any security or privacy issues or questions related to this model, please reach out to our team at security@chanzuckerberg.com or privacy@chanzuckerberg.com respectively.

Acknowledgements

The authors wish to thank Maximilian Lombardo and Wyatt Robarts for compiling and reviewing the model card, and Omar Valenzuela for assisting with open-sourcing the models.