--- title: "Reference Conditioning" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Reference Conditioning} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = FALSE # Set to FALSE since API calls require credentials ) ``` ## Overview Reference conditioning models generate expression data **conditioned on a real reference sample**. This allows you to "anchor" to an existing expression profile while applying perturbations or modifications. This is useful when you want to: - Simulate the effect of a perturbation on a specific sample - Generate expression profiles that preserve the biological and technical characteristics of a reference - Create synthetic "treated vs. control" pairs ## Available Models - **`gem-1-bulk_reference-conditioning`**: Bulk RNA-seq reference conditioning model - **`gem-1-sc_reference-conditioning`**: Single-cell RNA-seq reference conditioning model > **Note:** These endpoints may require 1-2 minutes of startup time if they have been scaled down. Plan accordingly for interactive use. ```{r} library(rsynthbio) ``` ## How It Works Reference conditioning encodes the biological and technical characteristics from a real expression sample, then generates new expression data that: 1. Preserves the biological/technical latent space of the reference 2. Applies any perturbation metadata you specify 3. Returns synthetic expression that reflects the perturbation effect on that specific sample ## Creating a Query Reference conditioning queries require different inputs than baseline models: ```{r query-example, eval=FALSE} # Get the example query structure example_query <- get_example_query(model_id = "gem-1-bulk_reference-conditioning")$example_query # Inspect the query structure str(example_query) ``` The query structure includes: 1. **`inputs`**: A list where each input contains: - **`counts`**: The reference expression counts (a named list with a `counts` vector) - **`metadata`**: Perturbation-only metadata (see below) - **`num_samples`**: How many samples to generate 2. **`conditioning`**: Which latent spaces to condition on (typically `["biological", "technical"]`) 3. **`sampling_strategy`**: `"mean estimation"` or `"sample generation"` ### Perturbation-Only Metadata Unlike baseline models, reference conditioning queries only accept perturbation metadata fields: - `perturbation_ontology_id` - `perturbation_type` - `perturbation_time` - `perturbation_dose` All other biological and technical metadata is inferred from the reference expression. ## Example: Simulating a Drug Treatment Here's a complete example simulating a drug treatment effect on a reference sample: ```{r drug-treatment, eval=FALSE} # Start with example query structure query <- get_example_query(model_id = "gem-1-bulk_reference-conditioning")$example_query # Replace with your actual reference counts # The counts vector must match the model's expected gene order and length query$inputs[[1]]$counts <- list(counts = your_reference_counts) # Specify the perturbation query$inputs[[1]]$metadata <- list( perturbation_ontology_id = "CHEMBL25", # Aspirin (ChEMBL ID) perturbation_type = "compound", perturbation_time = "24h", perturbation_dose = "10uM" ) query$inputs[[1]]$num_samples <- 3 # Set the sampling strategy query$sampling_strategy <- "mean estimation" # Submit the query result <- predict_query(query, model_id = "gem-1-bulk_reference-conditioning") ``` ## Example: CRISPR Knockout Simulation Simulate the effect of knocking out a specific gene: ```{r crispr-example, eval=FALSE} query <- get_example_query(model_id = "gem-1-bulk_reference-conditioning")$example_query # Your reference sample counts query$inputs[[1]]$counts <- list(counts = control_sample_counts) # CRISPR knockout of TP53 query$inputs[[1]]$metadata <- list( perturbation_ontology_id = "ENSG00000141510", # TP53 Ensembl ID perturbation_type = "crispr" ) query$inputs[[1]]$num_samples <- 5 result <- predict_query(query, model_id = "gem-1-bulk_reference-conditioning") ``` ## Query Parameters ### conditioning (list, optional) Controls which latent spaces are conditioned on the reference. Default is `["biological", "technical"]`. When both are conditioned, the model preserves both biological identity and technical characteristics from the reference sample. ### sampling_strategy (character, required) Controls the type of prediction: - **"sample generation"**: Generates realistic-looking synthetic data with measurement error. **(Bulk only)** - **"mean estimation"**: Provides stable mean estimates. **(Bulk and single-cell)** ```{r mode-example, eval=FALSE} query$sampling_strategy <- "mean estimation" ``` ### fixed_total_count (logical, optional) Controls whether to preserve the reference's library size: - **`FALSE`** (default): The output's total count is taken from the reference expression (sum of its counts). Use this when you want the synthetic sample to preserve the reference's library size. - **`TRUE`**: Forces the model to use the `total_count` parameter value (or default) instead of the reference's library size. ```{r fixed-total-count, eval=FALSE} # Preserve reference library size (default) query$fixed_total_count <- FALSE # Or force a specific library size query$fixed_total_count <- TRUE query$total_count <- 10000000 ``` ### total_count (integer, optional) Library size used when converting predicted log CPM back to raw counts. Only effective when `fixed_total_count = TRUE`. - Default: 10,000,000 for bulk; 10,000 for single-cell ### deterministic_latents (logical, optional) If `TRUE`, the model uses the mean of each latent distribution (`p(z|metadata)` for perturbation, `q(z|x)` for conditioned components) instead of sampling. This produces deterministic, reproducible outputs. - Default: `FALSE` ```{r deterministic-example, eval=FALSE} query$deterministic_latents <- TRUE ``` ### seed (integer, optional) Random seed for reproducibility. ```{r seed-example, eval=FALSE} query$seed <- 42 ``` ## Valid Perturbation Metadata | Field | Description / Format | |----------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | `perturbation_ontology_id` | Ensembl gene ID (e.g., `ENSG00000141510`), [ChEBI ID](https://www.ebi.ac.uk/chebi/), [ChEMBL ID](https://www.ebi.ac.uk/chembl/), or [NCBI Taxonomy ID](https://www.ncbi.nlm.nih.gov/taxonomy) | | `perturbation_type` | One of: "coculture", "compound", "control", "crispr", "genetic", "infection", "other", "overexpression", "peptide or biologic", "shrna", "sirna" | | `perturbation_time` | Time since perturbation (e.g., "24h", "48h") | | `perturbation_dose` | Dose of perturbation (e.g., "10uM", "1mg/kg") | ## Working with Results The result structure is similar to baseline models: ```{r results, eval=FALSE} # Access metadata and expression matrices metadata <- result$metadata expression <- result$expression # Compare to your reference dim(expression) head(metadata) ``` ### Differential Expression When conditioning on both biological and technical latents, you can directly compare the generated expression to your reference to identify perturbation effects: ```{r de-analysis, eval=FALSE} # Your reference (input) counts reference_cpm <- your_reference_counts / sum(your_reference_counts) * 1e6 # Generated (perturbed) counts generated_cpm <- expression[1, ] / sum(expression[1, ]) * 1e6 # Log fold change log2fc <- log2(generated_cpm + 1) - log2(reference_cpm + 1) # Identify top changed genes head(sort(log2fc, decreasing = TRUE), 20) ``` ## Important Notes ### Counts Vector Length The reference counts vector must match the model's expected number of genes. If the length doesn't match, the API will return a validation error. Use `get_example_query()` to see the expected structure and ensure your counts vector has the correct length. ### Gene Order Ensure your reference counts are in the same gene order expected by the model. The response includes a `gene_order` field that specifies the expected order.