---
title: "Reference Conditioning"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Reference Conditioning}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE # Set to FALSE since API calls require credentials
)
```

## Overview

Reference conditioning models generate expression data **conditioned on a real reference sample**. This allows you to "anchor" to an existing expression profile while applying perturbations or modifications.

This is useful when you want to:

- Simulate the effect of a perturbation on a specific sample
- Generate expression profiles that preserve the biological and technical characteristics of a reference
- Create synthetic "treated vs. control" pairs

## Available Models

- **`gem-1-bulk_reference-conditioning`**: Bulk RNA-seq reference conditioning model
- **`gem-1-sc_reference-conditioning`**: Single-cell RNA-seq reference conditioning model

> **Note:** These endpoints may require 1-2 minutes of startup time if they have been scaled down. Plan accordingly for interactive use.

```{r}
library(rsynthbio)
```

## How It Works

Reference conditioning encodes the biological and technical characteristics from a real expression sample, then generates new expression data that:

1. Preserves the biological/technical latent space of the reference
2. Applies any perturbation metadata you specify
3. Returns synthetic expression that reflects the perturbation effect on that specific sample

## Creating a Query

Reference conditioning queries require different inputs than baseline models:

```{r query-example, eval=FALSE}
# Get the example query structure
example_query <- get_example_query(model_id = "gem-1-bulk_reference-conditioning")$example_query

# Inspect the query structure
str(example_query)
```

The query structure includes:

1. **`inputs`**: A list where each input contains:
   - **`counts`**: The reference expression counts (a named list with a `counts` vector)
   - **`metadata`**: Perturbation-only metadata (see below)
   - **`num_samples`**: How many samples to generate

2. **`conditioning`**: Which latent spaces to condition on (typically `["biological", "technical"]`)

3. **`sampling_strategy`**: `"mean estimation"` or `"sample generation"`
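
To make the shape above concrete, here is a minimal sketch of a query assembled by hand as a nested R list. The five-gene counts vector and its values are placeholders chosen for illustration, and the exact field layout is an assumption based on the description above; the output of `get_example_query()` remains the authoritative template.

```{r query-structure-sketch, eval=FALSE}
# Illustrative only: placeholder counts for a hypothetical 5-gene model.
toy_counts <- c(120, 0, 35, 880, 4)

query_sketch <- list(
  inputs = list(
    list(
      # Reference expression: a named list holding the counts vector
      counts = list(counts = toy_counts),
      # Perturbation-only metadata
      metadata = list(
        perturbation_ontology_id = "CHEMBL25",
        perturbation_type = "compound"
      ),
      # Number of synthetic samples to request for this input
      num_samples = 2
    )
  ),
  conditioning = list("biological", "technical"),
  sampling_strategy = "mean estimation"
)

# Print the nesting without dumping every value
str(query_sketch, max.level = 3)
```

In practice it is usually safer to start from `get_example_query()` and overwrite individual fields, as the examples in this vignette do.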

### Perturbation-Only Metadata

Unlike baseline models, reference conditioning queries only accept perturbation metadata fields:

- `perturbation_ontology_id`
- `perturbation_type`
- `perturbation_time`
- `perturbation_dose`

All other biological and technical metadata is inferred from the reference expression.
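
A small local check can catch unsupported fields before a query is sent. The sketch below is one possible way to do this; `check_perturbation_metadata()` is a hypothetical helper written for this vignette, not a function exported by `rsynthbio`.

```{r metadata-check-sketch, eval=FALSE}
# The four fields listed above
allowed_fields <- c(
  "perturbation_ontology_id",
  "perturbation_type",
  "perturbation_time",
  "perturbation_dose"
)

# Hypothetical helper (not part of rsynthbio): stops if the metadata list
# contains any field outside the allowed set, otherwise returns it invisibly.
check_perturbation_metadata <- function(metadata) {
  extra <- setdiff(names(metadata), allowed_fields)
  if (length(extra) > 0) {
    stop("Unsupported metadata fields: ", paste(extra, collapse = ", "))
  }
  invisible(metadata)
}

# Passes silently: only perturbation fields are present
check_perturbation_metadata(list(
  perturbation_ontology_id = "ENSG00000141510",
  perturbation_type = "crispr"
))
```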

## Example: Simulating a Drug Treatment

Here's a complete example simulating a drug treatment effect on a reference sample:

```{r drug-treatment, eval=FALSE}
# Start with example query structure
query <- get_example_query(model_id = "gem-1-bulk_reference-conditioning")$example_query

# Replace with your actual reference counts
# The counts vector must match the model's expected gene order and length
query$inputs[[1]]$counts <- list(counts = your_reference_counts)

# Specify the perturbation
query$inputs[[1]]$metadata <- list(
  perturbation_ontology_id = "CHEMBL25", # Aspirin (ChEMBL ID)
  perturbation_type = "compound",
  perturbation_time = "24h",
  perturbation_dose = "10uM"
)

query$inputs[[1]]$num_samples <- 3

# Set the sampling strategy
query$sampling_strategy <- "mean estimation"

# Submit the query
result <- predict_query(query, model_id = "gem-1-bulk_reference-conditioning")
```

## Example: CRISPR Knockout Simulation

Simulate the effect of knocking out a specific gene:

```{r crispr-example, eval=FALSE}
query <- get_example_query(model_id = "gem-1-bulk_reference-conditioning")$example_query

# Your reference sample counts
query$inputs[[1]]$counts <- list(counts = control_sample_counts)

# CRISPR knockout of TP53
query$inputs[[1]]$metadata <- list(
  perturbation_ontology_id = "ENSG00000141510", # TP53 Ensembl ID
  perturbation_type = "crispr"
)

query$inputs[[1]]$num_samples <- 5

result <- predict_query(query, model_id = "gem-1-bulk_reference-conditioning")
```

## Query Parameters

### conditioning (list, optional)

Controls which latent spaces are conditioned on the reference. Default is `["biological", "technical"]`.

When both are conditioned, the model preserves both biological identity and technical characteristics from the reference sample.

### sampling_strategy (character, required)

Controls the type of prediction:

- **"sample generation"**: Generates realistic-looking synthetic data with measurement error. **(Bulk only)**
- **"mean estimation"**: Provides stable mean estimates. **(Bulk and single-cell)**

```{r mode-example, eval=FALSE}
query$sampling_strategy <- "mean estimation"
```

### fixed_total_count (logical, optional)

Controls whether the output library size is fixed to `total_count` instead of being taken from the reference:

- **`FALSE`** (default): The output's total count is taken from the reference expression (sum of its counts). Use this when you want the synthetic sample to preserve the reference's library size.
- **`TRUE`**: Forces the model to use the `total_count` parameter value (or default) instead of the reference's library size.

```{r fixed-total-count, eval=FALSE}
# Preserve reference library size (default)
query$fixed_total_count <- FALSE

# Or force a specific library size
query$fixed_total_count <- TRUE
query$total_count <- 10000000
```

### total_count (integer, optional)

Library size used when converting predicted log CPM back to raw counts. Only effective when `fixed_total_count = TRUE`.

- Default: 10,000,000 for bulk; 10,000 for single-cell
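
As a worked illustration of how library size enters the conversion: a CPM value is, by definition, a count scaled to a library of one million reads, so mapping CPM back to counts multiplies by `total_count / 1e6`. The sketch below uses made-up CPM values and a hypothetical helper, `cpm_to_counts()`. It leaves out the log transform, since the exact log base and offset used by the model are not stated here.

```{r total-count-sketch, eval=FALSE}
# Made-up CPM values for three genes; they sum to one million
toy_cpm <- c(geneA = 250000, geneB = 500000, geneC = 250000)

# Hypothetical helper: rescale CPM to counts for a given library size
cpm_to_counts <- function(cpm, total_count) {
  cpm / 1e6 * total_count
}

# Bulk default library size versus single-cell default library size
bulk_counts <- cpm_to_counts(toy_cpm, total_count = 10000000)
sc_counts <- cpm_to_counts(toy_cpm, total_count = 10000)

sum(bulk_counts)
sum(sc_counts)
```

The relative abundances are identical in both cases; only the overall scale of the counts changes.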

### deterministic_latents (logical, optional)

If `TRUE`, the model uses the mean of each latent distribution (`p(z|metadata)` for perturbation, `q(z|x)` for conditioned components) instead of sampling. This produces deterministic, reproducible outputs.

- Default: `FALSE`

```{r deterministic-example, eval=FALSE}
query$deterministic_latents <- TRUE
```

### seed (integer, optional)

Random seed for reproducibility.

```{r seed-example, eval=FALSE}
query$seed <- 42
```

## Valid Perturbation Metadata

| Field                      | Description / Format                                                                                                                                                                  |
|----------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `perturbation_ontology_id` | Ensembl gene ID (e.g., `ENSG00000141510`), [ChEBI ID](https://www.ebi.ac.uk/chebi/), [ChEMBL ID](https://www.ebi.ac.uk/chembl/), or [NCBI Taxonomy ID](https://www.ncbi.nlm.nih.gov/taxonomy) |
| `perturbation_type`        | One of: "coculture", "compound", "control", "crispr", "genetic", "infection", "other", "overexpression", "peptide or biologic", "shrna", "sirna"                                      |
| `perturbation_time`        | Time since perturbation (e.g., "24h", "48h")                                                                                                                                          |
| `perturbation_dose`        | Dose of perturbation (e.g., "10uM", "1mg/kg")                                                                                                                                         |
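
The allowed `perturbation_type` values in the table can be checked locally before submitting a query. This is a minimal sketch; `validate_perturbation_type()` is a hypothetical helper, and the exact-match, case-sensitive comparison is an assumption about how the API treats these strings.

```{r perturbation-type-sketch, eval=FALSE}
# Values copied from the table above
valid_perturbation_types <- c(
  "coculture", "compound", "control", "crispr", "genetic", "infection",
  "other", "overexpression", "peptide or biologic", "shrna", "sirna"
)

# Hypothetical helper (not part of rsynthbio): returns the type unchanged
# if it is one of the listed values, otherwise stops with an error.
validate_perturbation_type <- function(type) {
  is_single_string <- is.character(type) && length(type) == 1
  if (!is_single_string || !(type %in% valid_perturbation_types)) {
    stop(
      "perturbation_type must be one of: ",
      paste(valid_perturbation_types, collapse = ", ")
    )
  }
  type
}

validate_perturbation_type("crispr")
```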

## Working with Results

The result structure is similar to that of baseline models:

```{r results, eval=FALSE}
# Access metadata and expression matrices
metadata <- result$metadata
expression <- result$expression

# Inspect the dimensions and the first few metadata rows
dim(expression)
head(metadata)
```

### Differential Expression

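When conditioning on both biological and technical latents, you can directly compare the generated expression to your reference to identify perturbation effects:

```{r de-analysis, eval=FALSE}
# Reference (input) counts, scaled to counts per million
reference_cpm <- your_reference_counts / sum(your_reference_counts) * 1e6

# First generated (perturbed) sample as a plain numeric vector;
# unlist() also covers the case where `expression` is a data frame
generated_counts <- unlist(expression[1, ])
generated_cpm <- generated_counts / sum(generated_counts) * 1e6

# Log2 fold change with a pseudocount of 1
# (both vectors must be in the same gene order)
log2fc <- log2(generated_cpm + 1) - log2(reference_cpm + 1)

# Top 20 genes by absolute change, up or down
head(log2fc[order(abs(log2fc), decreasing = TRUE)], 20)
```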
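
When `num_samples` is greater than one, one option is to average CPM across the generated samples before computing the fold change. The sketch below uses a small made-up matrix in place of the API result and assumes, as the row indexing above does, that rows are samples and columns are genes. All names prefixed with `toy_` are illustrative.

```{r de-mean-sketch, eval=FALSE}
# Made-up reference counts and two made-up generated samples
toy_reference <- c(geneA = 100, geneB = 300, geneC = 600)
toy_generated <- matrix(
  c(200, 300, 500,
    220, 280, 500),
  nrow = 2, byrow = TRUE,
  dimnames = list(NULL, names(toy_reference))
)

# Scale a counts vector to counts per million
to_cpm <- function(x) x / sum(x) * 1e6

toy_reference_cpm <- to_cpm(toy_reference)

# apply() over rows returns genes x samples, so transpose back
toy_generated_cpm <- t(apply(toy_generated, 1, to_cpm))

# Average CPM per gene across the generated samples
toy_mean_cpm <- colMeans(toy_generated_cpm)

# Log2 fold change of the averaged profile against the reference
toy_log2fc <- log2(toy_mean_cpm + 1) - log2(toy_reference_cpm + 1)

# Rank genes by absolute change
toy_log2fc[order(abs(toy_log2fc), decreasing = TRUE)]
```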

## Important Notes

### Counts Vector Length

The reference counts vector must match the model's expected number of genes. If the length doesn't match, the API will return a validation error.

Use `get_example_query()` to see the expected structure and ensure your counts vector has the correct length.
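
One way to run this check locally is sketched below. `check_counts_length()` is a hypothetical helper, and it assumes the counts vector inside the example query has the length the model expects. The `mock_example_query` object stands in for the real output of `get_example_query()` so that the snippet is self-contained.

```{r counts-length-sketch, eval=FALSE}
# Hypothetical helper (not part of rsynthbio): compares the length of your
# counts vector with the counts vector found in an example query.
check_counts_length <- function(reference_counts, example_query) {
  expected <- length(example_query$inputs[[1]]$counts$counts)
  if (length(reference_counts) != expected) {
    stop(
      "Counts vector has length ", length(reference_counts),
      " but the example query has length ", expected
    )
  }
  invisible(TRUE)
}

# Stand-in for get_example_query(...)$example_query, with 4 placeholder genes
mock_example_query <- list(
  inputs = list(list(counts = list(counts = numeric(4))))
)

check_counts_length(c(10, 0, 3, 7), mock_example_query)
```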

### Gene Order

Ensure your reference counts are in the gene order the model expects. The response includes a `gene_order` field that specifies the expected order.
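
If your counts are stored as a named vector, reordering them to match `gene_order` might look like the sketch below. `align_to_gene_order()` is a hypothetical helper, and the gene names are placeholders. Filling genes that are absent from your data with zero is an assumption made for illustration; depending on your analysis, stopping with an error may be the better choice.

```{r gene-order-sketch, eval=FALSE}
# Hypothetical helper (not part of rsynthbio): reorders a named counts
# vector to a target gene order, zero-filling genes that are missing.
align_to_gene_order <- function(counts, gene_order) {
  stopifnot(!is.null(names(counts)))
  idx <- match(gene_order, names(counts))
  missing_genes <- gene_order[is.na(idx)]
  if (length(missing_genes) > 0) {
    warning(length(missing_genes), " gene(s) missing from counts; filled with 0")
  }
  aligned <- unname(counts[idx])
  aligned[is.na(idx)] <- 0
  names(aligned) <- gene_order
  aligned
}

# Placeholder gene names; in practice gene_order comes from the API response
toy_gene_order <- c("geneA", "geneB", "geneC")
toy_named_counts <- c(geneB = 20, geneA = 10)

align_to_gene_order(toy_named_counts, toy_gene_order)
```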