---
title: "Simulating data with cellmig"
author: "Simo Kitanovski (simo.kitanovski@uni-due.de)"
output:
  BiocStyle::html_document
vignette: >
  %\VignetteEncoding{UTF-8}
  %\VignetteIndexEntry{User manual: data simulation}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  markdown: 
    wrap: 72
---

```{r setup, include = FALSE, warning = FALSE}
knitr::opts_chunk$set(comment = FALSE, 
                      message = FALSE)
```

```{r}
library(cellmig)
library(ggplot2)
library(ggforce)
library(patchwork)
library(rstan)

ggplot2::theme_set(new = theme_bw(base_size = 10))
```

# Background

High-throughput cell migration assays generate complex, hierarchically
structured datasets with inherent technical noise and biological
variability. To accurately quantify treatment effects on cell migration
velocity, it is essential to account for these sources of variation and
to design experiments that provide sufficient statistical power. The
`r Biocpkg("cellmig")` R package implements a Bayesian hierarchical model 
for analyzing such data. See the "User manual" vignette to understand the
key functionalities of `r Biocpkg("cellmig")`.

Furthermore, `r Biocpkg("cellmig")` includes functionality for simulating
synthetic datasets. These simulations are invaluable for experimental
planning, allowing researchers to assess how different experimental
designs (e.g., varying numbers of biological replicates, technical
replicates, or cells per well) affect the precision of treatment effect
estimates.

This vignette demonstrates how to use `r Biocpkg("cellmig")` to simulate 
cell migration data under a partially generative model. These functionalities
are documented in this vignette. In short, we simulate data based on 
realistic parameter values derived from experimental data and
then analyze the simulated data using `r Biocpkg("cellmig")` to recover the
parameters. This process validates the model and illustrates how
simulation can guide experimental design and power analysis.


# Installation
To install this package, start R (version "4.5") and enter:

```{r, eval=FALSE}
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("cellmig")
```

# Data simulation from **partially generative** model

We simulate data for an experiment with the following design:

-   Six biological replicates (plates), indexed by $p = 1, 2, \dots, 6$.

-   Three technical replicates (wells) per treatment per plate, indexed
    by $t = 1, 2, 3$.

-   30 cells per well.

-   Seven treatment groups ($t = \{1, \dots, 7\}$) with fixed effects
    $\delta_t = [-0.4, -0.2, -0.1, 0, 0.1, 0.2, 0.4]$.

-   Biological replicate variability: $\sigma_{\text{bio}} = 0.1$.

-   Technical replicate variability: $\sigma_{\text{tech}} = 0.05$.

## Model specification

### Setting $\alpha_p$

Plate-specific batch effects are modeled by drawing the plate offset
$\alpha_p$ from a normal distribution:

$$
\alpha_p \sim \text{Normal}(1.7, 0.1)
$$

Here, $\alpha_p$ represents the log-transformed mean velocity of the
control treatment on plate $p$. The mean (1.7) and standard deviation
(0.1) are chosen based on experimental data. If batch effects are
minimal, the standard deviation can be set to a smaller value (e.g.,
0.01).

### Setting $\kappa_w$

The velocities in each well are gamma distributed, and the gamma
distribution is defined by a rate and shape parameter. The shape
parameter for well $w$, $\kappa_w$, controls the form or skewness of the
distribution and, to some extent, the variability in the data.

We simulate the values of $\log(\kappa_w)$ by drawing randomly from a
normal distribution with mean ($\mu_{\kappa}$) and standard deviation
(${\sigma}_{\kappa}$). The values of $\mu_{\kappa}$ and
${\sigma}_{\kappa}$ can either be fixed to point values, or be drawn
from two normal distributions whose means and standard deviations are
described by the inputs `prior_kappa_mu_M`, `prior_kappa_sigma_M`,
`prior_kappa_mu_SD`, and `prior_kappa_sigma_SD`. To fix the values of
$\mu_{\kappa}$ and ${\sigma}_{\kappa}$ to point estimates set the scales
to small values (e.g., set in `control` `prior_kappa_mu_SD`=0.01 and
`prior_kappa_sigma_SD`).

### Defining the offset

To correct for batch effects across plates, we designate one treatment
as the reference (offset). Here, we set the offset to treatment 4 (i.e.,
the fourth treatment group, which has $\delta_t = 0$).

## Generates dataframe with simulated velocities

We use the `gen_partial` function from `r Biocpkg("cellmig")` to simulate the 
data. The function requires a control list specifying the experimental design
and model parameters.

```{r}
# set seed for reproducible results
set.seed(seed = 1253)
# simulate data
y_p <- gen_partial(control = list(N_biorep = 8, 
                                  N_techrep = 5, 
                                  N_cell = 50, 
                                  delta = c(-0.4, -0.2, -0.1, 0, 0.1, 0.2, 0.4),
                                  sigma_bio = 0.1, 
                                  sigma_tech = 0.05, 
                                  offset = 4,
                                  prior_alpha_p_M = 1.7,
                                  prior_alpha_p_SD = 0.1,
                                  prior_kappa_mu_M = 1.7,
                                  prior_kappa_mu_SD = 0.1,
                                  prior_kappa_sigma_M = 0,
                                  prior_kappa_sigma_SD = 0.1))
```

```{r}
# see data structure
str(y_p$y)
```

The output is a dataframe with columns:

-   `v`: Cell velocity

-   `well`: Well identifier

-   `plate`: Plate identifier

-   `group`: Treatment group

-   `compound`: Compound identifier (same as group in this simulation)

-   `dose`: Dose level (not used in this simulation, set to "X")

**Visualizing the simulated data**

We can visualize the simulated cell velocities using scatter plots, with
points jittered by well and colored by plate.

```{r, fig.width=7, fig.height=3}
ggplot(data = y_p$y)+
  geom_sina(aes(x = paste0("t=", group), col = paste0("p=", plate), 
                y = v, group = well), size = 0.5)+
  theme_bw()+
  theme(legend.position = "top",
        strip.text.x = element_text(margin = margin(0.03,0,0.03,0, "cm")))+
  xlab(label = "dose")+
  ylab(label = "migration speed")+
  guides(color = guide_legend(override.aes = list(size = 3)))+
  scale_color_grey(name = "plate")+
  scale_y_log10()
```

Alternatively, we can visualize the well-specific mean velocities to
highlight plate-specific batch effects:

```{r, fig.width=7, fig.height=3}
ggplot(data = aggregate(v~well+group+plate, data = y_p$y, FUN = mean))+
  geom_sina(aes(x = paste0("t=", group), col = paste0("p=", plate), 
                y = v, group = well), size = 1)+
  theme_bw()+
  theme(legend.position = "top",
        strip.text.x = element_text(margin = margin(0.03,0,0.03,0, "cm")))+
  xlab(label = "dose")+
  ylab(label = "migration speed")+
  guides(color = guide_legend(override.aes = list(size = 3)))+
  scale_y_log10()+
  scale_color_grey(name = "plate")
```

### Analyzing simulated data with `r Biocpkg("cellmig")`

The simulated data must be formatted to match the input requirements of
`r Biocpkg("cellmig")`. Specifically, we need to convert some columns to 
character and add an `offset` column indicating the reference treatment.

```{r, fig.width=7, fig.height=3.5}
sim_data <- y_p$y

# format simulated data to be used as input in cellmig
sim_data$well <- as.character(sim_data$well)
sim_data$compound <- as.character(sim_data$compound)
sim_data$plate <- as.character(sim_data$plate)
sim_data$offset <- 0
sim_data$offset[sim_data$group=="4"] <- 1
```

We now analyze the simulated data using `r Biocpkg("cellmig")` with the 
following control parameters:

```{r, fig.width=7, fig.height=3.5}
osd <- cellmig(x = sim_data,
               control = list(mcmc_warmup = 250,
                              mcmc_steps = 800,
                              mcmc_chains = 2,
                              mcmc_cores = 2,
                              mcmc_algorithm = "NUTS",
                              adapt_delta = 0.8,
                              max_treedepth = 10))
```

### Comparing simulated vs. inferred parameters

We compare the inferred parameters (posterior means and 95% HDIs) with
the true values used in the simulation.

#### Plate-specific offsets ($\alpha_p$)

```{r, fig.width=4, fig.height=2}
ggplot(data = osd$posteriors$alpha_p)+
  geom_point(aes(y = plate_id, x = exp(mean)))+
  geom_errorbarh(aes(y = plate_id, x = exp(mean), xmin = exp(X2.5.), 
                     xmax = exp(X97.5.)), height = 0.1)+
  geom_point(data = data.frame(v = exp(y_p$par$alpha_p),
                               plate = osd$posteriors$alpha_p$plate_id),
             aes(y = plate, x = v), col = "red")+
  scale_y_continuous(breaks = osd$posteriors$alpha_p$plate_id)+
  xlab(label = expression(alpha[p]*"'"))
```

#### Effect sizes ($\delta_t$)

```{r, fig.width=4, fig.height=2.5}
ggplot(data = osd$posteriors$delta_t)+
  geom_point(aes(y = group_id, x = mean))+
  geom_errorbarh(aes(y = group_id, x = mean, xmin = X2.5., xmax = X97.5.), 
                 height = 0.1)+
  geom_point(data = data.frame(delta = c(-0.4, -0.2, -0.1, 0.1, 0.2, 0.4), 
                               group_id = 1:6),
             aes(y = group_id, x = delta), col = "red")+
  xlab(label = expression(delta[t]))+
  ylab(label = "treatment")+
  scale_y_continuous(breaks = 1:8)
```

#### Biological ($\sigma_{bio}$) and technical ($\sigma_{tech}$) variability 

```{r, fig.width=4, fig.height=2}
plot(osd$fit, par = c("sigma_bio", "sigma_tech"))
```


# Data simulation from **fully generative** model 
We can also use `r Biocpkg("cellmig")` to generate cell velocities from 
the prior distributions of the parameters $\rightarrow$ data from prior 
predictive distribution.

In this scenario, the parameters, such as $\delta_t$, will not be fixed to
point values but rather generated from their priors.

## Model specification

```{r}
control <- list(N_biorep = 4, 
                N_techrep = 3, 
                N_cell = 30, 
                N_group = 8,
                prior_alpha_p_M = 1.7,
                prior_alpha_p_SD = 1.0,
                prior_kappa_mu_M = 1.7,
                prior_kappa_mu_SD = 1.0,
                prior_kappa_sigma_M = 0,
                prior_kappa_sigma_SD = 1.0,
                prior_sigma_bio_M = 0.0,
                prior_sigma_bio_SD = 1.0,
                prior_sigma_tech_M = 0.0,
                prior_sigma_tech_SD = 1.0,
                prior_mu_group_M = 0.0,
                prior_mu_group_SD = 1.0)
```


```{r}
# generate data
y_f <- gen_full(control = control)
```

The generate data has a similar structure as before:

```{r}
str(y_f)
```


## Compare density of velocities from partially vs. fully generative models

```{r, fig.width=4, fig.height=4}
w <- data.frame(v_f = y_f$y$v[1:2000], v_p = y_p$y$v[1:2000])
 ggplot(data = w)+
    geom_point(aes(x = v_f, y = v_p), size = 0.5)+
    geom_density_2d(aes(x = v_f, y = v_p), col = "orange")+
    scale_x_log10(name = "Velocity from fully generative model", 
                  limits = c(0.01, 10^4))+
    scale_y_log10(name = "Velocity from partially generative model", 
                  limits = c(0.01, 10^4))+
    annotation_logticks(base = 10, sides = "lb")+
    theme_bw(base_size = 10)
```

# Power analysis
How many biological replicates do we need to capture certain effect size 
($\delta$)? Similar question can be asked in order to optimize the number 
of technical replicates or the number of cells per well.

The code below takes considerable time to be executed. This is because we 
generate N_biorep times N_sim datasets and fit a model. To avoid this during
R-package checks, the following chunks will not be evaluated. Do this 
yourselves.

**Important note**: using multicore execution you can considerably speed-up 
this analysis!

```{r, eval = FALSE, message=FALSE, warning=FALSE}
# Constants
N_bioreps <- c(3, 6, 9)
N_sim <- 10
true_deltas <- c(-0.3, -0.15, 0, 0.2, 0.4)
offset <- 3

# power analysis
deltas <- vector(mode = "list", length = length(N_bioreps)*N_sim)
i <- 1
for(N_biorep in N_bioreps) {
  for(b in 1:N_sim) {
    # simulate data
    y_p <- gen_partial(control = list(N_biorep = N_biorep, 
                                      N_techrep = 3, 
                                      N_cell = 40, 
                                      delta = true_deltas,
                                      sigma_bio = 0.1, 
                                      sigma_tech = 0.05, 
                                      offset = offset,
                                      prior_alpha_p_M = 1.7,
                                      prior_alpha_p_SD = 0.1,
                                      prior_kappa_mu_M = 1.7,
                                      prior_kappa_mu_SD = 0.1,
                                      prior_kappa_sigma_M = 0,
                                      prior_kappa_sigma_SD = 0.1))
    
    # format simulated data to be used as input in cellmig
    sim_data <- y_p$y
    sim_data$well <- as.character(sim_data$well)
    sim_data$compound <- as.character(sim_data$compound)
    sim_data$plate <- as.character(sim_data$plate)
    sim_data$offset <- 0
    sim_data$offset[sim_data$group==offset] <- 1
    
    # run cellmig
    o <- cellmig(x = sim_data,
                   control = list(mcmc_warmup = 300,
                                  mcmc_steps = 1000,
                                  mcmc_chains = 1,
                                  mcmc_cores = 1,
                                  mcmc_algorithm = "NUTS",
                                  adapt_delta = 0.8,
                                  max_treedepth = 10))
    
    delta <- o$posteriors$delta_t
    delta$b <- b
    delta$N_biorep <- N_biorep
    delta$true_deltas <- true_deltas[-offset]
    delta$TP <- (delta$X2.5. <= delta$true_deltas & 
                   delta$X97.5. >= delta$true_deltas) &
      !(delta$X2.5. <= 0 & delta$X97.5. >= 0)
    deltas[[i]] <- delta
    i <- i + 1
  }
}

rm(i, delta, o, sim_data, y_p, b, N_biorep, offset, true_deltas)

deltas <- do.call(rbind, deltas)
```

What do you see? For Large effect ($|\delta_t|>>0$) we need few biological 
replicates to capture the true effect and not the null effect ($\delta_t=0$),
i.e., in nearly all simulations we have a true positive. For smaller effects,
this is not the case, and we probably need many more biological replicates
to each a high true positive rate.

```{r, eval = FALSE, fig.height=3, fig.width = 6}
ggplot(data = aggregate(TP~N_biorep+true_deltas, data = deltas, FUN = sum))+
  geom_point(aes(x = N_biorep, y = TP, col = abs(true_deltas), 
                 group = as.factor(true_deltas)), size = 2, alpha = 0.5)+
  geom_line(aes(x = N_biorep, y = TP, col = abs(true_deltas), 
                group = as.factor(true_deltas)), alpha = 0.5)+
  ylab(label = "Number of true positives (TPs)")+
  scale_color_distiller(name = expression("|"*delta[t]*"|"),palette="Spectral")+
  theme_bw(base_size = 10)
```


# Session Info

```{r}
sessionInfo()
```