---
title: "cellmig: quantifying cell migration with hierarchical Bayesian models"
author: "Simo Kitanovski (simo.kitanovski@uni-due.de)"
output:
  BiocStyle::html_document
vignette: >
  %\VignetteEncoding{UTF-8}
  %\VignetteIndexEntry{User Manual: cellmig}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  markdown: 
    wrap: 72
---

```{r setup, include = FALSE, warning = FALSE}
knitr::opts_chunk$set(comment = FALSE, 
                      warning = FALSE, 
                      message = FALSE)
```

```{r}
library(cellmig)
library(ggplot2)
library(ggforce)
ggplot2::theme_set(new = theme_bw(base_size = 10))
```

# Background

High-throughput tracking of cells with time-lapse microscopy followed by
the acquisition of images at ﬁxed time intervals facilitates the
analysis of cell migration across many wells treated under different
biological conditions. These workflows generate considerable technical
noise and biological variability, and therefore technical and biological
replicates are necessary, leading to large, hierarchically structured
datasets, i.e., cells are nested within technical replicates that are
nested within biological replicates.

Current statistical analyses of such data usually ignore the
hierarchical structure of the data and fail to explicitly quantify
uncertainty arising from technical or biological variability. To address
this gap, we present `r Biocpkg("cellmig")`, an R package implementing
Bayesian hierarchical models for migration analysis.
`r Biocpkg("cellmig")` quantifies condition-specific velocity changes
(e.g., drug effects) while modeling nested data structures and technical
artifacts, providing uncertainty-aware estimates through credible
intervals.

There are currently no Bioconductor packages providing specialized
statistical methods for analyzing hierarchical high-throughput cell
migration data. `r Biocpkg("cellmig")` addresses this gap and will
represent a valuable addition to the ecosystem.


# Installation
To install this package, start R (version "4.5") and enter:

```{r, eval=FALSE}
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("cellmig")
```


# Data

This is how a typical cell migration data looks like $\rightarrow$ a
table.

Each rows is a cell with the following features:

-   well = unique well ID (w1, w2, w3, etc.).
-   plate = unique plate ID (p1, p2, p3, etc.). Each plate is a
    biological replicate. A plate contains multiple wells, some of which
    are treated with the same compound and dose (technical replicates)
-   compound = compound name (c1, c2, c3, etc.)
-   dose = compound concentration (0, 1, 5, 10, low, mid, high, etc.)
-   v = Observed cell migration velocity (numeric)
-   offset = binary (0 or 1). Indicates whether a treatment should be
    used for batch correction across plates. By default offset = 0 (no
    correction). Set to 1 for specific treatment groups (compound x
    dose) used as offsets. Ensure that this treatment group appears on
    each plate.

```{r}
data("d", package = "cellmig")
str(d)
```

In this vignette we will use simulated data from:

-   **plates** ($p$): 1, 2, ... , 3
-   **wells** ($w$): 1, 2, ... , 378
-   **cells** per well with their migration velocity `v`
-   wells are treated with **compounds** 1, 2, ..., 6 at **dose** 1, 2,
    ..., 7.
-   combination of a compound and dose is a **treatment group** ($t$)
    $\rightarrow$ 1, 2, ..., 42.

Let's visualize the data. Each dot represents a cell with its velocity
on the y-axis. Each facet corresponds to a compound (e.g., specific drug
that may affect cellular velocity). The x-axis represents the dose.
There are three plates, indicated by color. Four technical replicates
(wells), analyzed on the same plate, are stacked next to each other and
have the same color.

```{r, fig.width=7, fig.height=6}
ggplot(data = d)+
  facet_wrap(facets = ~paste0("compound=", compound), 
             scales = "free_y", ncol = 2)+
  geom_sina(aes(x = as.factor(dose), col = plate, y = v, group = well), 
            size = 0.5)+
  theme_bw()+
  theme(legend.position = "top",
        strip.text.x = element_text(margin = margin(0.03,0,0.03,0, "cm")))+
  ylab(label = "migration velocity")+
  xlab(label = '')+
  scale_color_grey()+
  guides(color = guide_legend(override.aes = list(size = 3)))+
  guides(shape = guide_legend(override.aes = list(size = 3)))+
  scale_y_log10()+
  annotation_logticks(base = 10, sides = "l")
```

## **Mean migration velocity** per well

Alternatively, we can visualize the well-specific mean velocities to
highlight plate-specific batch effects.

```{r, fig.width=7, fig.height=6}
dm <- aggregate(v~well+plate+compound+dose, data = d, FUN = mean)
ggplot(data = dm)+
  facet_wrap(facets = ~paste0("compound=", compound), 
             scales = "free_y", ncol = 2)+
  geom_sina(aes(x = as.factor(dose), col = plate, y = v, group = well), 
            size = 1.5, alpha = 0.7)+
  theme_bw()+
  theme(legend.position = "top",
        strip.text.x = element_text(margin = margin(0.03,0,0.03,0, "cm")))+
  ylab(label = "migration velocity")+
  xlab(label = '')+
  scale_color_grey()+
  guides(color = guide_legend(override.aes = list(size = 3)))+
  guides(shape = guide_legend(override.aes = list(size = 3)))+
  scale_y_log10()+
  annotation_logticks(base = 10, sides = "l")
```

# `cellmig` analysis

We will use this data to infer the **overall treatment effects**
(parameter $\delta_t$), relative to a control treatment (the offset) to
correct for plate-specific batch effects. At the same time,
`r Biocpkg("cellmig")` will quantify many different features of the data
using its model parameters (e.g., variability between technical or
biological replicates; or plate-specific treatment effects
($\gamma_{pt}$)).

## Model fitting

We fit the Stan model employed by `r Biocpkg("cellmig")` with the
control parameters defined in the list `control`. There are many other
input parameters in `control`, check the `cellmig` function documentation.

```{r, fig.width=7, fig.height=3.5}
o <- cellmig(x = d,
             control = list(mcmc_warmup = 300, # nr. of MCMC warmup step?
                            mcmc_steps = 1000, # nr. of MCMC iteration steps?
                            mcmc_chains = 2,   # nr. of MCMC chains
                            mcmc_cores = 2))   # nr. of MCMC cores
```

## What are the **overall treatment effects** ($\delta_t$) on velocity?

To extract the means, medians, and 95% Highest Density Intervals (HDIs,
quantifying parameter value uncertainty) of $\delta_t$, we have to
access the data.frame `delta_t` in the output object `posteriors`:

```{r}
str(o$posteriors$delta_t)
```

It is better to visualize the mean $\delta_t$s and their 95% HDIs

-   Dot: Posterior mean of $\delta_t$
-   Error bar: 95% highest density interval (HDI) of $\delta$
-   $\exp(\delta)$: Fold change in cell velocity relative to control

As compound t=1 was selected as control (by setting offset=1), the
treatment effects of this compounds are not shown.

```{r, fig.width=6, fig.height=3.3}
ggplot(data = o$posteriors$delta_t)+
  geom_line(aes(x = dose, y = mean, col = compound, group = compound))+
  geom_point(aes(x = dose, y = mean, col = compound))+
  geom_errorbar(aes(x = dose, y = mean, ymin = X2.5., ymax = X97.5., 
                    col = compound), width = 0.1)+
  ylab(label = expression("Overall treatment effect ("*delta*")"))+
  theme(legend.position = "top")
```

## Compare the dose-response `profiles` for different compounds

For "rectangular datasets", i.e. datasets with multiple compounds and
overlapping doses, we can study the treatment dose-response profiles by
hierarchical clustering based on the complete posteriors of $\delta_t$,
account for uncertainty in this parameter.

Panel A: dendrogram constructed by hierarchical clustering with average
linkage, based on euclidean distances between vectors of $\delta_t$
(shown in panel B) of each compound (leaf) across doses. Branch support
values show branch robustness (label = 1000 implies this branch was
encountered in each of the 1000 dendrograms constructed from the
posterior of $\delta_t$). Plate-specific treatment dose-responses based
on parameters $\gamma_pt$.

-   Dot in panel B/C: Posterior mean of $\delta_t$ and $\gamma_pt$
-   Error bar: 95% highest density interval (HDI)

```{r, fig.width=9.5, fig.height=5}
get_dose_response_profile(x = o)+
  patchwork::plot_layout(widths = c(.7, 1, 4))
```

## Compare the effects between treatment group

Pairwise dot-plot comparison $\rightarrow$ x minus y axis

(Left panel) Differences in overall treatment effects. Log fold change
(LFC; described by parameter $\rho_{ij}$) between overall treatments
effects ($\delta_t$) of row ($i$) vs. column ($j$) treatment groups.
Tile colors and labels represent $\rho_{ij}$. (Right panel) Probability
of differential treatment effect described by parameter $\pi_{ij}$. Tile
colors and labels represent $\pi_{ij}$.

-   x/y-axis treatment groups (combinations of compounds and doses)
-   $\rho$: Difference between treatment groups at y-x axis.
-   $\pi$: probability of observing either a completely positive or
    negative $\rho$

```{r, fig.width=14, fig.height=6}
u <- get_pairs(x = o, exponentiate = FALSE)
u$plot
```

## Violin plot based comparison

-   from_groups: vector of treatment groups to consider (combinations of
    compounds and doses)
-   to_group: target treatment group
-   violins show the posterior distributions of the differences ($\rho$:
    each element from `from_groups` vs. `to_group`).
-   label: probability, $\pi$, of observing completely positive or
    negative $\rho$

```{r, fig.width=7, fig.height=2.5}
get_groups(x = o)
u <- get_violins(x = o, 
                 from_groups = get_groups(x = o)$group,
                 to_group = "C2|D1",
                 exponentiate = FALSE)
u$plot
```

## Posterior predictive checks (PPCs)

To assess model validity, we performed posterior predictive checks,
which showed that the simulated data (pink violin) were consistent with
the observed data (black violins). Each dot is a cells.

```{r, fig.width=5, fig.height=6}
g <- get_ppc_violins(x = o, wrap = TRUE, ncol = 3)
g+scale_y_log10()
```

Using posterior predictive checks we compared the mean simulated
velocity per well (y-axis) with the observed mean per well (x-axis).
Each dot is a well.

```{r, fig.width=4, fig.height=4}
g <- get_ppc_means(x = o)
g
```

## Inspecting other model parameters

```{r, fig.height=2, fig.width=6}
g_alpha_p <- ggplot(data = o$posteriors$alpha_p)+
  geom_errorbarh(aes(y = plate_id, x = mean, xmin = X2.5., xmax = X97.5.),
                 height = 0.1)+
  geom_point(aes(y = plate_id, x = mean))

g_sigma <- ggplot()+
  geom_errorbarh(data = o$posteriors$sigma_bio,
                 aes(y = "sigma_bio",
                     x = mean, xmin = X2.5., xmax = X97.5.),
                 height = 0.1)+
  geom_errorbarh(data = o$posteriors$sigma_tech,
                 aes(y = "sigma_tech",
                     x = mean, xmin = X2.5., xmax = X97.5.),
                 height = 0.1)+
  geom_point(data = o$posteriors$sigma_bio,
             aes(y = "sigma_bio", x = mean))+
  geom_point(data = o$posteriors$sigma_tech,
             aes(y = "sigma_tech", x = mean))+
  ylab(label = '')

g_alpha_p|g_sigma
```

# Session Info

```{r}
sessionInfo()
```