---
title: 'Tree based latent variables correlation model'
author: "Anna Freni-Sterrantino, Denis Rustand, Janet van Niekerk, Elias T Krainski, and H\\aa vard Rue"
date: "Started in November 2023, updated in `r format(Sys.time(), '%B, %Y')`"
output:
  - rmarkdown::html_document
  - rmarkdown::pdf_document
vignette: >
  %\VignetteIndexEntry{Tree based latent variables correlation model}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
header-includes:
  - \usepackage{lscape}
  - \usepackage{multicol}
bibliography: references.bib
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
    echo = TRUE,
    message = FALSE,
    warning = FALSE,
    fig.width = 9,
    fig.height = 5,
    out.width = '0.49\\textwidth',
    fig.align = 'center')
library(graphpcor)
chinclude = TRUE
```

This material illustrates and details the model properties of the graphical approach to modeling correlations proposed in @sterrantino2025graphical. We first formalize the specific kind of graph used to define the model and provide an intuitive visualization. We then use some examples for illustration and show how to implement the proposed tree based correlation model. Details of the model definition are given at the end.

# The directed tree model definition

A graph is defined from a set of nodes and a set of edges, where the nodes are linked by the edges. A directed acyclic graph considers directed edges linking the nodes, with no closed loop. A tree is a graph in which there is only one possible path between any pair of nodes. Here, the edges of the tree are directed edges, which also imposes a hierarchy on the nodes.

## The parents and children tree

We consider a particular tree definition with two different classes of nodes, representing two kinds of variables. We classify the nodes as parent or children. The nodes classified as children are leaves of the tree, that is, nodes with an ancestor (parent) but with no children.
The nodes classified as parent always have children nodes; some may themselves have a parent, but they are still classified as parent because they have children. The directed edges (arrows) represent conditional distributions. At the end of this material we provide detailed definitions and examples.

Suppose we have $m$ variables of interest and want to model their correlation. These $m$ variables are represented by the children nodes, labeled $c_i$, $i\in\{1,\ldots,m\}$. The parent nodes are labeled $p_j$, $j\in\{1,\ldots,k\}$. The parent $p_1$, which has no parent itself, is the main parent and the root of the tree. We show some examples in Figure 1 below.

```{r graphs, echo = FALSE, fig.width = 15, fig.height = 7, out.width="99%", fig.cap = "Graphs representing different models. A model with one parent, (A), a model with two parents (B), and two different models with three parents (C) and (D)."}
tree1 <- treepcor(p1 ~ c1 + c2 - c3)
tree2 <- treepcor(p1 ~ p2 + c1 + c2, p2 ~ -c3 + c4)
tree3a <- treepcor(p1 ~ p2 + p3 + c1 + c2, p2 ~ -c3 + c4, p3 ~ c5 + c6)
tree3b <- treepcor(p1 ~ p2 + c1 + c2, p2 ~ p3 - c3 + c4, p3 ~ c5 + c6)
par(mfrow = c(2, 2), mar = c(0, 0, 0, 0))
plot(tree1)
legend("topleft", "(A)", bty = "n")
plot(tree2)
legend("topleft", "(B)", bty = "n")
plot(tree3a)
legend("topleft", "(C)", bty = "n")
plot(tree3b)
legend("topleft", "(D)", bty = "n")
```

**NOTE**: The child variable `c3` enters with a negative sign, assuming it is negatively correlated with the other children variables.

When a graph representing conditional distributions is marginalized, the covariance between two nodes is non-zero as long as there is a path linking them in the graph. The same applies in our approach. Moreover, the strength of the correlation depends on how many parents there are in the path linking them, and on their hierarchy.
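To make this path-induced correlation concrete, here is a small base-R simulation of the generative process behind graph (B), with all $\gamma_j = 1$ (the $\gamma_j$ values, sample size and seed are illustrative choices; this sketch does not use the package functions):

```r
## Generative process behind graph (B): p2 is a child of p1,
## c1 and c2 hang from p1, c3 (negative sign) and c4 hang from p2.
set.seed(1)            # seed and sample size are illustrative choices
n <- 1e5
p1 <- rnorm(n, 0, 1)   # root parent, gamma_1 = 1
p2 <- rnorm(n, p1, 1)  # parent conditional on p1, gamma_2 = 1
c1 <- rnorm(n, p1, 1)  # children have unit conditional variance
c2 <- rnorm(n, p1, 1)
c3 <- rnorm(n, -p2, 1) # negative sign on c3
c4 <- rnorm(n, p2, 1)
## pairs sharing only p1: ~ 0.5; sharing p2: ~ -2/3; across parents: ~ +/- 0.41
round(cor(cbind(c1, c2, c3, c4)), 2)
```

The empirical correlation is strongest for the pair sharing the deeper parent $p_2$, weakest for pairs linked across the two parents, with the pair sharing only $p_1$ in between, which is exactly the ordering derived next.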
## The implied covariance

Let us denote the correlation between $a$ and $b$ by C($a$,$b$). In the model defined with graph (A), |C($c_1$,$c_2$)| = |C($c_1$,$c_3$)| = |C($c_2$,$c_3$)|. From graph (B), |C($c_1$,$c_3$)| = |C($c_1$,$c_4$)| = |C($c_2$,$c_3$)| = |C($c_2$,$c_4$)| $\leq$ |C($c_1$,$c_2$)| $\leq$ |C($c_3$,$c_4$)|. With the models defined from graphs (C) and (D) we have |C($c_1$,$c_2$)| $\leq$ |C($c_3$,$c_4$)|, |C($c_1$,$c_2$)| $\leq$ |C($c_5$,$c_6$)|, |C($c_1$,$c_3$)| = |C($c_1$,$c_4$)| = |C($c_2$,$c_3$)| = |C($c_2$,$c_4$)|, and |C($c_1$,$c_5$)| = |C($c_1$,$c_6$)| = |C($c_2$,$c_5$)| = |C($c_2$,$c_6$)|. If we do not want to fix the order between |C($c_3$,$c_4$)| and |C($c_5$,$c_6$)| we can use the model from graph (C), whereas model (D) imposes |C($c_3$,$c_4$)| $\leq$ |C($c_5$,$c_6$)| and forces the correlation between a child of $p_3$ and a child of $p_1$ to be weaker than that between a child of $p_3$ and a child of $p_2$, e.g. |C($c_1$,$c_5$)| $\leq$ |C($c_3$,$c_5$)|.

These results follow from the laws of total expectation, variance and covariance. Let Var($p_j$) = $\gamma_j^2$ and take the conditional variance of each child as Var($c_i$|$p_{c_i}$) = 1. Then the marginal variance of $c_i$ equals the sum of the variances of its ancestors plus 1.

# Illustrative code examples

In the \textbf{\textsf{R}} environment we represent the parent variables with the letter ``p`` followed by an integer (``p1``, $\ldots$, ``pk``) and the children variables with the letter ``c`` followed by an integer (``c1``, $\ldots$, ``cm``). We adopt a simple way to specify the parent-children relationship: the ``~`` (tilde) represents the directed link, while ``+`` (plus) or ``-`` (minus) appends a descendant to a parent. E.g.: ``p1 ~ p2 + c1 + c2, p2 ~ c3 - c4``.

## Initial example

Let us consider a correlation model for three variables with a single parameter, so that the absolute correlation is the same between each pair while the sign may differ.
This can be the case when these variables share the same latent factor. The parent is represented as ``p1``, and the children variables as ``c1``, ``c2`` and ``c3``. We consider that ``c3`` is negatively correlated with ``c1`` and ``c2``. For this case we define the tree as

```{r graph1, include = chinclude}
tree1 <- treepcor(p1 ~ c1 + c2 - c3)
tree1
summary(tree1)
```

where the summary shows their relationships. The number of children and parent variables is obtained with

```{r dim1}
dim(tree1)
```

This tree can be visualized with

```{r visgraph1, include = chinclude}
plot(tree1)
```

From this model definition we use the methods to build the correlation matrix. First, we build the precision matrix structure (not yet a precision matrix):

```{r qs1, include = chinclude}
prec(tree1)
```

and we can inform the log of $\gamma_1$, the standard deviation of $p_1$, with:

```{r q, include = chinclude}
q1 <- prec(tree1, theta = 0)
q1
```

We can obtain the correlation matrix, our primary interest, from the precision matrix. However, there is also a covariance method that can be applied directly:

```{r v1, include = chinclude}
vcov(tree1) ## assume theta = 0 (\gamma_1 = 1)
vcov(tree1, theta = 0.5) # \gamma_1^2 = exp(2 * 0.5) = exp(1)
cov1a <- vcov(tree1, theta = 0)
cov1a
```

from which we obtain the desired matrix with

```{r c1, include = chinclude}
c1 <- cov2cor(cov1a)
round(c1, 3)
```

## Correlation matrix with two parameters

In this example, we model the correlation between four variables using two parameters. We consider ``c1`` and ``c2`` as children of the same parent, ``p1``, and ``c3`` and ``c4`` as children of a second parent, ``p2``. We want the correlation between ``c3`` and ``c4`` to be higher than the correlation between ``c1`` and ``c3``. This requires ``p2`` to be a child of ``p1``.
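This requirement can be checked by hand before building anything with the package. The following base-R sketch constructs the implied covariance between the children directly from the law of total (co)variance, assuming $\gamma_1^2 = \gamma_2^2 = 1$ (illustrative values):

```r
## Children covariance implied by the tree: Var(p1) = g1,
## Var(p2 | p1) = g2, unit conditional variance for each child,
## and c3 entering with a negative sign.
g1 <- 1  # gamma_1^2 (illustrative)
g2 <- 1  # gamma_2^2 (illustrative)
nm <- c("c1", "c2", "c3", "c4")
V <- matrix(0, 4, 4, dimnames = list(nm, nm))
V["c1", "c2"] <- g1                    # c1 and c2 share only p1
V["c1", "c3"] <- V["c2", "c3"] <- -g1  # path through p1; c3 carries sign -1
V["c1", "c4"] <- V["c2", "c4"] <- g1
V["c3", "c4"] <- -(g1 + g2)            # c3 and c4 share p2, which inherits p1's variance
V <- V + t(V)
diag(V) <- c(g1 + 1, g1 + 1, g1 + g2 + 1, g1 + g2 + 1)
round(cov2cor(V), 3)
```

With these values, |C($c_3$,$c_4$)| = 2/3 exceeds |C($c_1$,$c_3$)| = $1/\sqrt{6} \approx 0.41$ as required, while |C($c_1$,$c_2$)| = 1/2 sits in between, matching the ordering stated for graph (B).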
The tree for this is set by

```{r graph2}
tree2 <- treepcor(
    p1 ~ p2 + c1 + c2,
    p2 ~ -c3 + c4)
dim(tree2)
tree2
summary(tree2)
```

which can be visualized by

```{r visgraph2}
plot(tree2)
```

We can use the `drop1` method to remove the last parent with

```{r drop1}
drop1(tree2)
```

which aligns with the prior for the model parameters proposed in @sterrantino2025graphical. However, the code implementing that prior is internal C code and does not use `drop1`, which is plain R code. To apply this to real data we recommend the approach shown further ahead in this vignette.

We now have two parameters: $\gamma_1^2$, the variance of $p_1$, and $\gamma_2^2$, the conditional variance of $p_2$. For $\gamma_1 = \gamma_2 = 1$, the precision matrix can be obtained with:

```{r q2}
q2 <- prec(tree2, theta = c(0, 0))
q2
```

The correlation matrix can be obtained with

```{r c2}
cov2 <- vcov(tree2, theta = c(0, 0))
cov2
c2 <- cov2cor(cov2)
round(c2, 3)
```

## Playing with sign

We can change the sign at any edge of the graph. A sign change on a parent-to-child edge is simpler to interpret, as we can see in the covariance/correlation from the two examples above. Let us consider the second example, but change the sign between the parents and swap the sign in both terms of the second equation:

```{r graph2b}
tree2b <- treepcor(
    p1 ~ -p2 + c1 + c2,
    p2 ~ -c3 + c4)
tree2b
summary(tree2b)
```

This gives the precision matrix

```{r prec2}
q2b <- prec(tree2b, theta = c(0, 0))
q2b
```

The covariance (and the correlation) between children computed from the full precision is the same as before:

```{r cov2b}
all.equal(solve(q2)[1:4, 1:4], solve(q2b)[1:4, 1:4])
```

Therefore, allowing flexibility in an edge from one parent to another parent is not useful and would only add complexity, so we do not consider it in the `vcov` method.

**NOTE**: The `vcov` method applied to a `treepcor` does not take into account the sign between parent variables!
So, please use it with care. If we just want to compute the raw covariance, assuming V($p_j$) = 1, to see the law of total (co)variance at work for the four examples in Figure 1, we have

```{r vfig1}
vcov(tree1, raw = TRUE)
vcov(tree2, raw = TRUE)
tree3a <- treepcor(p1 ~ p2 + p3 + c1 + c2, p2 ~ -c3 + c4, p3 ~ c5 + c6)
vcov(tree3a, raw = TRUE)
tree3b <- treepcor(p1 ~ p2 + c1 + c2, p2 ~ p3 - c3 + c4, p3 ~ c5 + c6)
vcov(tree3b, raw = TRUE)
```

**NOTE**: The `vcov` method for a `treepcor` without the `theta` argument assumes $\gamma_j = 1$ and $\sigma_i = 1$. When `raw` is not provided (or `raw = FALSE`), the raw matrix is standardized and then the $\sigma_i$ are applied.

# The penalized complexity prior

To implement the prior proposed in @sterrantino2025graphical we use the `cgeneric` interface. First, let us define the model and its parameters, considering the two parents example.

```{r tree2c}
tree2
(np <- dim(tree2))
```

Define the standard deviation parameters for the variables of interest

```{r sigmas}
sigmas <- c(4, 2, 1, 0.5)
```

and the log of the parent variance parameters

```{r pparams}
thetal <- c(0, 1)
```

With these we can build the covariance:

```{r Vg}
theta1 <- c(log(sigmas), thetal)
Vg <- vcov(tree2, theta = theta1)
Vg
```

Now we can simulate some data:

```{r data2}
nrep <- 100
nd <- nrep * np[1]
xx <- matrix(rnorm(nd), nrep) %*% chol(Vg)
cov(xx)
theta.y <- log(10)
datar <- data.frame(
    r = rep(1:nrep, np[1]),
    i = rep(1:np[1], each = nrep),
    y = rnorm(nd, 1 + xx, exp(-2*theta.y))
)
```

We now use the `cgeneric` interface implemented by the \textbf{\textsf{INLAtools}} package, applying the `cgeneric` method to a `treepcor` object. The prior parameters are passed through the `lambda`, `sigma.prior.reference` and `sigma.prior.probability` arguments.

```{r cg2}
cmodel <- cgeneric(
    tree2,
    lambda = 5,
    sigma.prior.reference = rep(1, np[1]),
    sigma.prior.probability = rep(0.05, np[1]))
```

The following code chunks require INLA installed.
```{r haveINLA, echo = FALSE}
haveINLA <- FALSE
```

Perform some checks (INLA is required):

```{r ccheck, eval = haveINLA}
graph(cmodel)
initial(cmodel)
prior(cmodel, theta = rep(0, sum(np)))
prior(cmodel, theta = rep(1, sum(np)))
np
prec(cmodel, theta = rep(0, sum(np)))
(Qc <- prec(cmodel, theta = theta1))
all.equal(Vg, as.matrix(solve(Qc)))
```

Model fit (if INLA is installed):

```{r mfit, eval = haveINLA}
m1 <- y ~ f(i, model = cmodel, replicate = r)
pfix <- list(prec = list(initial = 10, fixed = TRUE))
fit <- inla(
    formula = m1,
    data = datar,
    control.family = list(hyper = pfix)
)
fit$cpu.used
```

Compare the true covariance, the observed covariance and the fitted covariance (from the posterior mode for theta):

```{r Vcompare, eval = haveINLA}
round(Vg, 2)
round(cov(xx), 2)
round(Vfit <- vcov(tree2, theta = fit$mode$theta), 2)
```

The fitted correlation (from the posterior mode):

```{r Cfit, eval = haveINLA}
round(Cfit <- vcov(tree2, theta = fit$mode$theta[np[1]+1:np[2]]), 2)
round(cov2cor(Vfit), 2)
```

# Model definition details

We now define the set of rules used to build the model. We assume a distribution for the main parent, $p_1$, and conditional distributions for each of the remaining $k-1$ parents and each of the $m$ children. The arrows represent these conditional distributions. Since we assume Gaussian distributions, we only need to define the (conditional) mean and variance.

* **1st rule**: For the main parent $p_1$, the mean is zero and the variance is $\gamma_1^2$.
* **2nd rule**: The conditional mean of $p_j$, $j=2,\ldots,k$, is its parent and the conditional variance is $\gamma_j^2$.
* **3rd rule**: The conditional mean of $c_i$, $i=1,\ldots,m$, is $s_i$ times its parent and the conditional variance is one, where $s_i$ is assumed known, either $-1$ or $+1$.

We thus have $k$ parameters, one for each parent. These conditional distributions build up a joint distribution with an implied covariance matrix.
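These rules can be checked numerically. The following base-R Monte Carlo sketch simulates graph (A) under the rules, with $\gamma_1^2 = 2$ chosen arbitrarily, and recovers the marginal moments stated above:

```r
## 1st rule: p1 ~ N(0, gamma_1^2)
## 3rd rule: c_i | p1 ~ N(s_i * p1, 1), with signs s = (1, 1, -1)
set.seed(2)                  # seed and sample size are illustrative choices
n <- 1e5
g1 <- 2                      # gamma_1^2, an arbitrary illustrative value
p1 <- rnorm(n, 0, sqrt(g1))
s <- c(1, 1, -1)
cc <- sapply(s, function(si) rnorm(n, si * p1, 1))
## diagonal ~ gamma_1^2 + 1 = 3, off-diagonal ~ s_i * s_j * gamma_1^2 = +/- 2
round(cov(cc), 2)
```

The diagonal is close to $\gamma_1^2 + 1$, the ancestor variance plus the unit conditional variance, while the off-diagonals are close to $s_i s_j \gamma_1^2$.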
Since we are primarily interested in the correlations, we scale this covariance matrix by the implied variances, obtaining a correlation matrix. The marginal variances for each child are assigned after this scaling. As we will usually have $k