--- title: "Parallel SEM Forests with the future Package" author: "Andreas M. Brandmaier" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Parallel SEM Forests with the future Package} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` This vignette shows how to run SEM forests in parallel using the [`future`](https://future.futureverse.org/) package. The future package provides a standardized way of evaluating R expressions asynchronously using the computing resources available to the user -- whether it is a single machine with multiple cores or a computing cluster. We build on the latent growth curve model from the *Getting Started* vignette and use a `future::plan()` to distribute trees across the available CPU cores of a local machine. ## Packages ```{r} library(semtree) library(OpenMx) library(future) ``` ## Simulate the latent growth curve data We reuse the linear latent growth curve example: each individual has five repeated measures (`X1` to `X5`) and one dichotomous predictor (`P1`) that influences the mean slope. Setting a seed keeps the simulation reproducible. ```{r} set.seed(23) N <- 1000 M <- 5 icept <- rnorm(N, 10, sd = 4) slope <- rnorm(N, 3, sd = 1.2) p1 <- sample(c(0, 1), size = N, replace = TRUE) loadings <- 0:4 x <- (slope + p1 * 5) %*% t(loadings) + matrix(rep(icept, each = M), byrow = TRUE, ncol = M) + rnorm(N * M, sd = .08) growth.data <- data.frame(x, factor(p1), factor(sample(c(0,1),N,TRUE))) names(growth.data) <- c(paste0("X", 1:M), "P1","Noise") ``` ## Specify and fit the SEM The model is identical to the one used in the introductory vignette: a two-factor latent growth curve with fixed factor loadings for the intercept and slope. ```{r} manifests <- names(growth.data)[1:5] growthCurveModel <- mxModel("Linear Growth Curve Model Path Specification", type="RAM", manifestVars=manifests, latentVars=c("intercept","slope"), mxData(growth.data, type="raw"), # residual variances mxPath( from=manifests, arrows=2, free=TRUE, values = c(.1, .1, .1, .1, .1), labels=c("residual","residual","residual","residual","residual") ), # latent variances and covariance mxPath( from=c("intercept","slope"), arrows=2, connect="unique.pairs", free=TRUE, values=c(2, 0, 1), labels=c("vari", "cov", "vars") ), # intercept loadings mxPath( from="intercept", to=manifests, arrows=1, free=FALSE, values=c(1, 1, 1, 1, 1) ), # slope loadings mxPath( from="slope", to=manifests, arrows=1, free=FALSE, values=c(0, 1, 2, 3, 4) ), # manifest means mxPath( from="one", to=manifests, arrows=1, free=FALSE, values=c(0, 0, 0, 0, 0) ), # latent means mxPath( from="one", to=c("intercept", "slope"), arrows=1, free=TRUE, values=c(1, 1), labels=c("meani", "means") ) ) # close model growthCurveModel <- mxRun(growthCurveModel) ``` ## Enable parallelism with `future` `semtree::semforest()` uses the `future.apply` infrastructure internally, so registering a parallel plan automatically parallelizes tree construction. Here we choose a conservative number of workers to keep vignette builds light-weight, that is, a maximum of two workers (if available), and we restore the previous plan when we're done. Please see the documentation of `future` for other plans, such as `sequential` for non-parallel execution or `cluster` for using a computing cluster. ```{r} old_plan <- future::plan() on.exit(future::plan(old_plan), add = TRUE) workers <- max(1, min(2, future::availableCores())) future::plan(future::multisession, workers = workers) ``` ## Grow the SEM forest in parallel With the plan in place, we can grow a forest. We keep the number of trees modest for the sake of a fast example; in practice, increase `num.trees` (e.g., 200--500) for stable variable importance estimates. ```{r runforest, message=FALSE, warning=FALSE, echo=FALSE} set.seed(99) control <- semforest_control(num.trees = 20, mtry = 1) forest <- semforest( model = growthCurveModel, data = growth.data, control = control ) ``` ## Inspect variable importance Because the latent growth data only include the grouping predictor `P1`, importance results confirm that this variable drives heterogeneity in the model parameters. Parallel evaluation speeds up the permutation step when multiple predictors are present. ```{r} vim <- varimp(forest) print(vim, sort.values = TRUE) plot(vim) ``` After the forest is built, `future::plan(old_plan)` (triggered via `on.exit`) returns the session to its prior parallel configuration.