--- title: "Model-based continuous summary tables in R" description: > Build model-based summary tables for continuous outcomes in R with table_continuous_lm(): estimated marginal means, robust / cluster-robust / bootstrap / jackknife standard errors, case weights, additive covariate adjustment (G-computation or equal-weight), four effect-size families (f2, Cohen's d, Hedges' g, Hays' omega2) with noncentral CIs, and APA-style output to console, gt, tinytable, flextable, Excel, Word, or clipboard. output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Model-based continuous summary tables in R} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) build_rich_tables <- identical(Sys.getenv("IN_PKGDOWN"), "true") source("_pkgdown-helpers.R") ``` ```{r setup} library(spicy) ``` `table_continuous_lm()` is the model-based companion to `table_continuous()`. It fits one linear model per selected continuous outcome -- `lm(outcome ~ by + covariate1 + covariate2 + ...)` if covariates are supplied, otherwise `lm(outcome ~ by)` -- and returns a compact reporting table. This is the right choice when you want to stay in a linear-model workflow with heteroskedasticity-consistent or cluster-robust standard errors, bootstrap / jackknife variance, case weights, additive covariate adjustment, or one of four effect-size families (`f2`, Cohen's *d*, Hedges' *g*, Hays' \(\omega^2\)) with noncentral CIs. ## Basic usage Use `select` for one or more continuous outcomes and `by` for the single predictor: ```{r basic} table_continuous_lm( sochealth, select = c(wellbeing_score, bmi, life_sat_health), by = sex ) ``` For categorical predictors, the table reports estimated means by level. When the predictor is dichotomous, it can also show a single mean difference and confidence interval. ## Robust standard errors Use `vcov = "HC*"` when you want heteroskedasticity-consistent standard errors and tests: ```{r robust} table_continuous_lm( sochealth, select = c(wellbeing_score, bmi), by = sex, vcov = "HC3", statistic = TRUE ) ``` The `HC*` family is computed via [sandwich::vcovHC()] and includes `"HC0"`, `"HC1"`, `"HC2"`, `"HC3"` (default for small / moderate samples), `"HC4"`, `"HC4m"`, and `"HC5"`. ## Cluster-robust standard errors When observations are not independent (repeated measurements per individual, students nested in classes, panel data), classical and `HC*` standard errors are biased downward. Use the `CR*` family together with `cluster = id_var` to get cluster-robust inference, dispatched to `clubSandwich`: ```{r cluster, eval = requireNamespace("clubSandwich", quietly = TRUE)} table_continuous_lm( sleep, select = extra, by = group, vcov = "CR2", cluster = ID, statistic = TRUE ) ``` `"CR2"` is the recommended default (Bell & McCaffrey 2002; Pustejovsky & Tipton 2018). It produces fractional Satterthwaite degrees of freedom, rendered in the displayed test header as e.g. `t(8.7)` or `F(2, 12.4)`. `"CR1"` matches Stata's `vce(cluster id)` default. ## Bootstrap and jackknife For situations where the residual distribution is non-standard or the sample is small, `vcov = "bootstrap"` and `vcov = "jackknife"` provide resampling-based variance estimators in pure base R (no dependency added): ```{r bootstrap, eval = FALSE} table_continuous_lm( sochealth, select = wellbeing_score, by = sex, vcov = "bootstrap", boot_n = 1000 # default ) ``` When `cluster` is supplied, bootstrap switches to a cluster bootstrap (Cameron, Gelbach & Miller 2008) and jackknife to leave-one-cluster-out (Quenouille 1956). Both estimators use asymptotic inference: `z` for single-coefficient contrasts and `chi^2(q)` for the global Wald test on `k > 2` categorical predictors, rendered in the displayed test header. ## Case weights Use `weights` when you want weighted estimated means or slopes in the same model-based table: ```{r weights} table_continuous_lm( sochealth, select = c(wellbeing_score, bmi), by = education, weights = weight, show_weighted_n = TRUE ) ``` This is often the most natural summary-table function when your reporting workflow already relies on weighted linear models. ## Numeric predictors If `by` is numeric, `table_continuous_lm()` reports the slope rather than group means: ```{r numeric-by} table_continuous_lm( sochealth, select = c(wellbeing_score, bmi), by = age, vcov = "HC3" ) ``` When you need the underlying returned data for further processing, use `output = "data.frame"` for the wide raw table or `output = "long"` for the analytic long table. ## Covariate adjustment When the comparison of `by`-group means is potentially confounded by other variables, supply them through `covariates`. Each per-outcome linear model becomes `lm(outcome ~ by + covariate1 + covariate2 + ...)`, and the reported `Δ` (categorical `by`) or slope (numeric `by`), its CI, *p*, R² and effect size are all **adjusted for** the covariates. The default ASCII print emits an APA-style footer naming the adjustment so the reader knows which estimand they are reading. ```{r adjustment-basic} table_continuous_lm( sochealth, select = c(wellbeing_score, bmi), by = sex, covariates = c(age, education), vcov = "HC3" ) ``` `covariates` accepts the full tidyselect vocabulary (`c(age, sex)`, `tidyselect::starts_with("life_sat_")`, `tidyselect::all_of(cov_names)`) or a literal character vector. A covariate that also appears in `select` is silently auto-excluded from the outcome list (mirroring how `by` / `weights` are excluded from `select`); a covariate equal to `by` is rejected up front. ### Two estimands for the adjusted marginal means Two methodologically distinct conventions exist for computing the covariate-adjusted per-group means displayed in the table (the `M (level)` columns of the printed output, and the `emmean` / `emmean_se` / `emmean_ci_lower` / `emmean_ci_upper` columns of the `output = "long"` data frame) when `by` is categorical. `table_continuous_lm()` exposes the choice via `adjustment`; both are valid but produce different numerical answers when at least one factor covariate has non-uniform observed proportions. * `adjustment = "proportional"` (the **default**) — G-computation / population-weighted standardisation. For each focal level of `by`, predictions are evaluated at every observation in the data with `by` replaced by that level (covariates kept at their observed values), then averaged. This is the convention of Stata's `margins` and `marginaleffects::avg_predictions(variables)`. Interpretation: *"What is the predicted mean in this population if everyone had `by = lvl`, holding their actual covariate profile?"*. Best for causal / counterfactual reporting. * `adjustment = "balanced"` — equal-weight averaging on a synthetic grid. Factor-covariate level combinations are crossed and weighted uniformly (`1 / k`); numeric covariates are fixed at their sample mean. This is the convention of `emmeans::emmeans()` (default), SPSS UNIANOVA EMMEANS, and SAS LSMEANS. Interpretation: *"What is the predicted mean assuming a balanced covariate design?"*. Best for descriptive / ANOVA-style APA reporting. The two methods coincide trivially when `covariates` is `NULL`, and also when all covariates are numeric or logical (no factor levels to expand over). They diverge only when categorical covariate proportions are skewed in the observed data. ```{r adjustment-balanced} # Same model, balanced (emmeans-style) marginal means: table_continuous_lm( sochealth, select = c(wellbeing_score, bmi), by = sex, covariates = c(age, education), adjustment = "balanced", vcov = "HC3" ) ``` ### Effect sizes under adjustment Adding covariates changes which effect sizes remain meaningful: * `effect_size = "f2"` and `"omega2"` are reported as **partial** *f²* / partial *ω²* via `stats::drop1()` restricted to the focal term — the correct generalisation of the bivariate effect size to a covariate-adjusted model. * `effect_size = "d"` and `"g"` raise `spicy_unsupported`: Cohen's *d* and Hedges' *g* have no defined extension under adjustment (the pooled SD has no canonical analogue). The error message points to `f2` / `omega2` as the partial-F generalisations. Both effect-size variants and their noncentral CIs (when `effect_size_ci = TRUE`) are computed from the same fitted model as the displayed coefficients and remain invariant to `vcov` (HC* / CR* / bootstrap / jackknife only affect the contrast SE). ### Scope: additive only in v1 `covariates` accepts additive terms only. Formula syntax with interactions or transforms (`covariates = ~ age * sex`, `covariates = ~ I(age^2)`) raises `spicy_unsupported` with a migration hint. Interactions and polynomials are reserved for a future spicy release. To fit a fully-specified model right now, do so with `lm()` directly and inspect via `summary()` or `table_regression()` (which accepts interaction / polynomial / `I()` terms natively). ## Effect sizes `table_continuous_lm()` supports four effect sizes via the `effect_size` argument. All are derived from the **same fitted model** as the displayed coefficients and `R²`, so the table stays internally consistent — and all are **invariant to `vcov`** (the choice of classical or `HC*` standard errors changes the contrast SE, CI, and test statistic but not the standardized effect size itself). Cohen's *d* and Hedges' *g* are the conventions for two-group comparisons (Cohen 1988; Hedges and Olkin 1985; APA 2020) and require `by` to have exactly two non-empty levels: ```{r es-d} table_continuous_lm( sochealth, select = c(wellbeing_score, bmi), by = smoking, effect_size = "d" ) ``` Hedges' *g* applies the small-sample correction `J = 1 - 3/(4·df_resid - 1)`. It is generally preferred over raw *d* in published reports (Goulet-Pelletier and Cousineau 2018): ```{r es-g} table_continuous_lm( sochealth, select = c(wellbeing_score, bmi), by = smoking, effect_size = "g" ) ``` For categorical predictors with three or more levels (or numeric predictors), use Hays' `omega²` for a less biased estimate of the population variance explained (Hays 1963; Olejnik and Algina 2003; Lakens 2013): ```{r es-omega2} table_continuous_lm( sochealth, select = c(wellbeing_score, bmi), by = education, effect_size = "omega2" ) ``` Cohen's `f²` (= `R² / (1 - R²)`) is the effect size familiar from power-analysis frameworks (e.g. G*Power) and is defined for any predictor type: ```{r es-f2} table_continuous_lm( sochealth, select = c(wellbeing_score, bmi), by = age, effect_size = "f2" ) ``` ## Confidence intervals for effect sizes Setting `effect_size_ci = TRUE` adds a confidence interval for the effect size. The implementation uses the modern noncentral-distribution inversion approach — the consensus standard in commercial statistical software (Stata `esize` / `estat esize`, SAS `PROC TTEST` and `PROC GLM EFFECTSIZE`) and in mainstream R packages (`effectsize`, `MOTE`, `TOSTER`, `effsize`): - Noncentral *t* inversion for `"d"` and `"g"` (Steiger and Fouladi 1997; Goulet-Pelletier and Cousineau 2018; Cousineau and Goulet-Pelletier 2021), which has nominal coverage across sample sizes — unlike the older Hedges-Olkin normal approximation that is biased for small samples. - Noncentral *F* inversion for `"omega2"` and `"f2"` (Steiger 2004; Smithson 2003). In the printed and rendered outputs, the effect size is displayed with the bracketed CI (article-ready): ```{r es-ci} table_continuous_lm( sochealth, select = c(wellbeing_score, bmi), by = smoking, effect_size = "g", effect_size_ci = TRUE ) ``` The CI level follows `ci_level` (default `0.95`). For programmatic access, the wide raw output exposes separate numeric columns, and the long output always includes them: ```{r es-ci-raw} table_continuous_lm( sochealth, select = wellbeing_score, by = smoking, effect_size = "g", effect_size_ci = TRUE, output = "data.frame" ) ``` ```{r es-ci-long} out <- table_continuous_lm( sochealth, select = wellbeing_score, by = smoking, effect_size = "g", effect_size_ci = TRUE, output = "long" ) out[, c("variable", "es_type", "es_value", "es_ci_lower", "es_ci_upper")] ``` When `weights` is supplied, all four effect sizes (and their CIs) are computed from the weighted least-squares fit, keeping them consistent with the weighted contrast and its CI (DuMouchel and Duncan 1983). This is the right convention for case-weighted reporting; for propensity-score balance assessment (Austin and Stuart 2015) or complex-survey designs, dedicated packages (`cobalt::bal.tab()` and `survey`) are more appropriate. ## Article-style polish Pretty outcome labels and a comma decimal separator (useful for European reporting): ```{r polish} table_continuous_lm( sochealth, select = c(wellbeing_score, bmi), by = sex, labels = c( wellbeing_score = "WHO-5 wellbeing (0-100)", bmi = "Body-mass index (kg/m²)" ), effect_size = "g", effect_size_ci = TRUE, decimal_mark = "," ) ``` ## Publication-ready output The function supports the same output formats as the other summary-table helpers, including `tinytable`, `gt`, `flextable`, `excel`, `word`, and `clipboard`. ```{r gt-output, eval = build_rich_tables} pkgdown_dark_gt( table_continuous_lm( sochealth, select = c(wellbeing_score, bmi, life_sat_health), by = sex, vcov = "HC3", statistic = TRUE, output = "gt" ) ) ``` ## Tidying for downstream pipelines `table_continuous_lm()` returns an object that can be coerced to a plain `data.frame` / `tbl_df` (stripping the spicy formatting attributes) or piped into `broom::tidy()` / `broom::glance()` for any downstream tidyverse-stats workflow: ```{r tidy-glance} out <- table_continuous_lm( sochealth, select = c(wellbeing_score, bmi), by = sex, effect_size = "g", effect_size_ci = TRUE ) # One row per estimated parameter: emmean per level, contrast for # binary predictors, slope for numeric predictors. broom::tidy(out) # One row per outcome with model-level statistics: r.squared, # adj.r.squared, F / t, df, p.value, nobs, weighted_n, plus the # effect-size summary. broom::glance(out) ``` ## See also - See `vignette("table-continuous", package = "spicy")` for descriptive continuous summary tables with classical group-comparison tests. - See `vignette("summary-tables-reporting", package = "spicy")` for a cross-function reporting workflow using the summary-table helpers.