--- title: "Use Case: Linear Regression with Outlier Correction and Bootstrap" author: "Andreas Brandmaier" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Use Case: Linear Regression with Outlier Correction and Bootstrap} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- This vignette demonstrates how to perform a robust linear regression analysis on the Boston housing data. We start by setting global chunk options and loading the required packages. ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r libs} library(reproducibleRchunks) library(MASS) ``` # Load Data We load the Boston housing data from the `MASS` package and display the first rows to get an overview of the variables. ```{reproducibleR loadData} boston <- MASS::Boston knitr::kable(head(boston)) ``` # Outlier Correction Next, we remove observations with extremely high or low values of `medv` using the IQR rule to reduce the influence of outliers. ```{reproducibleR outliers} iqr_medv <- IQR(boston$medv) q <- quantile(boston$medv, c(0.25, 0.75)) lower <- q[1] - 1.5 * iqr_medv upper <- q[2] + 1.5 * iqr_medv boston <- boston[boston$medv >= lower & boston$medv <= upper, ] ``` # Linear Regression We then fit a simple linear regression predicting median house value from the percentage of lower status population (`lstat`). ```{reproducibleR linreg} model <- lm(medv ~ lstat, data = boston) summary(model) ``` # Bootstrap Coefficients To assess the variability of the coefficients we perform a simple bootstrap. Because analyses based on random numbers lead to different results across runs, we set a random seed to make this step reproducible. See Brandmaier et al. for a detailed discussion of reproducibility and random numbers ([https://qcmb.psychopen.eu/index.php/qcmb/article/view/3763/3763.html](https://qcmb.psychopen.eu/index.php/qcmb/article/view/3763/3763.html)). ```{reproducibleR bootstrap} set.seed(123) boot_coefs <- replicate(1000, { idx <- sample(nrow(boston), replace = TRUE) coef(lm(medv ~ lstat, data = boston[idx, ])) }) boot_means <- rowMeans(boot_coefs) boot_means ``` # Plot Finally, we visualise the relation between `lstat` and `medv` together with the fitted regression line. Since plots do return objects, they are not tested for reproducibility. ```{reproducibleR plots} plot(boston$lstat, boston$medv, xlab = "Lower status (% population)", ylab = "Median value of homes", main = "Boston Housing Data") abline(model, col = "red") ```