%% LyX 2.3.6 created this file. For more info, see http://www.lyx.org/. %% Do not edit unless you really know what you are doing. \documentclass[american,noae]{scrartcl} \usepackage{lmodern} \renewcommand{\sfdefault}{lmss} \renewcommand{\ttdefault}{lmtt} \usepackage[T1]{fontenc} \usepackage[utf8]{inputenc} \usepackage{geometry} \geometry{verbose,tmargin=1in,bmargin=1in,lmargin=1in,rmargin=1in} \usepackage{float} \usepackage{url} \usepackage{amsmath} \usepackage[authoryear]{natbib} \makeatletter %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Textclass specific LaTeX commands. <>= if(exists(".orig.enc")) options(encoding = .orig.enc) @ \providecommand*{\code}[1]{\texttt{#1}} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% User specified LaTeX commands. %\VignetteIndexEntry{Using rockchalk} \usepackage{Sweavel} \usepackage{graphicx} \usepackage{color} \usepackage[samesize]{cancel} \usepackage{ifthen} \makeatletter \renewenvironment{figure}[1][]{% \ifthenelse{\equal{#1}{}}{% \@float{figure} }{% \@float{figure}[#1]% }% \centering }{% \end@float } \renewenvironment{table}[1][]{% \ifthenelse{\equal{#1}{}}{% \@float{table} }{% \@float{table}[#1]% }% \centering % \setlength{\@tempdima}{\abovecaptionskip}% % \setlength{\abovecaptionskip}{\belowcaptionskip}% % \setlength{\belowcaptionskip}{\@tempdima}% }{% \end@float } %\usepackage{listings} % Make ordinary listings look as if they come from Sweave \lstset{tabsize=2, breaklines=true, %,style=Rstyle} fancyvrb=false,escapechar=`,language=R,% %%basicstyle={\Rcolor\Sweavesize},% backgroundcolor=\Rbackground,% showstringspaces=false,% keywordstyle=\Rcolor,% commentstyle={\Rcommentcolor\ttfamily\itshape},% literate={<<-}{{$\twoheadleftarrow$}}2{~}{{$\sim$}}1{<=}{{$\leq$}}2{>=}{{$\geq$}}2{^}{{$^{\scriptstyle\wedge}$}}1{==}{{=\,=}}1,% alsoother={$},% alsoletter={.<-},% otherkeywords={!,!=,~,$,*,\&,\%/\%,\%*\%,\%\%,<-,<<-,/},% escapeinside={(*}{*)}}% % In document Latex options: \fvset{listparameters={\setlength{\topsep}{0em}}} \def\Sweavesize{\scriptsize} \def\Rcolor{\color{black}} \def\Rbackground{\color[gray]{0.90}} \usepackage{dcolumn} \usepackage{siunitx} \makeatother \usepackage{babel} \usepackage{listings} \renewcommand{\lstlistingname}{\inputencoding{latin9}Listing} \begin{document} \title{Using rockchalk for Regression Analysis\thanks{In case you want to be involved in development, the source code is available on GitHub. Please browse http://github.com/pauljohn32. The read-only git archive address is \protect\url{git://github.com/pauljohn32/rockchalk.git}. Once you learn how to ``clone'' the repository with a git client, then we can discuss ways for you to contribute.}} \author{Paul E. Johnson} \maketitle \begin{abstract} The rockchalk package includes functions for estimating regressions, extracting diagnostic information from them, preparing presentable tables, and creating helpful two and three dimensional plots. It is primarily intended to facilitate teachers and students who are conducting ordinary least squares and elementary generalized models. Compared to other packages that help with regression analysis, this one offers more-than-the-usual amount of emphasis on the analysis of regression models that include interaction terms. It includes functions that can receive a fitted model and then return standardized, mean-centered, and residual-centered regression results. The plotting functions address some limitations of R's \code{termplot} function. 
Version 1.8 introduces a consistent framework for the creation of ``\code{newdata}'' objects and the calculation of predicted values so that the marginal effects of separate predictors can be easily explored.
\end{abstract}

% In document Latex options:
\fvset{listparameters={\setlength{\topsep}{0em}}}
\SweaveOpts{ae=F,nogin=T}

<>=
options(device = pdf)
options(width=80, prompt=" ", continue=" ")
options(useFancyQuotes = FALSE)
@

\section{Introduction}

This is the Spring 2019 update of the rockchalk package. I offer a course in regression analysis for social and behavioral scientists every year. As the course goes on, I keep track of the difficulties that the students experience with R and I craft functions that facilitate their work on particular assignments. The new features in this edition are the functions \code{descriptiveTable()} and \code{waldt()} (for regression reports), tools for regression simulation, including \code{genCorrelatedData3()} and \code{cutFancy()}, and some support for common recoding tasks that arise in multi-level modeling. Finally, the regression table-writing function \code{outreg()} can now generate decimal-aligned columns. A new vignette demonstrating the \code{outreg()} function is introduced with the package.

As in previous versions, the core of the effort is the development of an internally coherent framework that allows students to estimate regressions (usually with lm and glm) and then create useful summary tables and diagnostic reports. It is easy to create ``newdata'' objects and interact with fitted regressions. The \code{newdata()} function will be handy for just about any kind of regression. For the common kinds of regression in base R and lme4, there is also a convenience function called \code{predictOMatic()} that allows one to specify a regression model and then receive predicted values for a range of input values. The default output will be similar to other regression software tools that provide ``marginal effect'' estimates for regression analysis.

I have exerted a great deal of effort to improve the consistency of predicted value plots for regression analysis. The \code{plotSlopes()} function is a plotter for linear models with interaction terms. It draws a plot on the screen, but it also creates an output object that can be subjected to a follow-up hypothesis test (by the \code{testSlopes()} function). There is a plot method for testSlopes objects to help with analysis of interactions. For people who want to plot nonlinear regressions, the function \code{plotCurves()} is a replacement for \code{plotSlopes()}. It can handle any sort of nonlinearity and interaction on the right hand side. Where some regression formulae can cause R's \code{termplot()} function to fail, \code{plotCurves()} will succeed.

The core functionality of many functions has not been altered. Functions to provide standardized, mean-centered, or residual-centered regression coefficients have not changed much, although the plots of them are improved.

\section{Facilitating Collection of Summary Information}

\subsection{summarize: A replacement for summary}

When an R function provides output that is not suitable for a term paper, the student must cobble together some code and assemble a customized table. In my opinion, R's summary() function for data frames is not adequate. It lacks summary information about the diversity of the observations. In addition, the output is not formatted in a way that is conducive to the creation of plots or tables.
As a result, I offer the function summarize(), which separates the numeric from the factor variables and provides an alphabetized summary.

<>=
library(rockchalk)
@

Consider the results of applying summarize() to Fox's Chile data set from the carData package:

<>=
library(carData)
data(Chile)
(summChile <- summarize(Chile))
@

The result object is a list that includes a data frame for the numeric variables, with named rows and (optionally) alphabetized columns, as well as a separate report for each factor variable. Users who wish to summarize only the numeric variables can run summarizeNumerics() instead, while others who want to summarize only factors can run summarizeFactors(). The output from summarizeFactors() is a list of factor summaries. A companion function is centralValues(), which will provide only one number for each variable in a data frame. For numerics, it returns the mean, while for factor variables, it returns the mode.

<>=
centralValues(Chile)
@

\subsection{Easier predictions and newdata objects}

Students struggle with R's predict() methods. The primary challenge is the creation of a newdata object of interesting values of the predictors. If the newdata argument, here myNewDF in \inputencoding{latin9}\lstinline!predict(m1, newdata = myNewDF)!\inputencoding{utf8}, is not exactly right, the command will fail. If the regression formula uses functions like ``as.numeric()'' or ``as.factor()'', it is almost impossible for first-time R users to get this right. In version 1.8, I believe I have solved the problem entirely with the functions newdata() and predictOMatic().

Let's fit an example regression with the Chile data set from the carData package.

<>=
m1 <- lm(statusquo ~ age + income + population + region + sex, data = Chile)
@

\noindent The default output of predictOMatic() will cycle through the predictors, one by one.

<>=
m1pred <- predictOMatic(m1)
m1pred
@

The newdata() and predictOMatic() functions handle the details and allow the user to adjust their requests for very fine-grained control. Here are the key arguments for these functions.
\begin{enumerate}
\item divider. The name of an algorithm to select among observed values for which predictions are to be calculated. These are the possibilities.
\begin{description}
\item [{``seq''}] an evenly spaced sequence of values from low to high across the variable.
\item [{``quantile''}] quantile values that emanate from the center of the variable ``outward''.
\item [{``std.dev.''}] the mean plus or minus the standard deviation, or 2 standard deviations, and so forth.
\item [{``table''}] when a variable has only a small set of possible values, this selects the most frequently observed values.
\end{description}
\item n. The number of values to be selected.
\item predVals. Where the divider argument sets the default algorithm to be used, the predVals argument can choose variables for focus and select divider algorithms separately for those variables. Particular values may also be declared.
\end{enumerate}
The user can request that particular values of the predictors are used, or can declare one of several algorithms for the selection of focal values. All of these details are managed by the argument called predVals, which is described in the help pages (with plenty of examples).
<>=
mypred2 <- predictOMatic(m1, predVals = c("age", "region"), n = 3)
mypred2
mypred3 <- predictOMatic(m1, predVals = c(age = "std.dev.", region = "table"), n = 3)
mypred3
mypred4 <- predictOMatic(m1, predVals = list(age = c(18, 30, 45), region = c("SA", "C", "N")), n = 3)
mypred4
@

I've invested quite a bit of effort to make sure this works dependably with complicated regression formulae. All formulae, such as y \textasciitilde{} x1 + log(x2 + alpha) + poly(x3, d), will just work. (In case one is interested to know \emph{why} this works, the secret recipe is a new function called model.data(). Under the hood, this required some hard work, a frustrating chain of trial and error that is discussed in the vignette \emph{Rchaeology}, which is distributed with this package).

The \code{predictOMatic()} function will work when the regression model provides a predict method that can handle the newdata argument. If the regression package does not provide such a method, the user should step back and use the newdata function to create the exemplar predictor data frames and then calculate predicted values manually. The newdata() function will set all variables to center values and then it will create a ``mix and match'' combination of the ones that the user asks for. This sequence shows ways to ask for various ``mix and match'' combinations of values from age and region.

<>=
mynewdf <- newdata(m1, predVals = c("age", "region"), n = 3)
mynewdf
@

<>=
mynewdf2 <- newdata(m1, predVals = list(age = "std.dev.", region = c("SA", "C", "N")))
mynewdf2
@

<>=
mynewdf3 <- newdata(m1, predVals = list(age = c(20, 30, 40), region = c("SA", "C", "N")))
mynewdf3
@\\
Of course, functions from rockchalk or any other R package can be placed into the process for choosing focal values.

<>=
mynewdf <- newdata(m1, predVals = list(age = getFocal(Chile$age, n = 3), region = getFocal(Chile$region, n = 3)))
mynewdf
@\\
The function getFocal() is a generic function; it will receive variables of different types and ``do the right thing.'' By default, focal values of numeric variables are quantile values. For factor variables, the most frequently observed values are selected. These are customizable, as explained in the documentation. The \code{newdata()} output can be used in a predict() function call as demonstrated above.

It would be nice if every regression model's predicted values were accompanied by 95\% confidence intervals. Models fit by \code{lm()} can supply confidence intervals, but models fit by \code{glm()} cannot. At the current time, there are many competing methods that might be used to calculate those intervals; predict.glm() in R's stats package avoids the issue entirely by not calculating intervals. In rockchalk-1.8, I crafted some code to calculate confidence intervals for glm objects using the (admittedly crude) Wald-based approximation. On the scale of the linear predictor, we calculate a 95\% confidence interval, and then use the inverse link function to transform it onto the scale of the observed response. The sketch below shows the calculation; after that, I replicate an example that is an R classic, from the help page of predict.glm. The reader will note that the output includes a warning about the construction of the confidence interval.
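The calculation just described looks roughly like the following sketch (not executed here; \code{fit} stands for any fitted glm object and \code{mynewdf} for a newdata frame built, for example, by newdata()):
\inputencoding{latin9}
\begin{lstlisting}
## Wald-type interval on the linear predictor scale,
## transformed back through the inverse link function
pr <- predict(fit, newdata = mynewdf, type = "link", se.fit = TRUE)
z <- qnorm(0.975)
cbind(fit = fit$family$linkinv(pr$fit),
      lwr = fit$family$linkinv(pr$fit - z * pr$se.fit),
      upr = fit$family$linkinv(pr$fit + z * pr$se.fit))
\end{lstlisting}
\inputencoding{utf8}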
<>=
df <- data.frame(ldose = rep(0:5, 2),
                 sex = factor(rep(c("M", "F"), c(6, 6))),
                 SF.numdead = c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16))
df$SF.numalive <- 20 - df$SF.numdead
budworm.lg <- glm(cbind(SF.numdead, SF.numalive) ~ sex*ldose, data = df, family = binomial)
predictOMatic(budworm.lg, predVals = c(ldose = "std.dev.", sex = "table"), interval = "confidence")
@

\subsection{Descriptive Tables}

In regression-based reports, it is common for authors to present a brief summary table to describe the data. The \code{descriptiveTable()} function was created to make that process easier and more predictable. The aim is to provide the usual summary statistics for the numeric predictors and a table of proportions for the categorical predictors. See Table \ref{tab:descriptiveTable}. The user can customize the requested statistical summaries for numeric predictors.

\begin{table}[H]
\caption{The descriptiveTable function\label{tab:descriptiveTable}}
<<>>=
descriptiveTable(m1, digits=6)
@
\end{table}

\noindent In my opinion, the output is, well, simple and understandable. More importantly, it will be sufficient for a term paper, once it is converted into a presentable final format. It can be converted into HTML or LaTeX with various R packages, such as xtable. Note that because this descriptive table is derived from the fitted regression, it summarizes \emph{only the cases} that were actually used in the regression analysis. If cases were lost by listwise deletion, the deleted cases are not taken into account here.

\section{Better Regression Tables: Some outreg Examples}

On May 8, 2006, Dave Armstrong, who was a political science PhD student at the University of Maryland, posted a code snippet in r-help that demonstrated one way to use the ``cat'' function from R to write \LaTeX{} markup. That gave me the idea to write a \LaTeX{} output scheme that would help create some nice-looking term and research papers. I wanted ``just enough'' information, but not too much. Since 2006, many new R packages have been introduced for the creation of regression tables, but I still prefer to maintain \code{outreg}.

I fight to keep this simple, but have added some features in response to user requests. For me, the biggest question is whether the format should be ``wide'' or ``tight''. A tight format stacks estimates and standard errors into a single column, while a wide format has them in separate columns. In response to user requests, stars for statistical significance were made customizable. It is now possible to insert vectors of parameter estimates and standard errors directly. In this version, a new parameter \code{centering} is introduced for decimal-centered columns.

In the following, I will demonstrate some tables for lm and glm fits on a simulated data set. This new simulation function, genCorrelatedData3(), is a convenient way to create multiple-predictor regression data sets of arbitrary size and complexity, allowing for interactions and nonlinearity.

<>=
set.seed(1234)
dat <- genCorrelatedData3(N = 100, means = c(x1 = 0, x2 = 10, x3 = 0),
                          sds = c(1, 2, 1), rho = c(0, 0, 0), stde = 10,
                          beta = c(0, -3, 4, 0), verbose = FALSE)
@\\
That creates a new data frame with variables $x1$, $x2$, $x3$, and $y$. We run some linear regressions and then create a categorical output variable for a logistic regression.
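The categorical variant is generated by passing the standardized value of $y$ through the inverse logit function, so that the probability of observing $y3_{i}=1$ rises with $y_{i}$:
\begin{equation}
Pr(y3_{i}=1)=\frac{exp(z_{i})}{1+exp(z_{i})},\qquad z_{i}=\frac{y_{i}-\overline{y}}{std.dev.(y)}
\end{equation}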
<>=
m1 <- lm(y ~ x1 + x2, data = dat)
m2 <- lm(y ~ x2, data = dat)
m3 <- lm(y ~ x1 + x2 + x3, data = dat)
## Create categorical variant for a logistic regression
myilogit <- function(x) exp(x)/(1 + exp(x))
dat$y3 <- rbinom(100, size = 1, prob = myilogit(scale(dat$y)))
gm1 <- glm(y3 ~ x1 + x2, data = dat, family = binomial)
@

The outreg examples are offered in Tables \ref{tab:Tab1} through \ref{tab:Combined-OLSGLM}. Table \ref{tab:Tab1} is the ``tight'' format output for three models, obtained from \code{outreg(list(m1, m2, m3), centering=\textquotedbl siunitx\textquotedbl )}. On the other hand, as illustrated in Table \ref{tab:Tab2}, the author can request a wide table (tight = FALSE), provide more elaborate model labels, and adjust the significance stars. In Table \ref{tab:Combined-OLSGLM}, observe that the linear and generalized linear model output peacefully co-exist, side-by-side. This output is, in my opinion, completely acceptable for inclusion in a professional presentation or conference paper.

There are some warts in this output. The default output from \code{outreg} will have left-aligned columns. Because users requested decimal-centered columns, the argument centering has been introduced. If \code{centering=\textquotedbl siunitx\textquotedbl} or \code{centering=TRUE}, the \code{outreg} table will be adjusted to use features in the LaTeX package ``\code{siunitx}''. The LaTeX package dcolumn is probably more familiar to users, and \code{\textquotedbl dcolumn\textquotedbl} is also a legal value for centering, but the results are not quite as good. As a \textbf{caution}, I hasten to mention that if a user asks for decimal-centered columns, the user has the duty to insert into the document preamble either ``\code{\textbackslash usepackage\{siunitx\}}'' or ``\code{\textbackslash usepackage\{dcolumn\}}''.

\begin{table}
\caption{Decimal-centered columns in the ``tight'' style\label{tab:Tab1}}
<>=
or <- outreg(list(m1, m2, m3), centering = "siunitx")
@
\end{table}

\begin{table}
\caption{The wide format obtained with tight = FALSE\label{tab:Tab2}}
<>=
or1 <- outreg(list("The First Model with a Long Title" = m1, "Another Model" = m3),
              tight = FALSE, alpha = c(0.05, 0.01, 0.001), centering = "siunitx")
@
\end{table}

\begin{table}
\caption{Combined OLS and GLM Estimates\label{tab:Combined-OLSGLM}}
<>=
or2 <- outreg(list(m1, gm1), modelLabels = c("OLS: y", "GLM: Categorized y"))
@
\end{table}

I understand that some authors need to include regression tables in documents that are not prepared with LaTeX. The parameter type = ``HTML'' will change the output format to HTML. An HTML file can be displayed in a web browser and it can be imported into traditional ``word processor'' programs.

\section{Plotting Regressions with Interactions}

\subsection{Interaction in Linear Regression}

One of the most fundamental skills in regression analysis is the interpretation of interactive predictors. It is much easier for students to understand the effect of an interaction if they can create a nice plot to show how the predicted value depends on various values of a predictor. The \code{plotSlopes()} function was introduced in 2010 when I was teaching a large first-year graduate course (more than 50 students) and it became apparent that about 20 percent of them would not be able to manage the R coding required to draw several lines on a single plot. Unfortunately, R's termplot() function will not draw regressions involving interactions. The rockchalk package has two functions to help with this, \code{plotSlopes()} and \code{plotCurves()}.
\code{plotCurves()} is more general; it can handle any kind of formula that the user estimates. \code{plotSlopes()} is more limited; it is only for lm objects. In return for that limitation, \code{plotSlopes()} creates an output object that can be used to conduct post-hoc hypothesis tests.

\begin{figure}
<>=
m1ps <- plotSlopes(m1, plotx = "x2", xlab = "x2 from model m1",
                   interval = "confidence", opacity = 80, col = "red",
                   ylim = c(-70, 15), legendArgs = list(x = "topright"))
@
\includegraphics[height=4in]{rockchalk-ps05}
\caption{plotSlopes: Linear Model with Confidence Interval\label{fig:ps05}}
\end{figure}

At its most elementary level, \code{plotSlopes()} is a ``one step'' regression line plotter. If the regression model includes more than one predictor, then a single predictor is displayed on the horizontal axis and the other predictors are set at their central values. A plot for the model m1, which was illustrated above, is presented in Figure \ref{fig:ps05}. In rockchalk-1.8, new arguments were added to allow the ``see-through'' confidence region. The command to generate Figure \ref{fig:ps05} was

<>=
<>
@

\noindent I've adjusted the color and opacity to illustrate the usage of those arguments. The y range is adjusted to make a little extra room for the legend. The \code{plotSlopes()} function is very flexible. All of the label, color, and scale arguments of a plot function are also available. The plotSlopes function also works well if the moderator is a categorical variable.

It is important to note that the output object, m1ps, has the information necessary to re-create the plotted line in the form of a \code{newdata} data frame. The first few lines in the \code{newdata} object are

<<>>=
m1ps$newdata[1:3, ]
@

<>=
dat$y4 <- 1 + 0.1 * dat$x1 - 6.9 * dat$x2 + 0.5 * dat$x1*dat$x2 +
    0.2 * dat$x3 + rnorm(100, m = 0, sd = 10)
m4 <- lm(y4 ~ x1*x2 + x3, data = dat)
@

\begin{figure}
<>=
par(mfcol=c(2,1))
m4psa <- plotSlopes(m4, plotx = "x1", modx = "x2", xlab = "x1 is a fun plotx")
m4psb <- plotSlopes(m4, plotx = "x2", modx = "x1", modxVals = "std.dev.",
                    xlab = "x2 is plotx", ylim = c(-120, 15),
                    legendArgs = list(x = "topright"))
par(mfcol=c(1,1))
@
\caption{plotSlopes Illustrated\label{fig:ps10}}
\end{figure}

This is more interesting if we fit a regression with an interaction term, such as
\inputencoding{latin9}
\begin{lstlisting}
m4 <- lm(y4 ~ x1*x2 + x3, data = dat)
\end{lstlisting}
\inputencoding{utf8}

\noindent We then ask \code{plotSlopes} to draw the predicted values using one numeric variable as the horizontal axis, while another variable (a moderator) is set at particular values. Either x1 or x2 can be viewed as the ``moderator'' predictor, the one on which the effect of the other depends. In rockchalk version 1.8, the selection of values of the moderator was generalized, so that the user can specify either a function that selects values, or a vector of values, or the name of an algorithm. The default algorithm will choose quantile values, but Figure \ref{fig:ps10} also demonstrates the ``std.dev.'' divider algorithm. The code to produce that figure was

<>=
<>
@

When \code{modx} is a numeric variable, then some particular values must be selected for the calculation of predicted value lines. The \code{modxVals} argument is used to either specify moderator values or an algorithm to select focal values. By default, three values of \code{modx} are selected (the quantiles 25\%, 50\%, and 75\%).
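Instead of relying on a divider algorithm, the user can supply moderator values directly. A small sketch (not executed here; the values 8, 10, and 12 are hypothetical choices near the mean of x2):
\inputencoding{latin9}
\begin{lstlisting}
## supply explicit moderator values rather than an algorithm name
m4psc <- plotSlopes(m4, plotx = "x1", modx = "x2",
          modxVals = c(8, 10, 12), xlab = "x1 is plotx")
\end{lstlisting}
\inputencoding{utf8}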
If \code{modx} is a factor variable, then the most frequently observed scores will be selected for consideration. The default display will include the regression line as well as color-coded points for the subgroups represented by values of the moderator.

\noindent
\begin{figure}
<>=
fourCat <- gl(4, 25, labels = c("East", "West", "South", "Midwest"))
dat$x4 <- sample(fourCat, 100, replace = TRUE)
dat$y5 <- 1 + 0.1 * dat$x1 + contrasts(dat$x4)[dat$x4, ] %*% c(-1, 1, 2) +
    rnorm(100, 0, sd = 10)
m5 <- lm(y5 ~ x1*x4 + x3, data = dat)
@
<>=
m5psa <- plotSlopes(m5, plotx = "x1", modx = "x4",
                    xlab = "x1 is a Continuous Predictor",
                    xlim = magRange(dat$x1, c(1.2, 1)))
@
\caption{plotSlopes with a Categorical Moderator\label{fig:ps20}}
\end{figure}

\noindent
\begin{figure}
<>=
m5psb <- plotSlopes(m5, plotx = "x1", modx = "x4", modxVals = c("West", "East"),
                    xlab = "x1 is a Continuous Predictor",
                    xlim = magRange(dat$x1, c(1.2, 1)), interval = "conf")
@
\caption{plotSlopes: the interval argument\label{fig:ps21}}
\end{figure}

Suppose we have a four-valued categorical variable with the values ``West'', ``Midwest'', ``South'', and ``East''. If that variable is used in an interaction in the regression model, then the \code{plotSlopes} output will include four lines, one for each region. For example, consider Figure \ref{fig:ps20}, which is created by

<>=
<>
@

\noindent The categorical variable is x4. It is possible to superimpose confidence intervals for many subgroups, but sometimes these plots start to look a little bit ``busy''. The mixing of shades in overlapping intervals may help with that problem. A plot that focuses on just two subgroups is presented in Figure \ref{fig:ps21}, which is produced by

<>=
<>
@

In rockchalk version 1.8, I've exerted quite a bit of effort to make sure that colors are chosen consistently when users remove or insert groups in these plots. The same value of the moderator should always be plotted in the same way: the line, points, and interval colors should not change. Note, for example, in Figures \ref{fig:ps20} and \ref{fig:ps21}, the line for East is black in both plots, while the line for West is red in both.

\subsection{testSlopes, a companion of plotSlopes}

Students in psychology and political science are usually interested in conducting some diagnostic analysis of the interactive terms. \citet{Aiken1991} (and later \citealp{Cohen2002}) propose using the t test to find out if the effect of the ``plotx'' variable is statistically significantly different from zero for each particular value of the moderator variable. The new version of rockchalk declares a method plot.testSlopes that handles the work of plotting the interaction. The usual case would be the following: first, carry out the regression analysis; then run plotSlopes and testSlopes; and then pass the result object to the plot method.

\begin{figure}
<>=
m4psats <- testSlopes(m4psa)
plot(m4psats)
@
\caption{testSlopes for an Interactive Model\label{fig:ts10}}
\end{figure}

\inputencoding{latin9}
\begin{lstlisting}
m4 <- lm(y4 ~ x1*x2 + x3, data = dat)
m4ps <- plotSlopes(m4, plotx = "x1", modx = "x2",
         xlab = "x1 is a Continuous Predictor")
m4psats <- testSlopes(m4ps)
plot(m4psats)
\end{lstlisting}
\inputencoding{utf8}

\noindent The output from testSlopes will differ, depending on whether modx is numeric or a factor. If it is a factor, then the slope of the lines and the associated t test for each will be presented. My psychology students call these ``simple slopes'', following the terminology of Aiken and West.
The general idea is that we want to know if the ``combined'' effect of plotx is not zero. For a model stated with predictors $plotx_{i}$ and $modx_{i}$ as
\begin{equation}
y_{i}=b_{0}+b_{plotx}plotx_{i}+b_{modx}modx_{i}+b_{plotx:modx}plotx_{i}\cdot modx_{i}+\ldots+e_{i}
\end{equation}
\noindent the null hypothesis would be
\noindent
\begin{equation}
H_{0}:0=\hat{b}_{simple\,slope}=\hat{b}_{plotx}+\hat{b}_{plotx:modx}modx
\end{equation}
\noindent If modx is a factor, then we simply calculate the slope of each line and the test is straightforward. If modx is a numeric variable, then we confront a problem that is a bit more interesting. We don't really want to say that the simple slope is different from 0 for particular values of $modx$, but instead we want to answer the question, ``for which values of the moderator would the effect of the plotx variable be statistically significant?''. This necessitates the calculation of the so-called Johnson-Neyman interval (\citeyear{JohnsonNeyman1936}), a plot of which is presented in Figure \ref{fig:ts10}. The method of calculation is outlined in \citet{PreacherCurren2006}.

The values of $modx$ associated with a statistically significant effect of $plotx$ on the outcome are determined from the computation of a T statistic for $\hat{b}_{simple\,slope}$. The J-N interval is the set of values of $modx$ for which the following (quadratic equation) holds:
\begin{align}
\hat{t} & =\frac{\hat{b}_{simple\,slope}}{std.err(\hat{b}_{simple\,slope})}\nonumber \\
 & =\frac{\hat{b}_{simple\,slope}}{\sqrt{\widehat{Var(\hat{b}_{plotx})}+modx^{2}\widehat{Var(\hat{b}_{plotx\cdot modx})}+2modx\widehat{Cov(\hat{b}_{plotx},\hat{b}_{plotx\cdot modx})}}}\geq T_{\frac{\alpha}{2},df}
\end{align}

\noindent Suppose there are two real roots, $root1$ and $root2$. The values of $modx$ for which the slope is statistically significant may be a compact interval, $[root1,root2]$, as demonstrated in Figure \ref{fig:ts10}, or it may be two open intervals, $(-\infty,root1]$ and $[root2,\infty)$. I had expected almost all applications to result in that latter case, but the somewhat surprising case illustrated in Figure \ref{fig:ts10} is not too infrequently observed.

\subsection{plotCurves for nonlinear predictor formulae}

In the most recent revision of this package, I believe I have made the plotCurves() function redundant. Before I delete the function entirely, I'm waiting for feedback. There was a problem in the past: some students used plotSlopes() (which drew straight lines) when they intended to draw curves (and should have used plotCurves()).

plotCurves() was developed to generalize the plotting capability for regression formulas that include nonlinear transformations. It was for models that have polynomials or terms that are logged (or otherwise transformed). In that sense, plotCurves() is rather similar to R's own termplot() function, but plotCurves() has two major advantages. First, it allows interactions, and second, it handles some complicated formulae that termplot() is not able to manage.

Suppose a dependent variable y5 is created according to a nonlinear process.
\begin{equation}
y5_{i}=-3x1_{i}+15\,log(0.1+x2_{i}-min(x2))+1.1x2_{i}+8.2\,x1_{i}\times x2_{i}+e_{i}
\end{equation}

<>=
dat$y5 <- with(dat, -3*x1 + 15*log(0.1 + x2 - min(x2)) + 1.1*x2 +
               8.2 *x1 * x2 + 10*rnorm(100))
@

The estimated model is

<<>>=
m5 <- lm(y5 ~ log(x2) + x1 * x2, data = dat)
@

In the new version, the function calls for plotSlopes() and plotCurves() are the same:

<>=
m5pc <- plotCurves(m5, plotx = "x2", modx = "x1", main = "plotCurves output",
                   ylim = c(-210, 330),
                   legendArgs = list(x = "topleft", title = "x1", cex = 0.8))
@

<>=
m5ps <- plotSlopes(m5, plotx = "x2", modx = "x1", main = "plotSlopes output",
                   ylim = c(-210, 330),
                   legendArgs = list(x = "topleft", title = "x1", cex = 0.8))
@

So far as I have been able to tell, the results are the same. See Figure \ref{fig:pcps20}.

\begin{figure}
<>=
<>
@
<>=
<>
@
\caption{Illustrations of m5 with plotSlopes and plotCurves\label{fig:pcps20}}
\end{figure}

\subsection{plotPlane}

The persp() function in R works well, but its interface is too complicated for most elementary and intermediate R users. To facilitate its use for regression users, the plotPlane() function is offered. The plotPlane function offers a visualization of the mutual effect of two predictors, whether or not the regression model is linear. plotPlane() is designed to work like plotCurves(), to tolerate nonlinear components in the regression formula. plotPlane() allows the depiction of a 3-dimensional curving plane that ``sits'' in the cloud of data points. The variables that are not explicitly pictured in the plotPlane() figure are set to central reference values.

Recall model m4, which used the formula \inputencoding{latin9}\lstinline!y4 ~ x1*x2 + x3!\inputencoding{utf8}. As illustrated in Figure \ref{fig:pp100}, plotPlane() presents a reasonable view of the predicted values.

\begin{figure}
<>=
p100 <- plotPlane(m4, plotx1 = "x1", plotx2 = "x2", phi = 10, theta = -80,
                  lcol = gray(.70))
@
\caption{plotPlane for the Interactive Model m4\label{fig:pp100}}
\end{figure}

Because plotPlane() is a simple convenience wrapper for R's persp() function, it responds to the same customizing arguments that persp() would allow. The arguments phi and theta will rotate the figure, for example. The output in Figure \ref{fig:pp100} is produced by the following.

<>=
<>
@

\noindent One of the major educational benefits of the 3-D figure is that students can easily see that a model with a simple interaction effect is \emph{not a linear model any more}. We will return to that point in the discussion of mean centering in regression analysis.

Recall that model m5 uses the rather complicated nonlinear formula \inputencoding{latin9}\lstinline!log(x2) + x1 * x2!\inputencoding{utf8}. The plotCurves() output for that was already presented in Figure \ref{fig:pcps20}. The three-dimensional view of the same is presented in Figure \ref{fig:pp110}, but with an added twist. The twist is that the predicted value lines from the 2-D plot functions can be superimposed on the plane. The function addLines() does the work of translating the 2-D plot object onto the regression plane.
\begin{figure}
<>=
ppm5 <- plotPlane(m5, plotx1 = "x2", plotx2 = "x1", phi = 0, npp = 15,
                  lcol = gray(.80))
addLines(from = m5pc, to = ppm5, col = m5pc$col)
@
\caption{Making plotSlopes and plotPlane Work Together\label{fig:pp110}}
\end{figure}

\section{Standardized, Mean-Centered, and Residual-Centered Regressions}

\subsection{Standardized regression}

Many of us learned to conduct regression analysis with SPSS, which reported both the ordinary (unstandardized) regression coefficients as well as a column of ``beta weights'', the output of a ``standardized'' regression analysis. Each variable, for example $x1_{i}$, was replaced by an estimated $Z$-score: $(x1_{i}-\overline{x1})/std.dev.(x1_{i})$. Some people think these coefficients are easier to interpret (but, for a strong cautionary argument against them, see \citealp{King1986}).

R offers no such thing as standardized regression, probably because this practice is thought to be mistaken. The automatic standardization of all predictors, no matter whether they are categorical, interaction terms, or transformed values (such as logs), is dangerous. To illustrate that, the rockchalk package introduces a function called standardize(). Each column of the design matrix is scaled to a new variable with mean 0 and standard deviation 1. The result from standardize() will be an R lm object, which will respond to any follow-up analysis commands. For example:

<<>>=
m4 <- lm(y4 ~ x1 * x2, data = dat)
m4s <- standardize(m4)
@

I doubt that a reasonable person would actually want a standardized regression, and I have tried to warn users in the output.

<<>>=
summary(m4s)
@

\begin{table}
\caption{Comparing Ordinary and Standardized Regression\label{tab:stdreg10}}
<>=
or10 <- outreg(list(m4, m4s), tight = F,
               modelLabels = c("Not Standardized", "Standardized"))
@
\end{table}

\subsection{Mean-centered Interaction Models}

Sometimes people will fit a model like this
\begin{equation}
y_{i}=b_{0}+b_{1}x1_{i}+b_{2}x2_{i}+e_{i}
\end{equation}
\noindent and then wonder, ``is there an interaction between $x1_{i}$ and $x2_{i}$?'' The natural inclination is to run this model,
\inputencoding{latin9}
\begin{lstlisting}
m1 <- lm(y ~ x1*x2)
\end{lstlisting}
\inputencoding{utf8}
\noindent or its equivalent
\inputencoding{latin9}
\begin{lstlisting}
m2 <- lm(y ~ x1 + x2 + x1:x2)
\end{lstlisting}
\inputencoding{utf8}

Researchers have been advised that they should not run the ordinary interaction model without ``mean-centering'' the predictors \citep{Aiken1991}. They are advised to replace $x1_{i}$ with $(x1_{i}-\overline{x1})$ and $x2_{i}$ with $(x2_{i}-\overline{x2})$, so that the fitted model will be
\begin{equation}
y_{i}=b_{0}+b_{1}(x1_{i}-\overline{x1})+b_{2}(x2_{i}-\overline{x2})+b_{3}(x1_{i}-\overline{x1})(x2_{i}-\overline{x2})+e_{i}
\end{equation}

This is a little tedious to do in R, so I provide a function meanCenter() that can handle the details. meanCenter() will receive a model, scan it for interaction terms, and then center the variables that are involved in interactions. We previously fit the model m4, and now we center it.

<<>>=
m4mc <- meanCenter(m4)
summary(m4mc)
@

By default, meanCenter() will only center the variables involved in an interaction, and it leaves the others unchanged. The user can request a different treatment of the variables. Version 1.8 introduces the argument ``terms'', which allows the user to list the names of the predictors that should be centered.
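For example, the following sketch (not executed here; the object name is hypothetical) centers only the predictor x1 and leaves x2 as observed; see the meanCenter help page for the details:
\inputencoding{latin9}
\begin{lstlisting}
## center only x1, leaving x2 in its observed metric
m4mcx1 <- meanCenter(m4, terms = c("x1"))
\end{lstlisting}
\inputencoding{utf8}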
If the user wants all of the numeric predictors to be mean-centered, setting the argument centerOnlyInteractors to FALSE would be appropriate:
\inputencoding{latin9}
\begin{lstlisting}
m4mc <- meanCenter(m4, centerOnlyInteractors = FALSE)
\end{lstlisting}
\inputencoding{utf8}
By default, it does not standardize while centering (but the user can request standardization with the argument standardize = TRUE). The option centerDV causes the dependent variable to be centered as well.

\subsection{Residual-centered Models}

Residual-centering \citep{LittleBovaird2006} is another adjustment that has been recommended for models that include interactions or squared terms. Like mean-centering, it is often recommended as a way to obtain smaller standard errors or to make estimates more numerically stable. Like mean-centering, it causes a superficial change in the estimated coefficients, but the predicted values and the regression relationship are not actually changed. Nothing of substance is altered.

The residualCenter() function is used in the same manner as meanCenter(). The user fits an interactive model and the result object is passed to residualCenter() like so:

<<>>=
m4rc <- residualCenter(m4)
summary(m4rc)
@

I would explain residual-centering as follows. Suppose we fit the linear model, with no interaction (note, I'm calling the coefficients $c_{j}$, not $b_{j}$ as usual):
\begin{equation}
y_{i}=c_{0}+c_{1}x1_{i}+c_{2}x2_{i}+e_{i}.\label{eq:rc20}
\end{equation}
\noindent Let's proceed as if those parameter estimates, $\hat{c}_{1}$, $\hat{c}_{2}$, are the ``right ones'' for our analytical purpose. We'd like to fit an interactive model, but protect the linear parts from fluctuation. In R, when we run the model \inputencoding{latin9}\lstinline!lm(y ~ x1 * x2)!\inputencoding{utf8}, we are allowing all of the coefficients of all variables to fluctuate
\begin{equation}
y_{i}=b_{0}+b_{1}x1_{i}+b_{2}x2_{i}+b_{3}x1_{i}\times x2_{i}+e_{i}\label{eq:rcwant}
\end{equation}
Residual-centering is one way to stabilize the estimation by assuring that $\hat{b}_{1}=\hat{c}_{1}$ and $\hat{b}_{2}=\hat{c}_{2}$. Only the coefficient $\hat{b}_{3}$ floats freely. One of the reasons that residual-centering is so appealing is that its stabilizing benefit is obtained almost by accident.

Here is the gist of the calculations. First, estimate a regression in which the left hand side is the interaction product term:
\begin{equation}
(x1_{i}\times x2_{i})=d_{0}+d_{1}x1_{i}+d_{2}x2_{i}+u_{i}\label{eq:residCentered}
\end{equation}
\noindent The residuals from that regression are, by definition, orthogonal to both $x1$ and $x2$. Call those fitted residuals $\widehat{u}_{i}$. Then we run the interactive regression, replacing the column of the predictor $x1_{i}\times x2_{i}$ with $\widehat{u}_{i}$. That is to say, the model we want, equation (\ref{eq:rcwant}), is estimated as:
\begin{equation}
y_{i}=b_{0}+b_{1}x1_{i}+b_{2}x2_{i}+b_{3}\widehat{u}_{i}+e_{i}.\label{eq:rc10-1}
\end{equation}
\noindent In essence, we have taken the interaction $(x1_{i}\times x2_{i})$ and purged it of its parts that are linearly related to $x1_{i}$ and $x2_{i}$.

rockchalk 1.6 included summary, print, and predict methods for the residual-centered regression objects. It is worth mentioning that the code can handle interactions of arbitrarily many predictors. If the formula has \inputencoding{latin9}\lstinline!lm(y ~ x1 * x2 * x3 * x4)!\inputencoding{utf8}, for example, this implies that many separate interactions must be calculated.
We need to calculate residual-centered residuals for $x1\cdot x2$, $x1\cdot x3$, $x1\cdot x4$, $x2\cdot x3$, and so forth, then use them as predictors to obtain centered estimates of three-way terms such as $x1\cdot x2\cdot x3$, and then use those centered values as predictors for the four-way interaction. Aspiring R programmers who want to learn about programming with R formula objects might benefit from the study of the file residualCenter.R in the rockchalk source code.

\section{A Brief Analysis of Mean-Centering}

We can put the tools together by making a little detour into the question that seems to plague every regression analyst at one time or another: What does that interaction term \emph{really mean}? Along the way, we will try to dispel the idea that centering somehow makes estimates ``better'' or more numerically precise.

The primary advocates of centering as a way to deal with numerical instability are \citet{Aiken1991}, who integrated that advice into the very widely used regression textbook, \emph{Applied Multiple Regression/Correlation for the Behavioral Sciences} \citep{Cohen2002}. They claim that the inclusion of interactions causes ``inessential multicollinearity'' that is alleviated by centering. The advice is widely followed. One statistics book intended for biologists observed, for example, ``We support the recommendation of Aiken \& West (1991) and others that multiple regression with interaction terms should be fitted to data with centered predictor values'' (\citealp{QuinnKeough2002}, Chapter 6). Technical rebuttals have been published already (\citealp{Kromrey1998}), but the matter still seems not widely understood. The argument is not that mean-centering (or residual-centering) is wrong, but rather that it is unnecessary. It is irrelevant and, possibly, misleading.

At the core of the matter is the fact that our uncertainty about regression estimates depends on our point of view. Please review the confidence interval in Figure \ref{fig:ps05}. The $y$ axis is not even ``in the picture.'' Would one's appreciation of the regression's predictive line be enhanced if the $y$ axis were moved into the picture?

We can move the $y$ axis by centering the predictor variable. Suppose we replace $x_{i}$ with $x_{i}-5$ and then re-estimate the model. That has the simple effect of moving the $y$ axis 5 units to the right. The slope is unchanged, and the reported intercept is changed in a completely superficial way. The predicted value line, the slope estimates, the residual standard error, and so forth, are not changed in any substantial way. This is simply a matter of user convenience. I believe that no reasonable person can say the regression is ``better'' after centering $x_{i}$.

However, there is a superficial difference that has deceived many authors. Notice that the confidence interval is hourglass shaped. \emph{At the new $y$ axis}, our prediction is more precise. If we move the $y$ axis further to the right, into the center of the data, say by mean-centering ($x_{i}-10$), we move to the position that allows an even more precise prediction. The estimate of the intercept's standard error will be smaller for the obvious reason. We are not actually gaining certainty; we are simply reporting our most favorable ``snapshot'' of it. The predicted value and the confidence interval for any observed value of $x_{i}$ are completely unchanged by centering. It should not surprise the reader to learn that mean-centering interactive predictors improves the standard errors in the same illusory way.
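A minimal sketch (not executed here, using simulated data invented only for illustration) makes the re-positioning of the $y$ axis concrete: shifting a predictor changes the reported intercept and its standard error, but leaves the slope, its standard error, and the fitted values untouched.
\inputencoding{latin9}
\begin{lstlisting}
x <- rnorm(100, mean = 10, sd = 2)
y <- 3 + 0.5 * x + rnorm(100, sd = 5)
## intercept reported where the y axis happens to sit (x = 0)
coef(summary(lm(y ~ x)))
## move the y axis to the mean of x: same slope, smaller intercept s.e.
coef(summary(lm(y ~ I(x - mean(x)))))
\end{lstlisting}
\inputencoding{utf8}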
Let's work through an example to see why this is so tempting. Suppose the true data generating mechanism is an interaction like so
\begin{equation}
y_{i}=2+0.1x1_{i}+0.1x2_{i}+0.2\cdot(x1_{i}\times x2_{i})+e_{i},
\end{equation}
\noindent where $e_{i}\sim N(0,300^{2})$ and $\rho_{x1,x2}=0.4$.

A regression analysis that ignores the interaction,
\inputencoding{latin9}
\begin{lstlisting}
lm(y ~ x1 + x2, data = dat2)
\end{lstlisting}
\inputencoding{utf8}
\noindent is reported in the first column of Table \ref{tab:meancenter10-1}. I've used outreg's new alpha argument to emphasize the ``really good'' estimates with more stars. Notice that everything is ``statistically significant!''

Unable to leave well enough alone, the researcher wonders, ``is there an interaction between $x1$ and $x2$?'' Run
\inputencoding{latin9}
\begin{lstlisting}
lm(y ~ x1 * x2, data = dat2)
\end{lstlisting}
\inputencoding{utf8}
\noindent The second column in Table \ref{tab:meancenter10-1} summarizes that regression. Be prepared for a shock when you scan the estimates. Almost everybody I know has said ``what the heck?'' or ``Holy Cow!'' or ``Oh My God, my great result went to Hell, I'll never get tenure!'' Neither of the key variables is ``statistically significant'' any more.

<>=
dat2 <- genCorrelatedData3(N = 400, rho = .4, stde = 300, beta = c(2, 0.1, 0.1, 0.2))
m6linear <- lm(y ~ x1 + x2, data = dat2)
m6int <- lm(y ~ x1 * x2, data = dat2)
m6mc <- meanCenter(m6int)
m6rc <- residualCenter(m6int)
@

\begin{table}
\caption{Comparing Regressions\label{tab:meancenter10-1}}
<>=
or11 <- outreg(list(m6linear, m6int, m6mc, m6rc), tight = F,
               modelLabels = c("Linear", "Interaction", "Mean-centered",
                               "Residual-centered"),
               alpha = c(0.05, 0.01))
@
\end{table}

Cohen et al. claim that the apparent instability of the coefficients is a reflection of ``inessential collinearity,'' due to the fact that $x1$ and $x2$ are correlated with the new term, $x1\times x2$. They advised their readers to ``mean-center'' their predictors and run the regression again. Remember the hourglass shape of the confidence interval. By mean-centering, we are re-positioning ourselves for a much more favorable snapshot. The welcome effect of the centered estimates is found in the third column of Table \ref{tab:meancenter10-1}. The point estimates in that snapshot are ``significant again.'' It appears we have ``solved'' the problem of inessential collinearity.

The first hint of trouble is in the fact that the coefficients of the interactive effects in columns 2 and 3 are identical. Those coefficients are the same because they are estimates of the cross partial derivative $\partial^{2}y/\partial x1\partial x2$. That particular value is the same, no matter where we position the $y$ axis, as it should be. Note as well that the root mean square error and $R^{2}$ estimates are identical. Everything that we expect to remain the same is the same. Except for the fact that the slopes and their hypothesis tests seem better in the centered model, we would think there is nothing interesting here.

Here's a puzzle for you. Consider Figure \ref{fig:pscentering}, which shows the predicted values and confidence intervals from the centered and uncentered regressions. Is there any substantively important difference between these two regressions?
\begin{figure}
<>=
par(mfcol = c(2, 1))
plotSlopes(m6int, plotx = "x1", modx = "x2", modxVals = "std.dev.", n = 2,
           interval = "confidence", main = "Not Centered")
plotSlopes(m6mc, plotx = "x1c", modx = "x2c", modxVals = "std.dev.", n = 2,
           interval = "confidence", main = "Mean Centered")
par(mfcol = c(1, 1))
@
\includegraphics[height=8in]{rockchalk-pscenter.pdf}
\caption{plotSlopes for the centered and non-centered regressions\label{fig:pscentering}}
\end{figure}

Perhaps the 2-D plots are not persuasive. Don't stop yet. The 3-D plots will help quite a bit. We have not yet grasped the most fundamental change caused by the insertion of the interaction term. When we insert $x1_{i}\times x2_{i}$, we change the fundamental nature of the regression surface. The surface of the fitted model is no longer a ``flat'' plane, but rather it is a curving surface.

I've assembled 3-D plots in Figure \ref{fig:mcenter30}. We see that the ordinary interaction, mean-centered, and residual-centered models produce identical predicted values! The only difference in the figures is that the axes of the predictors have been re-scaled, so that the $y$ axis is (implicitly) re-positioned. Now that we understand the situation, we could play around with the data and move the axis back and forth until we arrive at a position that is most favorable to our interpretation.

\begin{figure}
<>=
op <- par(no.readonly = TRUE)
par(mfcol = c(2, 2))
par(mar = c(2, 2, 2, 1))
plotPlane(m6linear, plotx1 = "x1", plotx2 = "x2", plotPoints = FALSE,
          main = "Linear", ticktype = "detailed")
plotPlane(m6int, plotx1 = "x1", plotx2 = "x2", plotPoints = FALSE,
          main = "Interaction: Not Centered", ticktype = "detailed")
plotPlane(m6mc, plotx1 = "x1c", plotx2 = "x2c", plotPoints = FALSE,
          main = "Mean-centered", ticktype = "detailed")
plotPlane(m6rc, plotx1 = "x1", plotx2 = "x2", plotPoints = FALSE,
          main = "Residual-centered", ticktype = "detailed")
par(op)
@
\caption{Predicted Planes from Centered and Uncentered Fits are Identical\label{fig:mcenter30}}
\end{figure}

The regression coefficient estimates are snapshots, each summarizing the curvature at one particular point on a curving surface. It seems quite apparent in Figure \ref{fig:mcenter30} that the models are identical, and yet we receive different regression reports from different spots. The non-centered data offers us the slope estimate from the ``front-left'' part of the graph. Mean-centered estimates report on the slope in the middle of the graph. In the rockchalk examples folder, one can find a file called ``centeredRegression.R'' that walks through this argument step by step.

What about residual-centering? Because the transformation that it employs is more abstract, I initially thought it was actually a different model. And yet it is not. The residual-centered model is completely equivalent to the ordinary interaction model and the mean-centered model. For a given combination of the input values, the predicted values and confidence intervals are the same. The predicted values of the ordinary interactive model, the mean-centered model, and the residual-centered models are illustrated in Figure \ref{fig:rcenter40}.
\begin{figure}
<>=
## fitted values from the three equivalent models
m6intpred <- fitted(m6int)
m6mcpred <- fitted(m6mc)
m6rcpred <- fitted(m6rc)
op <- par(no.readonly = TRUE)
par(mfcol = c(1, 2))
plot(m6intpred, m6rcpred, main = "",
     xlab = "Predictions of Uncentered Interaction",
     ylab = "Residual-centered Predictions")
predcor <- round(cor(m6intpred, m6rcpred), 3)
legend("topleft", legend = paste0("Correlation=", predcor))
plot(m6mcpred, m6rcpred, main = "",
     xlab = "Mean-centered Predictions",
     ylab = "Residual-centered Predictions")
predcor <- round(cor(m6mcpred, m6rcpred), 3)
legend("topleft", legend = paste0("Correlation=", predcor))
par(op)
@
\includegraphics[width = 6.5in]{rockchalk-rcenter40}
\caption{Predicted Values of Mean and Residual-centered Models\label{fig:rcenter40}}
\end{figure}

The take-away point from this is that there is no free lunch in regression analysis. If re-scaling a variable by adding or subtracting a constant seems to change a result, one should be cautious and suspect an error.

In order to drive the point home, I'd like to show that it is possible to translate between the estimates of any one of these fitted models and the estimates of the others. The ordinary model is
\begin{equation}
y_{i}=b_{0}+b_{1}x1_{i}+b_{2}x2_{i}+b_{3}(x1_{i}\times x2_{i})+e1_{i}\label{eq:int}
\end{equation}
The mean-centered model is
\begin{equation}
y_{i}=c_{0}+c_{1}(x1_{i}-\overline{x1})+c_{2}(x2_{i}-\overline{x2})+c_{3}(x1_{i}-\overline{x1})\cdot(x2_{i}-\overline{x2})+e2_{i}\label{eq:mc1}
\end{equation}
In order to compare with equation \ref{eq:int}, we would re-arrange like so
\begin{equation}
y_{i}=c_{0}+c_{1}(x1_{i})-c_{1}\overline{x1}+c_{2}(x2_{i})-c_{2}\overline{x2}+c_{3}(x1_{i}x2_{i}+\overline{x1}\overline{x2}-\overline{x1}x2_{i}-\overline{x2}x1_{i})+e2_{i}
\end{equation}
\begin{equation}
y_{i}=c_{0}+c_{1}(x1_{i})-c_{1}\overline{x1}+c_{2}(x2_{i})-c_{2}\overline{x2}+c_{3}(x1_{i}x2_{i})+c_{3}\overline{x1}\overline{x2}-c_{3}\overline{x1}x2_{i}-c_{3}\overline{x2}x1_{i}+e2_{i}
\end{equation}
\begin{equation}
y_{i}=\{c_{0}-c_{1}\overline{x1}-c_{2}\overline{x2}+c_{3}\overline{x1}\overline{x2}\}+\{c_{1}-c_{3}\overline{x2}\}x1_{i}+\{c_{2}-c_{3}\overline{x1}\}x2_{i}+c_{3}(x1_{i}x2_{i})+e2_{i}\label{eq:mc3}
\end{equation}
One can then compare the parameter estimates from equations \ref{eq:int} and \ref{eq:mc3}. Both \ref{eq:int} and \ref{eq:mc3} include a single parameter times $(x1_{i}x2_{i})$, leading one to expect that the estimate $\hat{b}_{3}$ should be equal to the estimate of $\hat{c}_{3}$, and they are! Less obviously, one can use the fitted coefficients from either model to deduce the fitted coefficients from the other. The following equalities describe that relationship.
\begin{eqnarray}
\hat{b}_{0} & = & \hat{c}_{0}-\hat{c}_{1}\overline{x1}-\hat{c}_{2}\overline{x2}+\hat{c}_{3}\overline{x1}\overline{x2}\\
\hat{b}_{1} & = & \hat{c}_{1}-\hat{c}_{3}\overline{x2}\\
\hat{b}_{2} & = & \hat{c}_{2}-\hat{c}_{3}\overline{x1}\\
\hat{b}_{3} & = & \hat{c}_{3}
\end{eqnarray}
The estimated fit of equation \ref{eq:mc1} would provide estimated coefficients $\hat{c}_{j}$, $j=0,\ldots,3$, which would then be used to calculate the estimates from the non-centered model.
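As a quick numerical check of those equalities, one can refit the centered model by hand and recover the uncentered coefficients. This sketch is not executed here; it simply applies the formulas above to the fits created earlier:
\inputencoding{latin9}
\begin{lstlisting}
dat2$x1c <- dat2$x1 - mean(dat2$x1)
dat2$x2c <- dat2$x2 - mean(dat2$x2)
cc <- coef(lm(y ~ x1c * x2c, data = dat2))
b <- coef(m6int)
xb1 <- mean(dat2$x1); xb2 <- mean(dat2$x2)
## b1 recovered from the centered coefficients
c(b["x1"], cc["x1c"] - cc["x1c:x2c"] * xb2)
## b0 recovered from the centered coefficients
c(b["(Intercept)"],
  cc["(Intercept)"] - cc["x1c"]*xb1 - cc["x2c"]*xb2 + cc["x1c:x2c"]*xb1*xb2)
\end{lstlisting}
\inputencoding{utf8}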
The estimation of the residual-centered model requires two steps. The first step regresses the product term on its component variables, yielding the fitted values
\begin{equation}
\widehat{(x1_{i}\times x2_{i})}=\hat{d}_{0}+\hat{d}_{1}x1_{i}+\hat{d}_{2}x2_{i}.
\end{equation}
\noindent The residual from that regression,
\begin{equation}
\widehat{u}_{i}=(x1_{i}\times x2_{i})-\widehat{(x1_{i}\times x2_{i})},
\end{equation}
\noindent is then inserted into equation \ref{eq:int} in place of the product term:
\begin{equation}
y_{i}=h_{0}+h_{1}x1_{i}+h_{2}x2_{i}+h_{3}\{x1_{i}\times x2_{i}-\widehat{x1_{i}\times x2_{i}}\}+e3_{i}\label{eq:rc1}
\end{equation}
Replacing $\widehat{x1_{i}\times x2_{i}}$ with $\hat{d}_{0}+\hat{d}_{1}x1_{i}+\hat{d}_{2}x2_{i}$, equation \ref{eq:rc1} becomes
\begin{eqnarray}
y_{i} & = & h_{0}+h_{1}x1_{i}+h_{2}x2_{i}+h_{3}\{x1_{i}\times x2_{i}-\hat{d}_{0}-\hat{d}_{1}x1_{i}-\hat{d}_{2}x2_{i}\}+e3_{i}\\
 & = & h_{0}+h_{1}x1_{i}+h_{2}x2_{i}+h_{3}\{x1_{i}\times x2_{i}\}-h_{3}\hat{d}_{0}-h_{3}\hat{d}_{1}x1_{i}-h_{3}\hat{d}_{2}x2_{i}+e3_{i}\\
 & = & \{h_{0}-h_{3}\hat{d}_{0}\}+\{h_{1}-h_{3}\hat{d}_{1}\}x1_{i}+\{h_{2}-h_{3}\hat{d}_{2}\}x2_{i}+h_{3}\{x1_{i}\times x2_{i}\}+e3_{i}
\end{eqnarray}
As in the previous comparison, we can translate coefficient estimates between the ordinary specification and the residual-centered model. The coefficient estimated for the product term, $\hat{h}_{3}$, should be equal to $\hat{b}_{3}$ and $\hat{c}_{3}$ (and it is!). If we fit the residual-centered model, \ref{eq:rc1}, we can re-generate the coefficients of the other models like so:
\begin{eqnarray}
\hat{b}_{0} & =\hat{c}_{0}-\hat{c}_{1}\overline{x1}-\hat{c}_{2}\overline{x2}+\hat{c}_{3}\overline{x1}\overline{x2}= & \hat{h}_{0}-\hat{h}_{3}\hat{d}_{0}\\
\hat{b}_{1} & =\hat{c}_{1}-\hat{c}_{3}\overline{x2}= & \hat{h}_{1}-\hat{h}_{3}\hat{d}_{1}\\
\hat{b}_{2} & =\hat{c}_{2}-\hat{c}_{3}\overline{x1}= & \hat{h}_{2}-\hat{h}_{3}\hat{d}_{2}
\end{eqnarray}
From the preceding, it should be clear enough that the three models are equivalent.

\section{Conclusion}

The rockchalk package is offered as a system of support for teachers and students of regression analysis. It should help with the preparation of plots, summary tables, and other diagnostics. A number of functions are currently offered in this package that have not been emphasized in this writeup. I would draw the reader's attention to the help pages for these functions:
\begin{description}
\item [{combineLevels}] a recoder for factor variables
\item [{mcDiagnose}] a one-stop shop for multicollinearity diagnostic information
\item [{getDeltaRsquare}] calculates the change in the $R^{2}$ that results from the omission of each variable. This is the squared semi-partial correlation coefficient.
\item [{getPartialCor}] calculates the partial correlation matrix of the predictors in a regression model
\item [{lazyCor~and~lazyCov}] convenient ways to create correlation and covariance matrices that are needed in many simulation exercises
\item [{mvrnorm}] a slightly improved version of the MASS package's multivariate normal generator.
\end{description}

\bibliographystyle{chicago}
\bibliography{rockchalk}

\end{document}