Structuring code that is robust to updates in the data, changes in
methodological specifications, etc., and can get to presentation-ready
results in an automated way can mitigate errors caused by painful,
tedious tasks–it can also be a huge time saver. The cheese
package was designed with this philosophy in mind, heavily influenced by
(and explicit dependencies on) tidy concepts. Its
intention is to provide general tools for working with data and results
during statistical analysis, in addition to a collection of functions
designated for specific statistical tasks.
Let’s take a closer look:
## Loading required package: cheeseThis data comes from the UCI Machine Learning Repository, containing a collection of demographic and clinical characteristics from 303 patients. It was subsequently processed and cleaned into a format suitable for analysis–details of which can be found here.
## # A tibble: 303 × 9
##      Age Sex    ChestPain           BP Cholesterol BloodSugar MaximumHR
##    <dbl> <fct>  <fct>            <dbl>       <dbl> <lgl>          <dbl>
##  1    63 Male   Typical angina     145         233 TRUE             150
##  2    67 Male   Asymptomatic       160         286 FALSE            108
##  3    67 Male   Asymptomatic       120         229 FALSE            129
##  4    37 Male   Non-anginal pain   130         250 FALSE            187
##  5    41 Female Atypical angina    130         204 FALSE            172
##  6    56 Male   Atypical angina    120         236 FALSE            178
##  7    62 Female Asymptomatic       140         268 FALSE            160
##  8    57 Female Asymptomatic       120         354 FALSE            163
##  9    63 Male   Asymptomatic       130         254 FALSE            147
## 10    53 Male   Asymptomatic       140         203 TRUE             155
## # ℹ 293 more rows
## # ℹ 2 more variables: ExerciseInducedAngina <fct>, HeartDisease <fct>The functions that will be introduced are roughly in order of their
development, as many build off of one another. Selection of columns is
done with non-standard evaluation (NSE) or with
tidyselect::select_helpers, where applicable.
divide() is used to split up a data frame into a list of
subsets. Suppose you want to split the example data set by sex and heart
disease status:
## $Female
## $Female$No
## # A tibble: 72 × 7
##      Age ChestPain     BP Cholesterol BloodSugar MaximumHR ExerciseInducedAngina
##    <dbl> <fct>      <dbl>       <dbl> <lgl>          <dbl> <fct>                
##  1    41 Atypical …   130         204 FALSE            172 No                   
##  2    57 Asymptoma…   120         354 FALSE            163 Yes                  
##  3    56 Atypical …   140         294 FALSE            153 No                   
##  4    48 Non-angin…   130         275 FALSE            139 No                   
##  5    58 Typical a…   150         283 TRUE             162 No                   
##  6    50 Non-angin…   120         219 FALSE            158 No                   
##  7    58 Non-angin…   120         340 FALSE            172 No                   
##  8    66 Typical a…   150         226 FALSE            114 No                   
##  9    69 Typical a…   140         239 FALSE            151 No                   
## 10    71 Atypical …   160         302 FALSE            162 No                   
## # ℹ 62 more rows
## 
## $Female$Yes
## # A tibble: 25 × 7
##      Age ChestPain     BP Cholesterol BloodSugar MaximumHR ExerciseInducedAngina
##    <dbl> <fct>      <dbl>       <dbl> <lgl>          <dbl> <fct>                
##  1    62 Asymptoma…   140         268 FALSE            160 No                   
##  2    65 Asymptoma…   150         225 FALSE            114 No                   
##  3    61 Asymptoma…   130         330 FALSE            169 No                   
##  4    51 Asymptoma…   130         305 FALSE            142 Yes                  
##  5    62 Asymptoma…   160         164 FALSE            145 No                   
##  6    60 Asymptoma…   150         258 FALSE            157 No                   
##  7    61 Asymptoma…   145         307 FALSE            146 Yes                  
##  8    43 Asymptoma…   132         341 TRUE             136 Yes                  
##  9    62 Non-angin…   130         263 FALSE             97 No                   
## 10    63 Asymptoma…   150         407 FALSE            154 No                   
## # ℹ 15 more rows
## 
## 
## $Male
## $Male$No
## # A tibble: 92 × 7
##      Age ChestPain     BP Cholesterol BloodSugar MaximumHR ExerciseInducedAngina
##    <dbl> <fct>      <dbl>       <dbl> <lgl>          <dbl> <fct>                
##  1    63 Typical a…   145         233 TRUE             150 No                   
##  2    37 Non-angin…   130         250 FALSE            187 No                   
##  3    56 Atypical …   120         236 FALSE            178 No                   
##  4    57 Asymptoma…   140         192 FALSE            148 No                   
##  5    44 Atypical …   120         263 FALSE            173 No                   
##  6    52 Non-angin…   172         199 TRUE             162 No                   
##  7    57 Non-angin…   150         168 FALSE            174 No                   
##  8    54 Asymptoma…   140         239 FALSE            160 No                   
##  9    49 Atypical …   130         266 FALSE            171 No                   
## 10    64 Typical a…   110         211 FALSE            144 Yes                  
## # ℹ 82 more rows
## 
## $Male$Yes
## # A tibble: 114 × 7
##      Age ChestPain     BP Cholesterol BloodSugar MaximumHR ExerciseInducedAngina
##    <dbl> <fct>      <dbl>       <dbl> <lgl>          <dbl> <fct>                
##  1    67 Asymptoma…   160         286 FALSE            108 Yes                  
##  2    67 Asymptoma…   120         229 FALSE            129 Yes                  
##  3    63 Asymptoma…   130         254 FALSE            147 No                   
##  4    53 Asymptoma…   140         203 TRUE             155 Yes                  
##  5    56 Non-angin…   130         256 TRUE             142 Yes                  
##  6    48 Atypical …   110         229 FALSE            168 No                   
##  7    58 Atypical …   120         284 FALSE            160 No                   
##  8    58 Non-angin…   132         224 FALSE            173 No                   
##  9    60 Asymptoma…   130         206 FALSE            132 Yes                  
## 10    40 Asymptoma…   110         167 FALSE            114 Yes                  
## # ℹ 104 more rowsThe default behavior is to continually split the data in a
hierarchical structure based on the order of the input columns, and to
remove them from the result. The keep argument can be used
to retain the column(s) in the data frames.
fasten() allows you to take a list structure that
contains data frames at an arbitrary depth, and roll them back up into a
single data frame. This is useful when a statistical process needs to be
mapped to many subsets, and you want to be able to easily collect the
results without repeatedly binding data. You can call the function with
the divided data from above:
## # A tibble: 303 × 7
##      Age ChestPain     BP Cholesterol BloodSugar MaximumHR ExerciseInducedAngina
##    <dbl> <fct>      <dbl>       <dbl> <lgl>          <dbl> <fct>                
##  1    41 Atypical …   130         204 FALSE            172 No                   
##  2    57 Asymptoma…   120         354 FALSE            163 Yes                  
##  3    56 Atypical …   140         294 FALSE            153 No                   
##  4    48 Non-angin…   130         275 FALSE            139 No                   
##  5    58 Typical a…   150         283 TRUE             162 No                   
##  6    50 Non-angin…   120         219 FALSE            158 No                   
##  7    58 Non-angin…   120         340 FALSE            172 No                   
##  8    66 Typical a…   150         226 FALSE            114 No                   
##  9    69 Typical a…   140         239 FALSE            151 No                   
## 10    71 Atypical …   160         302 FALSE            162 No                   
## # ℹ 293 more rowsIt was binded back to the original number of rows, but it lost the
columns it was split by. The into argument can be used to
handle this:
## # A tibble: 303 × 9
##    Sex    HeartDisease   Age ChestPain       BP Cholesterol BloodSugar MaximumHR
##    <chr>  <chr>        <dbl> <fct>        <dbl>       <dbl> <lgl>          <dbl>
##  1 Female No              41 Atypical an…   130         204 FALSE            172
##  2 Female No              57 Asymptomatic   120         354 FALSE            163
##  3 Female No              56 Atypical an…   140         294 FALSE            153
##  4 Female No              48 Non-anginal…   130         275 FALSE            139
##  5 Female No              58 Typical ang…   150         283 TRUE             162
##  6 Female No              50 Non-anginal…   120         219 FALSE            158
##  7 Female No              58 Non-anginal…   120         340 FALSE            172
##  8 Female No              66 Typical ang…   150         226 FALSE            114
##  9 Female No              69 Typical ang…   140         239 FALSE            151
## 10 Female No              71 Atypical an…   160         302 FALSE            162
## # ℹ 293 more rows
## # ℹ 1 more variable: ExerciseInducedAngina <fct>The positions of the into values always lineup with the
level of the list hierarchy. So, for example, if you don’t care about
retaining the heart disease status, you can do this:
## # A tibble: 303 × 8
##    Sex      Age ChestPain           BP Cholesterol BloodSugar MaximumHR
##    <chr>  <dbl> <fct>            <dbl>       <dbl> <lgl>          <dbl>
##  1 Female    41 Atypical angina    130         204 FALSE            172
##  2 Female    57 Asymptomatic       120         354 FALSE            163
##  3 Female    56 Atypical angina    140         294 FALSE            153
##  4 Female    48 Non-anginal pain   130         275 FALSE            139
##  5 Female    58 Typical angina     150         283 TRUE             162
##  6 Female    50 Non-anginal pain   120         219 FALSE            158
##  7 Female    58 Non-anginal pain   120         340 FALSE            172
##  8 Female    66 Typical angina     150         226 FALSE            114
##  9 Female    69 Typical angina     140         239 FALSE            151
## 10 Female    71 Atypical angina    160         302 FALSE            162
## # ℹ 293 more rows
## # ℹ 1 more variable: ExerciseInducedAngina <fct>In contrast, if you want to forgo the sex indicator, empty strings will need to be used as placeholders so the names are applied at the correct levels.
## # A tibble: 303 × 8
##    HeartDisease   Age ChestPain           BP Cholesterol BloodSugar MaximumHR
##    <chr>        <dbl> <fct>            <dbl>       <dbl> <lgl>          <dbl>
##  1 No              41 Atypical angina    130         204 FALSE            172
##  2 No              57 Asymptomatic       120         354 FALSE            163
##  3 No              56 Atypical angina    140         294 FALSE            153
##  4 No              48 Non-anginal pain   130         275 FALSE            139
##  5 No              58 Typical angina     150         283 TRUE             162
##  6 No              50 Non-anginal pain   120         219 FALSE            158
##  7 No              58 Non-anginal pain   120         340 FALSE            172
##  8 No              66 Typical angina     150         226 FALSE            114
##  9 No              69 Typical angina     140         239 FALSE            151
## 10 No              71 Atypical angina    160         302 FALSE            162
## # ℹ 293 more rows
## # ℹ 1 more variable: ExerciseInducedAngina <fct>Obviously, the classes of the split columns are not retained from the original data since the splitting and binding processes are independent.
As shown above, the default behavior for these functions is to split
or bind “as much as possible”. However, it can be useful to have control
over the shape of the split. This is where the depth
argument comes in. Suppose you want the same data frames at the leaves
of the list, but only one level deep:
## $`Female|No`
## # A tibble: 72 × 7
##      Age ChestPain     BP Cholesterol BloodSugar MaximumHR ExerciseInducedAngina
##    <dbl> <fct>      <dbl>       <dbl> <lgl>          <dbl> <fct>                
##  1    41 Atypical …   130         204 FALSE            172 No                   
##  2    57 Asymptoma…   120         354 FALSE            163 Yes                  
##  3    56 Atypical …   140         294 FALSE            153 No                   
##  4    48 Non-angin…   130         275 FALSE            139 No                   
##  5    58 Typical a…   150         283 TRUE             162 No                   
##  6    50 Non-angin…   120         219 FALSE            158 No                   
##  7    58 Non-angin…   120         340 FALSE            172 No                   
##  8    66 Typical a…   150         226 FALSE            114 No                   
##  9    69 Typical a…   140         239 FALSE            151 No                   
## 10    71 Atypical …   160         302 FALSE            162 No                   
## # ℹ 62 more rows
## 
## $`Male|No`
## # A tibble: 92 × 7
##      Age ChestPain     BP Cholesterol BloodSugar MaximumHR ExerciseInducedAngina
##    <dbl> <fct>      <dbl>       <dbl> <lgl>          <dbl> <fct>                
##  1    63 Typical a…   145         233 TRUE             150 No                   
##  2    37 Non-angin…   130         250 FALSE            187 No                   
##  3    56 Atypical …   120         236 FALSE            178 No                   
##  4    57 Asymptoma…   140         192 FALSE            148 No                   
##  5    44 Atypical …   120         263 FALSE            173 No                   
##  6    52 Non-angin…   172         199 TRUE             162 No                   
##  7    57 Non-angin…   150         168 FALSE            174 No                   
##  8    54 Asymptoma…   140         239 FALSE            160 No                   
##  9    49 Atypical …   130         266 FALSE            171 No                   
## 10    64 Typical a…   110         211 FALSE            144 Yes                  
## # ℹ 82 more rows
## 
## $`Female|Yes`
## # A tibble: 25 × 7
##      Age ChestPain     BP Cholesterol BloodSugar MaximumHR ExerciseInducedAngina
##    <dbl> <fct>      <dbl>       <dbl> <lgl>          <dbl> <fct>                
##  1    62 Asymptoma…   140         268 FALSE            160 No                   
##  2    65 Asymptoma…   150         225 FALSE            114 No                   
##  3    61 Asymptoma…   130         330 FALSE            169 No                   
##  4    51 Asymptoma…   130         305 FALSE            142 Yes                  
##  5    62 Asymptoma…   160         164 FALSE            145 No                   
##  6    60 Asymptoma…   150         258 FALSE            157 No                   
##  7    61 Asymptoma…   145         307 FALSE            146 Yes                  
##  8    43 Asymptoma…   132         341 TRUE             136 Yes                  
##  9    62 Non-angin…   130         263 FALSE             97 No                   
## 10    63 Asymptoma…   150         407 FALSE            154 No                   
## # ℹ 15 more rows
## 
## $`Male|Yes`
## # A tibble: 114 × 7
##      Age ChestPain     BP Cholesterol BloodSugar MaximumHR ExerciseInducedAngina
##    <dbl> <fct>      <dbl>       <dbl> <lgl>          <dbl> <fct>                
##  1    67 Asymptoma…   160         286 FALSE            108 Yes                  
##  2    67 Asymptoma…   120         229 FALSE            129 Yes                  
##  3    63 Asymptoma…   130         254 FALSE            147 No                   
##  4    53 Asymptoma…   140         203 TRUE             155 Yes                  
##  5    56 Non-angin…   130         256 TRUE             142 Yes                  
##  6    48 Atypical …   110         229 FALSE            168 No                   
##  7    58 Atypical …   120         284 FALSE            160 No                   
##  8    58 Non-angin…   132         224 FALSE            173 No                   
##  9    60 Asymptoma…   130         206 FALSE            132 Yes                  
## 10    40 Asymptoma…   110         167 FALSE            114 Yes                  
## # ℹ 104 more rowsYou now have list with four elements containing the subsets–the names
of which are the concatenated levels of the split columns (see the
sep argument).
This argument also works when binding data. Going back to the original divided data frame, you may only want to bind the internal lists.
## $Female
## # A tibble: 97 × 8
##    HeartDisease   Age ChestPain           BP Cholesterol BloodSugar MaximumHR
##    <chr>        <dbl> <fct>            <dbl>       <dbl> <lgl>          <dbl>
##  1 No              41 Atypical angina    130         204 FALSE            172
##  2 No              57 Asymptomatic       120         354 FALSE            163
##  3 No              56 Atypical angina    140         294 FALSE            153
##  4 No              48 Non-anginal pain   130         275 FALSE            139
##  5 No              58 Typical angina     150         283 TRUE             162
##  6 No              50 Non-anginal pain   120         219 FALSE            158
##  7 No              58 Non-anginal pain   120         340 FALSE            172
##  8 No              66 Typical angina     150         226 FALSE            114
##  9 No              69 Typical angina     140         239 FALSE            151
## 10 No              71 Atypical angina    160         302 FALSE            162
## # ℹ 87 more rows
## # ℹ 1 more variable: ExerciseInducedAngina <fct>
## 
## $Male
## # A tibble: 206 × 8
##    HeartDisease   Age ChestPain           BP Cholesterol BloodSugar MaximumHR
##    <chr>        <dbl> <fct>            <dbl>       <dbl> <lgl>          <dbl>
##  1 No              63 Typical angina     145         233 TRUE             150
##  2 No              37 Non-anginal pain   130         250 FALSE            187
##  3 No              56 Atypical angina    120         236 FALSE            178
##  4 No              57 Asymptomatic       140         192 FALSE            148
##  5 No              44 Atypical angina    120         263 FALSE            173
##  6 No              52 Non-anginal pain   172         199 TRUE             162
##  7 No              57 Non-anginal pain   150         168 FALSE            174
##  8 No              54 Asymptomatic       140         239 FALSE            160
##  9 No              49 Atypical angina    130         266 FALSE            171
## 10 No              64 Typical angina     110         211 FALSE            144
## # ℹ 196 more rows
## # ℹ 1 more variable: ExerciseInducedAngina <fct>Note that the positions of the into values are directly
related to how shallow/deep you want the result to be.
The depth can also be controlled relative to the leaves by using negative integers. This can be useful if it is unclear how deep the list will be (or is). In the example above, suppose you didn’t know how deep the list was, but you knew the heart disease status was the most internal split, and thus wanted to bind it.
## $Female
## # A tibble: 97 × 8
##    HeartDisease   Age ChestPain           BP Cholesterol BloodSugar MaximumHR
##    <chr>        <dbl> <fct>            <dbl>       <dbl> <lgl>          <dbl>
##  1 No              41 Atypical angina    130         204 FALSE            172
##  2 No              57 Asymptomatic       120         354 FALSE            163
##  3 No              56 Atypical angina    140         294 FALSE            153
##  4 No              48 Non-anginal pain   130         275 FALSE            139
##  5 No              58 Typical angina     150         283 TRUE             162
##  6 No              50 Non-anginal pain   120         219 FALSE            158
##  7 No              58 Non-anginal pain   120         340 FALSE            172
##  8 No              66 Typical angina     150         226 FALSE            114
##  9 No              69 Typical angina     140         239 FALSE            151
## 10 No              71 Atypical angina    160         302 FALSE            162
## # ℹ 87 more rows
## # ℹ 1 more variable: ExerciseInducedAngina <fct>
## 
## $Male
## # A tibble: 206 × 8
##    HeartDisease   Age ChestPain           BP Cholesterol BloodSugar MaximumHR
##    <chr>        <dbl> <fct>            <dbl>       <dbl> <lgl>          <dbl>
##  1 No              63 Typical angina     145         233 TRUE             150
##  2 No              37 Non-anginal pain   130         250 FALSE            187
##  3 No              56 Atypical angina    120         236 FALSE            178
##  4 No              57 Asymptomatic       140         192 FALSE            148
##  5 No              44 Atypical angina    120         263 FALSE            173
##  6 No              52 Non-anginal pain   172         199 TRUE             162
##  7 No              57 Non-anginal pain   150         168 FALSE            174
##  8 No              54 Asymptomatic       140         239 FALSE            160
##  9 No              49 Atypical angina    130         266 FALSE            171
## 10 No              64 Typical angina     110         211 FALSE            144
## # ℹ 196 more rows
## # ℹ 1 more variable: ExerciseInducedAngina <fct>The new depth is one less than the input. In this case, the same result was achieved because of the symmetry.
The next set of functions are rooted in functional programming
(i.e. purrr). In
each instance, given a pre-defined function, you can easily control the
scope of the data in which it is evaluated on. This is an incredibly
useful and efficient philosophy for generating reusable code that is
non-repetitive.
Often times a statistical analysis is concerned with multiple
outcomes/targets, though each one potentially uses the same set of
explanatory variables. You could of course hard code the specific
analysis steps for each outcome (e.g. formulas, etc.), but then the code
becomes repetitive. You could also make separate data sets for each
outcome with the associated predictors and then apply the same analysis
process to each data set, but then you waste resources by storing
multiple copies of the same data in memory. dish() allows
you to distribute a function to various pairwise sets of columns from
the same data set in a single call. The motivation described above is
merely that–this is a general function that can be applied to any
context.
As a simple first example, lets compute the Pearson correlation of blood pressure and cholesterol with all numeric columns:
## $BP
## $BP$Age
## [1] 0.2849459
## 
## $BP$BP
## [1] 1
## 
## $BP$Cholesterol
## [1] 0.1301201
## 
## $BP$MaximumHR
## [1] -0.04535088
## 
## 
## $Cholesterol
## $Cholesterol$Age
## [1] 0.2089503
## 
## $Cholesterol$BP
## [1] 0.1301201
## 
## $Cholesterol$Cholesterol
## [1] 1
## 
## $Cholesterol$MaximumHR
## [1] -0.003431832The left argument controls which columns are entered in
the first argument of f, and the right
argument controls which columns are entered into the second. In the
example, you’ll notice that the left columns were also
included in the set of right columns, because they are, in
fact, numeric(). If you didn’t want this, you can omit the
right argument, which will then include everything not in
left as the second argument of f:
heart_disease %>%
  dplyr::select(where(is.numeric)) %>% #Need to do this so only numeric columns are evaluated
  dish(
    f = cor,
    left = c(BP, Cholesterol)
  )## $BP
## $BP$Age
## [1] 0.2849459
## 
## $BP$MaximumHR
## [1] -0.04535088
## 
## 
## $Cholesterol
## $Cholesterol$Age
## [1] 0.2089503
## 
## $Cholesterol$MaximumHR
## [1] -0.003431832Now lets suppose you want to regress both blood pressure and
cholesterol on all other variables in the data set. The
each_ arguments allow you to control whether the column
sets are entered into the function individually as vectors, or together
in a single data frame. The former is the default, so you’ll need to use
this argument here:
heart_disease %>%
  dish(
    f = function(y, x) lm(y ~ ., data = x),
    left = c(BP, Cholesterol),
    each_right = FALSE
  )## $BP
## 
## Call:
## lm(formula = y ~ ., data = x)
## 
## Coefficients:
##               (Intercept)                        Age  
##                  97.09264                    0.50748  
##                   SexMale   ChestPainAtypical angina  
##                  -3.95940                   -9.84788  
## ChestPainNon-anginal pain      ChestPainAsymptomatic  
##                  -9.56350                  -10.11788  
##            BloodSugarTRUE                  MaximumHR  
##                   6.70106                    0.09688  
##  ExerciseInducedAnginaYes            HeartDiseaseYes  
##                   1.72906                    6.00819  
## 
## 
## $Cholesterol
## 
## Call:
## lm(formula = y ~ ., data = x)
## 
## Coefficients:
##               (Intercept)                        Age  
##                  127.5108                     1.2471  
##                   SexMale   ChestPainAtypical angina  
##                  -23.6185                     8.8270  
## ChestPainNon-anginal pain      ChestPainAsymptomatic  
##                    5.7660                     8.0778  
##            BloodSugarTRUE                  MaximumHR  
##                   -0.7327                     0.3491  
##  ExerciseInducedAnginaYes            HeartDiseaseYes  
##                    7.8039                    12.5303stratiply() streamlines the familiar
split() then purrr::map() approach by making
it simple to select the stratification columns, and apply a function to
each subset. Let’s run a multiple logistic regression model using heart
disease as the outcome, stratified by sex:
heart_disease %>%
  stratiply(
    f = glm,
    by = Sex,
    formula = HeartDisease ~ . -ChestPain,
    family = "binomial"
  )## $Female
## 
## Call:  .f(formula = ..1, family = "binomial", data = .x[[i]])
## 
## Coefficients:
##              (Intercept)                       Age                        BP  
##                 -6.95485                   0.03362                   0.04366  
##              Cholesterol            BloodSugarTRUE                 MaximumHR  
##                  0.00354                   0.47028                  -0.02465  
## ExerciseInducedAnginaYes  
##                  2.12039  
## 
## Degrees of Freedom: 96 Total (i.e. Null);  90 Residual
## Null Deviance:       110.7 
## Residual Deviance: 76.21     AIC: 90.21
## 
## $Male
## 
## Call:  .f(formula = ..1, family = "binomial", data = .x[[i]])
## 
## Coefficients:
##              (Intercept)                       Age                        BP  
##                 1.877142                  0.024853                  0.007710  
##              Cholesterol            BloodSugarTRUE                 MaximumHR  
##                 0.008718                 -0.390310                 -0.042231  
## ExerciseInducedAnginaYes  
##                 1.106387  
## 
## Degrees of Freedom: 205 Total (i.e. Null);  199 Residual
## Null Deviance:       283.2 
## Residual Deviance: 211.2     AIC: 225.2Adding multiple stratification columns will produce a deeper list:
heart_disease %>%
  stratiply(
    f = glm,
    by = c(Sex, ExerciseInducedAngina),
    formula = HeartDisease ~ . -ChestPain,
    family = "binomial"
  )## $Female
## $Female$No
## 
## Call:  .f(formula = ..1, family = "binomial", data = .x[[i]])
## 
## Coefficients:
##    (Intercept)             Age              BP     Cholesterol  BloodSugarTRUE  
##       -9.58123         0.01027         0.07213         0.00406         0.65807  
##      MaximumHR  
##       -0.02540  
## 
## Degrees of Freedom: 74 Total (i.e. Null);  69 Residual
## Null Deviance:       62.53 
## Residual Deviance: 49.93     AIC: 61.93
## 
## $Female$Yes
## 
## Call:  .f(formula = ..1, family = "binomial", data = .x[[i]])
## 
## Coefficients:
##    (Intercept)             Age              BP     Cholesterol  BloodSugarTRUE  
##       3.850121        0.049450        0.012214        0.001458        0.945153  
##      MaximumHR  
##      -0.056538  
## 
## Degrees of Freedom: 21 Total (i.e. Null);  16 Residual
## Null Deviance:       28.84 
## Residual Deviance: 23.36     AIC: 35.36
## 
## 
## $Male
## $Male$No
## 
## Call:  .f(formula = ..1, family = "binomial", data = .x[[i]])
## 
## Coefficients:
##    (Intercept)             Age              BP     Cholesterol  BloodSugarTRUE  
##       2.728309        0.053058       -0.003653        0.005387       -1.106214  
##      MaximumHR  
##      -0.042036  
## 
## Degrees of Freedom: 128 Total (i.e. Null);  123 Residual
## Null Deviance:       174 
## Residual Deviance: 142.3     AIC: 154.3
## 
## $Male$Yes
## 
## Call:  .f(formula = ..1, family = "binomial", data = .x[[i]])
## 
## Coefficients:
##    (Intercept)             Age              BP     Cholesterol  BloodSugarTRUE  
##        0.70070        -0.01348         0.03561         0.01342         1.09404  
##      MaximumHR  
##       -0.04531  
## 
## Degrees of Freedom: 76 Total (i.e. Null);  71 Residual
## Null Deviance:       75.94 
## Residual Deviance: 60.53     AIC: 72.53typly() allows you to apply a function to columns of a
data frame whose type inherits at least one (or none) of the specified
candidates. It is similar to purrr::map_if(), but uses
methods::is() for determining types (see
some_type()), and only returns the results for the selected
columns:
## $Age
## [1] 54.43894
## 
## $BP
## [1] 131.6898
## 
## $Cholesterol
## [1] 246.6931
## 
## $BloodSugar
## [1] 0.1485149
## 
## $MaximumHR
## [1] 149.6073You can use the negated argument to apply the function
to columns that are not any of the listed types:
heart_disease %>%
  typly(
    f = table,
    types = c("numeric", "logical"),
    negated = TRUE,
    useNA = "always"
  )## $Sex
## 
## Female   Male   <NA> 
##     97    206      0 
## 
## $ChestPain
## 
##   Typical angina  Atypical angina Non-anginal pain     Asymptomatic 
##               23               50               86              144 
##             <NA> 
##                0 
## 
## $ExerciseInducedAngina
## 
##   No  Yes <NA> 
##  204   99    0 
## 
## $HeartDisease
## 
##   No  Yes <NA> 
##  164  139    0absorb() allows you to create collections of strings
that use keys as placeholders, which are then populated by the values.
This is useful for setting up results of your analysis to be presented
in a specific format that is independent of the data set at hand. Here’s
a simple example:
absorb(
  key = c("mean", "sd", "var"),
  value = c("10", "2", "4"),
  text = 
    c(
      "MEAN: mean, SD: sd",
      "VAR: var = sd^2",
      MEAN = "mean"
    )
)##                                                  MEAN 
## "MEAN: 10, SD: 2"    "VAR: 4 = 2^2"              "10"The input text is scanned to look for the keys
(interpreted as regular expressions) and are replaced with the
corresponding values.
Let’s look at a more useful example. Suppose we have a collection summary statistics for patient age by angina and heart disease status:
age_stats <-
  heart_disease %>%
  dplyr::group_by(
    ExerciseInducedAngina,
    HeartDisease
  ) %>%
  dplyr::summarise(
    mean = round(mean(Age), 2),
    sd = round(sd(Age), 2),
    median = median(Age),
    iqr = IQR(Age),
    .groups = "drop"
  ) %>%
  tidyr::gather(
    key = "key",
    value = "value",
    -ExerciseInducedAngina,
    -HeartDisease
  )
age_stats## # A tibble: 16 × 4
##    ExerciseInducedAngina HeartDisease key    value
##    <fct>                 <fct>        <chr>  <dbl>
##  1 No                    No           mean   52.4 
##  2 No                    Yes          mean   57.1 
##  3 Yes                   No           mean   53.6 
##  4 Yes                   Yes          mean   56.2 
##  5 No                    No           sd      9.68
##  6 No                    Yes          sd      7.56
##  7 Yes                   No           sd      8.54
##  8 Yes                   Yes          sd      8.27
##  9 No                    No           median 52   
## 10 No                    Yes          median 58   
## 11 Yes                   No           median 53   
## 12 Yes                   Yes          median 57   
## 13 No                    No           iqr    15   
## 14 No                    Yes          iqr    10   
## 15 Yes                   No           iqr    12.5 
## 16 Yes                   Yes          iqr     8.25Next, you can specify the string format you want to nicely summarize the information in each group:
age_stats %>%
  dplyr::group_by(
    ExerciseInducedAngina,
    HeartDisease
  ) %>%
  dplyr::summarise(
    Summary = 
      absorb(
        key = key,
        value = value,
        text = 
          c(
            "mean (sd) | median (iqr)"
          )
      ),
    .groups = "drop"
  ) ## # A tibble: 4 × 3
##   ExerciseInducedAngina HeartDisease Summary                 
##   <fct>                 <fct>        <chr>                   
## 1 No                    No           52.42 (9.68) | 52 (15)  
## 2 No                    Yes          57.1 (7.56) | 58 (10)   
## 3 Yes                   No           53.61 (8.54) | 53 (12.5)
## 4 Yes                   Yes          56.24 (8.27) | 57 (8.25)Of course, this is much more useful when the summary data is already
in the format of age_stats.
In the case where the key pattern found in the string template is repeated, its values are collapsed into a concatenated string:
## [1] "(52.42_57.1_53.61_56.24) / (9.68_7.56_8.54_8.27)"
## [2] "52_58_53_57"The sep argument can be used to customize how values are
separated.
It may be useful in some cases to evaluate the resulting string as an expression. You can setup the prior example so that the result contains valid expressions:
absorb(
  key = age_stats$key,
  value = age_stats$value,
  text = c("(mean) / (sd)", "median"),
  sep = "+",
  evaluate = TRUE,
  trace = TRUE
)## (mean) / (sd) median 
## (mean) / (sd) median 
## (52.42+57.1+53.61+56.24) / (sd) median 
## (52.42+57.1+53.61+56.24) / (sd) 52+58+53+57## [[1]]
## [1] 6.442584
## 
## [[2]]
## [1] 220These statistics are not entirely useful here, but you get the point.
depths() and depths_string() are used for
traversing a list structure to find the elements that satisfy a
predicate. The former produces a vector of the unique depths that
contain at least one element where predicate is true, and
the latter creates a string representation of the traversal with the
actual positions.
#Make a divided data frame
div_dat1 <-
  heart_disease %>%
  divide(
    Sex,
    HeartDisease
  )
#Find the depths of the data frames
div_dat1 %>%
  depths(
    predicate = is.data.frame
  )## [1] 2## [1] "{1{-1,-2},2{-1,-2}}"In the string result, the brackets represent the level of the list,
and the integers represent the index at that level, which are negative
if predicate is true for that element.
By default, the algorithm continues when
rlang::is_bare_list() so that the traversal can end with
list-like objects. However, the bare argument can be used
to traverse deeper into the list:
## [1] "{1{-1{1,2,3,4,5,6,7},-2{1,2,3,4,5,6,7}},2{-1{1,2,3,4,5,6,7},-2{1,2,3,4,5,6,7}}}"Now it also evaluated the columns of each data frame in the list, though it still found the correct positions where the predicate held.
muddle() can be used to randomly shuffle columns in a
data frame. This is a convenient way to remove confounding when
exploring the effects of a variable on an outcome.
## # A tibble: 303 × 9
##      Age Sex    ChestPain           BP Cholesterol BloodSugar MaximumHR
##    <dbl> <fct>  <fct>            <dbl>       <dbl> <lgl>          <dbl>
##  1    43 Female Non-anginal pain   130         214 FALSE            120
##  2    44 Male   Asymptomatic       130         204 FALSE            143
##  3    68 Female Asymptomatic       128         242 FALSE            125
##  4    35 Female Asymptomatic       120         269 FALSE            163
##  5    45 Female Atypical angina    138         240 FALSE            169
##  6    54 Male   Asymptomatic       128         209 FALSE            182
##  7    61 Male   Asymptomatic       100         258 FALSE            152
##  8    57 Female Asymptomatic       152         229 FALSE            142
##  9    67 Male   Non-anginal pain   110         283 FALSE            138
## 10    51 Female Atypical angina    125         204 TRUE             161
## # ℹ 293 more rows
## # ℹ 2 more variables: ExerciseInducedAngina <fct>, HeartDisease <fct>All columns are permuted by default. Use the at argument
to only shuffle certain ones:
## # A tibble: 303 × 9
##      Age Sex    ChestPain           BP Cholesterol BloodSugar MaximumHR
##    <dbl> <fct>  <fct>            <dbl>       <dbl> <lgl>          <dbl>
##  1    47 Male   Typical angina     145         233 TRUE             150
##  2    49 Male   Asymptomatic       160         286 FALSE            108
##  3    59 Male   Asymptomatic       120         229 FALSE            129
##  4    35 Male   Non-anginal pain   130         250 FALSE            187
##  5    44 Female Atypical angina    130         204 FALSE            172
##  6    29 Male   Atypical angina    120         236 FALSE            178
##  7    45 Female Asymptomatic       140         268 FALSE            160
##  8    59 Female Asymptomatic       120         354 FALSE            163
##  9    46 Male   Asymptomatic       130         254 FALSE            147
## 10    54 Male   Asymptomatic       140         203 TRUE             155
## # ℹ 293 more rows
## # ℹ 2 more variables: ExerciseInducedAngina <fct>, HeartDisease <fct>Additional arguments can be passed to sample() as well.
For example, you might want five random values from each column:
## # A tibble: 5 × 9
##     Age Sex    ChestPain           BP Cholesterol BloodSugar MaximumHR
##   <dbl> <fct>  <fct>            <dbl>       <dbl> <lgl>          <dbl>
## 1    54 Female Asymptomatic       136         275 TRUE             163
## 2    62 Male   Asymptomatic       125         271 FALSE            195
## 3    52 Male   Non-anginal pain   150         286 FALSE            105
## 4    64 Male   Asymptomatic       106         273 FALSE            163
## 5    54 Male   Atypical angina    120         199 FALSE            182
## # ℹ 2 more variables: ExerciseInducedAngina <fct>, HeartDisease <fct>stretch() is for spanning keys and values across the
columns. It is similar to tidyr::spread(), but allows for
any number of keys and values to be selected. It’s functionality is more
so targeted at setting up your results to be presented in a table,
rather than for general data manipulation. Thus, the resulting column
names and orderings are setup to make it a seamless transition.
To illustrate, lets first collect the parameter estimates and standard errors from multiple regression models using blood pressure and cholesterol as outcomes, stratified by sex:
mod_results <-
  heart_disease %>%
  
  #For each sex
  stratiply(
    by = Sex,
    f =
      ~.x %>%
      
      #Collect coefficients and standard errors for model estimates on multiple outcomes
      dish(
        f = function(y, x) {
          model <- lm(y ~ ., data = x)
          tibble::tibble(
            Parameter = names(model$coefficients),
            Estimate = model$coefficients,
            SE = sqrt(diag(vcov(model)))
          )
        },
        left = c(BP, Cholesterol),
        each_right = FALSE
      ) 
  ) %>%
  
  #Gather results into a single data frame
  fasten(
    into = c("Sex", "Outcome")
  )
mod_results## # A tibble: 36 × 5
##    Sex    Outcome     Parameter                 Estimate      SE
##    <chr>  <chr>       <chr>                        <dbl>   <dbl>
##  1 Female BP          (Intercept)                 93.7   24.7   
##  2 Female BP          Age                          0.533  0.210 
##  3 Female BP          ChestPainAtypical angina   -16.3    9.73  
##  4 Female BP          ChestPainNon-anginal pain  -15.3    9.15  
##  5 Female BP          ChestPainAsymptomatic      -14.2    9.64  
##  6 Female BP          BloodSugarTRUE               6.92   5.53  
##  7 Female BP          MaximumHR                    0.123  0.0993
##  8 Female BP          ExerciseInducedAnginaYes     8.00   4.89  
##  9 Female BP          HeartDiseaseYes             12.1    5.21  
## 10 Female Cholesterol (Intercept)                -12.1   91.6   
## # ℹ 26 more rowsNow lets span the estimates and standard errors across the columns for both outcomes:
## # A tibble: 18 × 6
##    Sex    Parameter      BP_Estimate   BP_SE Cholesterol_Estimate Cholesterol_SE
##    <chr>  <chr>                <dbl>   <dbl>                <dbl>          <dbl>
##  1 Female (Intercept)        93.7    24.7                 -12.1           91.6  
##  2 Female Age                 0.533   0.210                 2.30           0.779
##  3 Female BloodSugarTRUE      6.92    5.53                 24.6           20.6  
##  4 Female ChestPainAsym…    -14.2     9.64                 34.0           35.8  
##  5 Female ChestPainAtyp…    -16.3     9.73                 22.7           36.1  
##  6 Female ChestPainNon-…    -15.3     9.15                 34.1           34.0  
##  7 Female ExerciseInduc…      8.00    4.89                  8.14          18.2  
##  8 Female HeartDiseaseY…     12.1     5.21                  5.88          19.3  
##  9 Female MaximumHR           0.123   0.0993                0.720          0.369
## 10 Male   (Intercept)       105.     14.7                 151.            38.7  
## 11 Male   Age                 0.480   0.142                 0.765          0.374
## 12 Male   BloodSugarTRUE      4.96    3.12                 -6.85           8.23 
## 13 Male   ChestPainAsym…    -10.0     4.25                  3.38          11.2  
## 14 Male   ChestPainAtyp…     -8.61    4.68                  9.55          12.3  
## 15 Male   ChestPainNon-…     -6.99    4.31                 -0.802         11.4  
## 16 Male   ExerciseInduc…     -1.08    2.78                  6.88           7.33 
## 17 Male   HeartDiseaseY…      3.07    2.80                 13.4            7.39 
## 18 Male   MaximumHR           0.0392  0.0601                0.239          0.158The sep argument controls how the new column names are
concatenated.
Now suppose you wanted to also span sex across the columns. It can just be added to the keys:
## # A tibble: 9 × 9
##   Parameter               Female_BP_Estimate Female_BP_SE Female_Cholesterol_E…¹
##   <chr>                                <dbl>        <dbl>                  <dbl>
## 1 (Intercept)                         93.7        24.7                   -12.1  
## 2 Age                                  0.533       0.210                   2.30 
## 3 BloodSugarTRUE                       6.92        5.53                   24.6  
## 4 ChestPainAsymptomatic              -14.2         9.64                   34.0  
## 5 ChestPainAtypical angi…            -16.3         9.73                   22.7  
## 6 ChestPainNon-anginal p…            -15.3         9.15                   34.1  
## 7 ExerciseInducedAnginaY…              8.00        4.89                    8.14 
## 8 HeartDiseaseYes                     12.1         5.21                    5.88 
## 9 MaximumHR                            0.123       0.0993                  0.720
## # ℹ abbreviated name: ¹Female_Cholesterol_Estimate
## # ℹ 5 more variables: Female_Cholesterol_SE <dbl>, Male_BP_Estimate <dbl>,
## #   Male_BP_SE <dbl>, Male_Cholesterol_Estimate <dbl>,
## #   Male_Cholesterol_SE <dbl>Notice how the resulting columns are positioned. They are
sequentially sorted starting with the outer-most key in a hierarchical
manner. Since there are multiple value columns in this
example, their names are also appended to the result in the order in
which they were requested.
grable() stands for “graded table”. It is used to stack
headers into a knitr::kable() in an automated fashion by
iteratively parsing the column names by the specified delimiter. The
result of stretch() was purposely made to flow directly
into this function.
| Sex | Parameter | Estimate | SE | Estimate | SE | 
|---|---|---|---|---|---|
| Female | (Intercept) | 93.7187493 | 24.6536900 | -12.1297785 | 91.6044593 | 
| Female | Age | 0.5330329 | 0.2096016 | 2.2975419 | 0.7788061 | 
| Female | BloodSugarTRUE | 6.9178015 | 5.5329199 | 24.5796821 | 20.5583885 | 
| Female | ChestPainAsymptomatic | -14.1810178 | 9.6423236 | 34.0195680 | 35.8274902 | 
| Female | ChestPainAtypical angina | -16.3247037 | 9.7289213 | 22.6736038 | 36.1492568 | 
| Female | ChestPainNon-anginal pain | -15.3471449 | 9.1541813 | 34.1239627 | 34.0137249 | 
| Female | ExerciseInducedAnginaYes | 7.9952541 | 4.8947727 | 8.1371920 | 18.1872573 | 
| Female | HeartDiseaseYes | 12.0913966 | 5.2058240 | 5.8795266 | 19.3430149 | 
| Female | MaximumHR | 0.1226586 | 0.0992801 | 0.7201695 | 0.3688901 | 
| Male | (Intercept) | 105.2923383 | 14.6933205 | 150.9743256 | 38.7467039 | 
| Male | Age | 0.4800104 | 0.1417168 | 0.7654107 | 0.3737111 | 
| Male | BloodSugarTRUE | 4.9616363 | 3.1197639 | -6.8540853 | 8.2269061 | 
| Male | ChestPainAsymptomatic | -10.0474949 | 4.2547960 | 3.3795322 | 11.2200180 | 
| Male | ChestPainAtypical angina | -8.6148546 | 4.6798629 | 9.5540927 | 12.3409315 | 
| Male | ChestPainNon-anginal pain | -6.9875791 | 4.3056911 | -0.8015539 | 11.3542299 | 
| Male | ExerciseInducedAnginaYes | -1.0827067 | 2.7780388 | 6.8846691 | 7.3257672 | 
| Male | HeartDiseaseYes | 3.0740495 | 2.8029917 | 13.4020228 | 7.3915687 | 
| Male | MaximumHR | 0.0391572 | 0.0600935 | 0.2387579 | 0.1584683 | 
The at argument specifies which columns should be
spanned in the header. Let’s try it with the two-key example from
above:
| Parameter | Estimate | SE | Estimate | SE | Estimate | SE | Estimate | SE | 
|---|---|---|---|---|---|---|---|---|
| (Intercept) | 93.7187493 | 24.6536900 | -12.1297785 | 91.6044593 | 105.2923383 | 14.6933205 | 150.9743256 | 38.7467039 | 
| Age | 0.5330329 | 0.2096016 | 2.2975419 | 0.7788061 | 0.4800104 | 0.1417168 | 0.7654107 | 0.3737111 | 
| BloodSugarTRUE | 6.9178015 | 5.5329199 | 24.5796821 | 20.5583885 | 4.9616363 | 3.1197639 | -6.8540853 | 8.2269061 | 
| ChestPainAsymptomatic | -14.1810178 | 9.6423236 | 34.0195680 | 35.8274902 | -10.0474949 | 4.2547960 | 3.3795322 | 11.2200180 | 
| ChestPainAtypical angina | -16.3247037 | 9.7289213 | 22.6736038 | 36.1492568 | -8.6148546 | 4.6798629 | 9.5540927 | 12.3409315 | 
| ChestPainNon-anginal pain | -15.3471449 | 9.1541813 | 34.1239627 | 34.0137249 | -6.9875791 | 4.3056911 | -0.8015539 | 11.3542299 | 
| ExerciseInducedAnginaYes | 7.9952541 | 4.8947727 | 8.1371920 | 18.1872573 | -1.0827067 | 2.7780388 | 6.8846691 | 7.3257672 | 
| HeartDiseaseYes | 12.0913966 | 5.2058240 | 5.8795266 | 19.3430149 | 3.0740495 | 2.8029917 | 13.4020228 | 7.3915687 | 
| MaximumHR | 0.1226586 | 0.0992801 | 0.7201695 | 0.3688901 | 0.0391572 | 0.0600935 | 0.2387579 | 0.1584683 | 
The tables can then be piped through subsequent customized processing
with kableExtra
tools.
descriptives() produces a data frame in long format with
a collection of descriptive statistics.
## # A tibble: 111 × 9
##    fun_eval fun_key   col_ind col_lab    val_ind val_lab val_dbl val_chr val_cbn
##    <chr>    <chr>       <int> <chr>        <int> <chr>     <dbl> <chr>   <chr>  
##  1 all      available       1 Age              1 <NA>        303 <NA>    303    
##  2 all      available       2 Sex              1 <NA>        303 <NA>    303    
##  3 all      available       3 ChestPain        1 <NA>        303 <NA>    303    
##  4 all      available       4 BP               1 <NA>        303 <NA>    303    
##  5 all      available       5 Cholester…       1 <NA>        303 <NA>    303    
##  6 all      available       6 BloodSugar       1 <NA>        303 <NA>    303    
##  7 all      available       7 MaximumHR        1 <NA>        303 <NA>    303    
##  8 all      available       8 ExerciseI…       1 <NA>        303 <NA>    303    
##  9 all      available       9 HeartDise…       1 <NA>        303 <NA>    303    
## 10 all      class           1 Age              1 <NA>         NA numeric numeric
## # ℹ 101 more rowsunivariate_associations() is a special case of
dish() specifically for computing association metrics of
one or more responses with a collection of
predictors.
heart_disease %>%
  univariate_associations(
    f = list(AIC = function(y, x) AIC(lm(y ~ x))),
    responses = c(BP, Cholesterol)
  )## # A tibble: 14 × 3
##    response    predictor               AIC
##    <chr>       <chr>                 <dbl>
##  1 BP          Age                   2577.
##  2 BP          Sex                   2602.
##  3 BP          ChestPain             2598.
##  4 BP          BloodSugar            2593.
##  5 BP          MaximumHR             2602.
##  6 BP          ExerciseInducedAngina 2602.
##  7 BP          HeartDisease          2596.
##  8 Cholesterol Age                   3243.
##  9 Cholesterol Sex                   3244.
## 10 Cholesterol ChestPain             3259.
## 11 Cholesterol BloodSugar            3257.
## 12 Cholesterol MaximumHR             3257.
## 13 Cholesterol ExerciseInducedAngina 3256.
## 14 Cholesterol HeartDisease          3255.univariate_table() allows you to create a custom
descriptive table for a data set. It uses almost every function in the
package across its span of capabilities.
| Variable | Level | Summary | 
|---|---|---|
| Age | 56 (48, 61) | |
| Sex | Female | 97 (32.01%) | 
| Male | 206 (67.99%) | |
| ChestPain | Typical angina | 23 (7.59%) | 
| Atypical angina | 50 (16.5%) | |
| Non-anginal pain | 86 (28.38%) | |
| Asymptomatic | 144 (47.52%) | |
| BP | 130 (120, 140) | |
| Cholesterol | 241 (211, 275) | |
| MaximumHR | 153 (133.5, 166) | |
| ExerciseInducedAngina | No | 204 (67.33%) | 
| Yes | 99 (32.67%) | |
| HeartDisease | No | 164 (54.13%) | 
| Yes | 139 (45.87%) | 
See vignette("describe") for a more in-depth
introduction to these functions.