--- title: "ale package handling of various input datatypes" author: "Chitu Okoli" date: "August 27, 2025" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{ale package handling of various input datatypes} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r knitr, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` This vignette demonstrates how `{ale}` works for various input datatypes. You should first read the [introductory vignette](ale-intro.html "Introduction to the ale package") that explains general functionality of the package; this vignette is a demonstration of specific functionality. We begin by loading the necessary libraries. ```{r load libraries} library(ale) library(dplyr) ``` ## `var_cars`: modified `mtcars` dataset (Motor Trend Car Road Tests) For this demonstration, we use a modified version of the built-in `mtcars` dataset so that it has binary (logical), categorical (factor, that is, non-ordered categories), ordinal (ordered factor), discrete interval (integer), and continuous interval (numeric or double) values. This modified version, called `var_cars`, will let us test all the different basic variations of x variables. For the factor, it adds the country of the car manufacturer. The data is a tibble with 32 observations on 12 variables: | Variable | Format | Description | |----------|---------|------------------------------------------| | mpg | double | Miles/(US) gallon | | cyl | integer | Number of cylinders | | disp | double | Displacement (cu.in.) | | hp | double | Gross horsepower | | drat | double | Rear axle ratio | | wt | double | Weight (1000 lbs) | | qsec | double | 1/4 mile time | | vs | logical | Engine (0 = V-shaped, 1 = straight) | | am | logical | Transmission (0 = automatic, 1 = manual) | | gear | ordered | Number of forward gears | | carb | integer | Number of carburetors | | country | factor | Country of car manufacturer | ```{r print var_cars} print(var_cars) ``` ```{r var_cars summary} summary(var_cars) ``` ## Modelling with ALE and GAM With GAM, only numeric variables can be smoothed, not binary or categorical ones. However, smoothing does not always help improve the model since some variables are not related to the outcome and some that are related actually do have a simple linear relationship. To keep this demonstration simple, we have done some earlier analysis (not shown here) that determines where smoothing is worthwhile on the modified `var_cars` dataset, so only some of the numeric variables are smoothed. Our goal here is not to demonstrate the best modelling procedure but rather to demonstrate the flexibility of the `{ale}` package. ```{r gam_cars} gam_cars <- mgcv::gam( mpg ~ cyl + disp + hp + drat + wt + s(qsec) + vs + am + gear + carb + country, data = var_cars ) summary(gam_cars) ``` Now we generate ALE data from the `gam_cars` GAM model and plot it. ```{r ale_cars_1D, fig.width=7, fig.height=14} # # To generate the code, uncomment the following lines. # # For speed, this vignette loads a pre-created ALE object. # # For standard models like gam that store their data, # # there is no need to specify the data argument. # ale_cars_1D <- ALE(gam_cars) # saveRDS(ale_cars_1D, file.choose()) ale_cars_1D <- url('https://github.com/tripartio/ale/raw/main/download/ale_cars_1D.0.5.2.rds') |> readRDS() # Print all plots plot(ale_cars_1D) |> print(ncol = 2) ``` We can see that `ALE()` has no trouble modelling any of the datatypes in our sample (logical, factor, ordered, integer, or double). It plots line charts for the numeric predictors and column charts for everything else. The numeric predictors have rug plots that indicate in which ranges of the x (predictor) and y (`mpg`) values data actually exists in the dataset. This helps us to not over-interpret regions where data is sparse. Since column charts are on a discrete scale, they cannot display rug plots. Instead, the percentage of data represented by each column is displayed. We can also generate and plot the ALE data for all two-way interactions. ```{r ale_cars_2D, fig.width=7, fig.height=28} # # To generate the code, uncomment the following lines. # # For speed, this vignette loads a pre-created ALE object. # # For standard models like gam that store their data, # # there is no need to specify the data argument. # ale_cars_2D <- ALE( # gam_cars, # x_cols = list(d2 = TRUE) # ) # saveRDS(ale_cars_2D, file.choose()) ale_cars_2D <- url('https://github.com/tripartio/ale/raw/main/download/ale_cars_2D.0.5.2.rds') |> readRDS() # Print plots plot(ale_cars_2D) |> print( ncol = 2, # By default, at most 20 plots are printed. Set max_print to increase this limit max_print = 100 ) ``` There are no interactions in this model but the point of this demonstration is to show that the `ale` package can handle 2D interactions between just about any pair of interaction types: numeric-numeric, ordinal-binary, categorical-ordinal, etc. Finally, as explained in the vignette on modelling with [small datasets](ale-small-datasets.html "ale package for small datasets"), a more appropriate modelling workflow would require bootstrapping the entire model, not just the ALE data. So, let's do that now. ```{r cars_full, fig.width=7, fig.height=14} # # To generate the code, uncomment the following lines. # # For speed, this vignette loads a pre-created ModelBoot object. # # For standard models like lm that store their data, # # there is no need to specify the data argument. # # 100 bootstrap iterations by default. # mb_cars <- ModelBoot(gam_cars) # saveRDS(mb_cars, file.choose()) mb_cars <- url('https://github.com/tripartio/ale/raw/main/download/mb_cars.0.5.2.rds') |> readRDS() plot(mb_cars) |> print(ncol = 2) ``` (By default, a `ModelBoot` object creates 100 bootstrap samples but, so that this illustration runs faster, we demonstrate it here with only 10 iterations.) With such a small dataset, the bootstrap confidence interval always overlap the median, indicating that this dataset cannot support any claims that any of its variables has a meaningful effect on fuel efficiency (mpg). Considering that the average bootstrapped ALE values suggest various intriguing patterns, the problem is no doubt that the dataset is too small--if more data were collected and analyzed, some of the patterns would probably be confirmed.