---
title: "Getting Started with xplainfi"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with xplainfi}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  chunk_output_type: console
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 8,
  fig.height = 6
)
set.seed(123)
# Quiet down
lgr::get_logger("mlr3")$set_threshold("warn")
options("xplain.progress" = interactive())
```

```{r setup}
library(xplainfi)
library(mlr3)
library(mlr3learners)
library(data.table)
library(ggplot2)
```

`xplainfi` provides feature importance methods for machine learning models.
It implements several approaches for measuring how much each feature contributes to model performance, with a focus on model-agnostic methods that work with any learner.

## Core Concepts

Feature importance methods in `xplainfi` address different but related questions:

- **How much does each feature contribute to model performance?** (Permutation Feature Importance)
- **What happens when we remove features and retrain?** (Leave-One-Covariate-Out)
- **How do features depend on each other?** (Conditional and Relative methods)

All methods share a common interface built on [mlr3](https://mlr3.mlr-org.com/), making them easy to use with any task, learner, measure, and resampling strategy.

The general pattern is to call `$compute()` to calculate importance (which _always re-computes_), then `$importance()` to retrieve the aggregated results, with intermediate results available in `$scores()` and, if the chosen measure supports it, `$obs_loss()`.

## Basic Example

Let's use the Friedman1 task to demonstrate feature importance methods with known ground truth:

```{r setup-problem}
task <- tgen("friedman1")$generate(n = 300)
learner <- lrn("regr.ranger", num.trees = 100)
measure <- msr("regr.mse")
resampling <- rsmp("cv", folds = 3)
```

The task has `r task$nrow` observations with `r length(task$feature_names)` features.
Features `important1` through `important5` truly affect the target, while `unimportant1` through `unimportant5` are pure noise.
We'll use a random forest learner with cross-validation for more stable estimates.

The target function is:

$y = 10 \cdot \sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + \epsilon$

## Permutation Feature Importance (PFI)

PFI is the most straightforward method: for each feature, we permute (shuffle) its values and measure how much model performance deteriorates.
More important features cause larger performance drops when shuffled.

```{r pfi-basic}
pfi <- PFI$new(
  task = task,
  learner = learner,
  measure = measure,
  resampling = resampling,
  n_repeats = 10
)

pfi$compute()
pfi$importance()
```

The `importance` column shows the performance difference when each feature is permuted.
Higher values indicate more important features.

For more stable estimates, we can use multiple permutation iterations per resampling fold with `n_repeats`, which is set to `30` by default for all methods.
Note that more repeats are always better here: there is no clear "good enough" value, but a small value such as `n_repeats = 1` will almost certainly yield unreliable results.
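
As a rough illustration of the permutation idea and of why repeated permutations help, the following base R sketch computes permutation importance by hand for a linear model on simulated data. It is not how `xplainfi` implements PFI: the data, the helper `permute_once()`, and the repeat count `n_rep` are all made up for this example.

```r
# Minimal base R sketch of permutation importance (illustration only,
# not xplainfi's implementation). All names here are hypothetical.
set.seed(1)
n <- 200
dat <- data.frame(x1 = runif(n), x2 = runif(n))
dat$y <- 5 * dat$x1 + rnorm(n, sd = 0.5) # x2 is pure noise

train <- dat[1:150, ]
test <- dat[151:200, ]
fit <- lm(y ~ x1 + x2, data = train)

# Mean squared error of a fitted model on a data set
mse <- function(model, data) mean((data$y - predict(model, newdata = data))^2)
baseline <- mse(fit, test)

# One importance estimate: shuffle a single feature in the test set and
# measure the increase in MSE relative to the unpermuted baseline.
permute_once <- function(feature) {
  shuffled <- test
  shuffled[[feature]] <- sample(shuffled[[feature]])
  mse(fit, shuffled) - baseline
}

# Repeat the permutation several times per feature (hypothetical n_rep);
# the result is a matrix with one column per feature.
n_rep <- 20
scores <- sapply(c("x1", "x2"), function(f) replicate(n_rep, permute_once(f)))

colMeans(scores) # averaged importance per feature
apply(scores, 2, range) # spread of the single-permutation estimates
```

Each row of `scores` is what a single permutation would have reported, and the spread across rows shows how much one shuffle can differ from the next. In `xplainfi`, the number of repetitions is set through `n_repeats`:
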

```{r pfi-parameters}
pfi_stable <- PFI$new(
  task = task,
  learner = learner,
  measure = measure,
  resampling = resampling,
  n_repeats = 50
)

pfi_stable$compute()
pfi_stable$importance()
```

To illustrate why this is important, we can take a look at the variability of PFI scores for feature `important2` within each resampling iteration using individual importance scores via `$scores()` (see below):

```{r pif-nrepeats}
pfi_stable$scores()[feature == "important2", ] |>
  ggplot(aes(y = importance, x = factor(iter_rsmp))) +
  geom_boxplot() +
  labs(
    title = "PFI variability within resampling iterations",
    subtitle = "Setting n_repeats higher improves PFI estimates",
    y = "PFI score (important2)",
    x = "Resampling iteration (3-fold CV)"
  ) +
  theme_minimal()
```

```{r pfi-scores-tmp, echo=FALSE}
pfi_important2_range = round(range(pfi_stable$scores()[feature == "important2", importance]), 2)
```

The aggregated importance score for this feature is approximately `r round(pfi_stable$importance()[feature == "important2", importance], 1)`, but across all resamplings the estimated PFI scores range from `r pfi_important2_range[1]` to `r pfi_important2_range[2]`, and with insufficient resampling or low `n_repeats`, we might have over- or underestimated the feature's PFI by some margin.

We can also use the ratio of performance scores instead of their difference for the importance calculation, meaning that an unimportant feature is now expected to get an importance score of 1 rather than 0:

```{r pfi-ratio}
pfi_stable$importance(relation = "ratio")
```

## Leave-One-Covariate-Out (LOCO)

LOCO measures importance by retraining the model without each feature and comparing performance to the full model.
This shows the contribution of each feature when all other features are present.
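
The refit-and-compare idea can be sketched in a few lines of base R. This is a simplified illustration on simulated data with a linear model and a single train/test split, not `xplainfi`'s implementation; the data and the helper `test_mse()` are hypothetical.

```r
# Minimal base R sketch of leave-one-covariate-out (illustration only,
# not xplainfi's implementation). All names here are hypothetical.
set.seed(1)
n <- 200
dat <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
dat$y <- 5 * dat$x1 + 2 * dat$x2 + rnorm(n, sd = 0.5) # x3 is pure noise

train <- dat[1:150, ]
test <- dat[151:200, ]
features <- c("x1", "x2", "x3")

# Fit a linear model on the given features and return its test MSE.
test_mse <- function(feats) {
  fit <- lm(reformulate(feats, response = "y"), data = train)
  mean((test$y - predict(fit, newdata = test))^2)
}

full_mse <- test_mse(features)

# Refit once per feature with that feature left out; the importance is
# the increase in test MSE relative to the full model.
loco_scores <- sapply(features, function(f) {
  test_mse(setdiff(features, f)) - full_mse
})
loco_scores
```

Note that every feature requires its own refit, which is where the computational cost of LOCO comes from. The `xplainfi` equivalent, using the task and learner defined earlier:
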

```{r loco-basic}
loco <- LOCO$new(
  task = task,
  learner = learner,
  measure = measure,
  resampling = resampling,
  n_repeats = 10
)

loco$compute()
loco$importance()
```

LOCO is computationally expensive because it requires retraining for each feature, but it has a clear interpretation: higher values mean a larger performance drop when the feature is removed.
However, it cannot distinguish between direct effects and indirect effects through correlated features.

## Feature Samplers

For advanced methods that account for feature dependencies, `xplainfi` provides different sampling strategies.
While PFI uses simple permutation (marginal sampling), conditional samplers can preserve feature relationships.

Let's demonstrate conditional sampling using adversarial random forests (ARF), which preserve relationships between features when sampling:

```{r samplers-demo}
arf_sampler <- ConditionalARFSampler$new(task)

sample_data <- task$data(rows = 1:5)
sample_data[, .(important1, important2)]
```

Now we'll conditionally sample the `important1` feature given the values of `important2` and `important3`:

```{r conditional-sampling}
sampled_conditional <- arf_sampler$sample_newdata(
  feature = "important1",
  newdata = sample_data,
  conditioning_set = c("important2", "important3")
)

sample_data[, .(important1, important2, important3)]
sampled_conditional[, .(important1, important2, important3)]
```

This conditional sampling is essential for methods like CFI and RFI that need to preserve feature dependencies.
See the [perturbation-importance article](https://mlr-org.github.io/xplainfi/articles/perturbation-importance.html) for detailed comparisons and `vignette("feature-samplers")` for more details on implemented samplers.

## Detailed Scoring Information

All methods store detailed scoring information from each resampling iteration for further analysis.
Let's examine the structure of PFI's detailed scores:

```{r detailed-scores}
pfi$scores() |>
  head(10) |>
  knitr::kable(digits = 4, caption = "Detailed PFI scores (first 10 rows)")
```

We can also summarize the scoring structure:

```{r scoring-summary}
pfi$scores()[, .(
  features = uniqueN(feature),
  resampling_folds = uniqueN(iter_rsmp),
  permutation_iters = uniqueN(iter_repeat),
  total_scores = .N
)]
```

So `$importance()` always returns the importances aggregated across resampling and permutation (or refitting) iterations, whereas `$scores()` returns the individual scores as calculated by the supplied `measure`, together with the corresponding importance, which by default is the difference of these scores.

Analogous to `$importance()`, you can also use `relation = "ratio"` here:

```{r detailed-scores-ratio}
pfi$scores(relation = "ratio") |>
  head(10) |>
  knitr::kable(digits = 4, caption = "PFI scores using the ratio (first 10 rows)")
```

## Observation-wise losses and importances

For methods where importances are calculated based on observation-level comparisons and with decomposable measures, we can also retrieve observation-level information with `$obs_loss()`, which works analogously to `$scores()` and `$importance()` but at an even more detailed level:

```{r pfi-obs-scores}
pfi$obs_loss()
```

Since we computed PFI using the mean squared error (`msr("regr.mse")`), we can use the associated `Measure$obs_loss()`, the squared error.
In the resulting table we see

- `loss_baseline`: The loss (squared error) for the baseline model before permutation
- `loss_post`: The loss for this observation after permutation (or in the case of `LOCO`, after refit)
- `obs_importance`: The difference (or ratio if `relation = "ratio"`) of the two losses

Note that not all measures have a `Measure$obs_loss()`:
Some measures like `msr("classif.auc")` are not decomposable, so observation-wise loss values are not available.
In other cases, the corresponding `obs_loss()` is just not yet implemented in [`mlr3measures`](https://CRAN.R-project.org/package=mlr3measures), but will likely be in the future.

## Statistical Inference

All importance methods support confidence intervals and p-values via the `ci_method` argument in `$importance()`.
Available approaches range from empirical quantiles and corrected t-tests (Nadeau & Bengio) for resampling-based variability, to observation-wise inference methods like CPI/cARFi (for `CFI`) and Lei et al. (2018) (for `LOCO`).
Multiplicity correction via `p_adjust` is supported for all methods that produce p-values.

For a comprehensive guide covering all inference methods, see the [Inference for Feature Importance](https://mlr-org.github.io/xplainfi/articles/inference.html) article.

## Using Pre-trained Learners

By default, `xplainfi` trains the learner internally via `mlr3::resample()`.
However, if you have already trained a learner (for example because training is expensive or you want to explain a specific model) you can pass it directly to perturbation-based methods (`PFI`, `CFI`, `RFI`) and `SAGE` methods.
Refit-based methods (`LOCO` / `WVIM`) require retraining by design and will warn if given a pretrained learner.

The only requirement is that the `resampling` must be instantiated and have exactly one iteration (i.e., a single test set).
This is necessary because a pre-trained learner corresponds to a single fitted model, and there is no meaningful way to associate it with multiple resampling folds.

A holdout resampling is the natural choice here.
We first train the learner on the train set and `PFI` will calculate importance using the trained learner and the corresponding test set defined by the `resampling`:

```{r pretrained-pfi}
resampling_holdout <- rsmp("holdout")$instantiate(task)
learner_trained <- lrn("regr.ranger", num.trees = 100)
learner_trained$train(task, row_ids = resampling_holdout$train_set(1))

pfi_pretrained <- PFI$new(
  task = task,
  learner = learner_trained,
  measure = measure,
  resampling = resampling_holdout,
  n_repeats = 10
)

pfi_pretrained$compute()
pfi_pretrained$importance()
```

A common real-world scenario is that the learner was trained on some dataset and you want to explain the model on entirely new, unseen data.
In that case, create a task from the new data (via `as_task_regr()` for example) and use `rsmp("custom")` to designate all rows as the test set.
The resampling here is purely a technicality used for internal consistency, and the train set is irrelevant since the learner is already trained.
A utility function `rsmp_all_test()` can be used as a shortcut to achieve the same goal.

```{r pretrained-custom}
# Simulate: learner was trained elsewhere, we have new data to use
new_data <- tgen("friedman1")$generate(n = 100)

# Same as rsmp_all_test(task)
resampling_custom <- rsmp("custom")$instantiate(
  new_data,
  train_sets = list(integer(0)),
  test_sets = list(new_data$row_ids)
)

pfi_newdata <- PFI$new(
  task = new_data,
  learner = learner_trained,
  measure = measure,
  resampling = resampling_custom,
  n_repeats = 10
)

pfi_newdata$compute()
pfi_newdata$importance()
```

If you pass a trained learner with a multi-fold or non-instantiated resampling, you will get an informative error at construction time:

```{r pretrained-error, error = TRUE}
PFI$new(
  task = task,
  learner = learner_trained,
  measure = measure,
  resampling = rsmp("cv", folds = 3)
)
```

## Parallelization

Both PFI/CFI/RFI and LOCO/WVIM support parallel execution to speed up computation when working with multiple features or expensive learners.
The parallelization follows mlr3's approach, allowing users to choose between `mirai` and `future` backends.

### Example with future

The `future` package provides a simple interface for parallel and distributed computing:

```{r parallel-future, eval = FALSE}
library(future)
plan("multisession", workers = 2)

# PFI with parallelization across features
pfi_parallel = PFI$new(
  task,
  learner = lrn("regr.ranger"),
  measure = msr("regr.mse"),
  n_repeats = 10
)

pfi_parallel$compute()
pfi_parallel$importance()

# LOCO with parallelization (uses mlr3fselect internally)
loco_parallel = LOCO$new(
  task,
  learner = lrn("regr.ranger"),
  measure = msr("regr.mse")
)

loco_parallel$compute()
loco_parallel$importance()
```

### Example with mirai

The `mirai` package offers a modern alternative for parallel computing:

```{r parallel-mirai, eval = FALSE}
library(mirai)
daemons(n = 2)

# Same PFI/LOCO code works with mirai backend
pfi_parallel = PFI$new(
  task,
  learner = lrn("regr.ranger"),
  measure = msr("regr.mse"),
  n_repeats = 10
)

pfi_parallel$compute()
pfi_parallel$importance()

# Clean up daemons when done
daemons(0)
```