---
title: "Revisiting Whittaker-Henderson Smoothing"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Revisiting Whittaker-Henderson Smoothing}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: biblio.bib
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "",
  fig.width = 10, 
  fig.asp = 0.618,
  fig.align = "center",
  out.width = "100%"
)

library(WH)
```

## What is Whittaker-Henderson smoothing ?

### Origin

Whittaker-Henderson (WH) smoothing is a graduation method designed to mitigate the effects of sampling fluctuations in a vector of evenly spaced discrete observations. Although this method was originally proposed by @bohlmann1899ausgleichungsproblem, it is named after @whittaker1923new, who applied it to graduate mortality tables, and @henderson1924new, who popularized it among actuaries in the United States. The method was later extended to two dimensions by @knorr1984multidimensional. WH smoothing may be used to build experience tables for a broad spectrum of life insurance risks, such as mortality, disability, long-term care, lapse, mortgage default and unemployment.

### The one-dimensional case

Let $\mathbf{y}$ be a vector of observations and $\mathbf{w}$ a vector of positive weights, both of size $n$. The estimator associated with Whittaker-Henderson smoothing is given by:

$$
\hat{\mathbf{y}} = \underset{\boldsymbol{\theta}}{\text{argmin}}\{F(\mathbf{y},\mathbf{w},\boldsymbol{\theta}) + R_{\lambda,q}(\boldsymbol{\theta})\}
$$

where:

- $F(\mathbf{y},\mathbf{w},\boldsymbol{\theta}) = \underset{i = 1}{\overset{n}{\sum}} w_i(y_i - \theta_i)^2$ represents a fidelity criterion with respect to the observations,

- $R_{\lambda,q}(\boldsymbol{\theta}) = \lambda \underset{i = 1}{\overset{n - q}{\sum}} (\Delta^q\boldsymbol{\theta})_i^2$ represents a smoothness criterion.

In the latter expression, $\lambda \ge 0$ is a smoothing parameter and $\Delta^q$ denotes the forward difference operator of order $q$, such that for any $i\in\{1,\dots,n - q\}$:

$$
(\Delta^q\boldsymbol{\theta})_i = \underset{k = 0}{\overset{q}{\sum}} \begin{pmatrix}q \\ k\end{pmatrix}(- 1)^{q - k} \theta_{i + k}.
$$

Define $W = \text{Diag}(\mathbf{w})$, the diagonal matrix of weights, and $D_{n,q}$ as the order $q$ difference matrix of dimensions $(n-q) \times n$, such that $(D_{n,q}\boldsymbol{\theta})_i = (\Delta^q\boldsymbol{\theta})_i$ for all $i \in [1, n-q]$. The first- and second-order difference matrices are given by:

$$
D_{n,1} = \begin{bmatrix}
-1 & 1 & 0 & \ldots & 0 \\
0 & -1 & 1 & \ldots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & \ldots & 0 & -1 & 1 \\
\end{bmatrix}
\quad\text{and}\quad
D_{n,2} = \begin{bmatrix}
1 & -2 & 1 & 0 & \ldots & 0 \\
0 & 1 & -2 & 1 & \ldots & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & \ldots & 0 & 1 & -2 & 1 \\
\end{bmatrix}.
$$

while higher-order difference matrices follow the recursive formula $D_{n,q} = D_{n - 1,q - 1}D_{n,1}$. The fidelity and smoothness criteria can be rewritten with matrix notations as:

$$
F(\mathbf{y},\mathbf{w},\boldsymbol{\theta}) = (\mathbf{y} - \boldsymbol{\theta})^TW(\mathbf{y} - \boldsymbol{\theta}) \quad \text{and} \quad R_{\lambda,q}(\boldsymbol{\theta}) = \lambda\boldsymbol{\theta}^TD_{n,q}^TD_{n,q}\boldsymbol{\theta}
$$

and the WH smoothing estimator thus becomes:

$$
\hat{\mathbf{y}} = \underset{\boldsymbol{\theta}}{\text{argmin}} \left\lbrace(\mathbf{y} - \boldsymbol{\theta})^TW(\mathbf{y} - \boldsymbol{\theta}) + \boldsymbol{\theta}^TP_{\lambda}\boldsymbol{\theta}\right\rbrace
$$

where $P_{\lambda} = \lambda D_{n,q}^TD_{n,q}$.

### The two-dimensional case

In the two-dimensional case, consider a matrix $Y$ of observations and a matrix $\Omega$ of non-negative weights, both of dimensions $n_x \times n_z$. The WH smoothing estimator solves:

$$
\widehat{Y} = \underset{\Theta}{\text{argmin}}\{F(Y,\Omega, \Theta) + R_{\lambda,q}(\Theta)\}
$$

where:

- $F(Y,\Omega, \Theta) = \sum_{i = 1}^{n_x}\sum_{j = 1}^{n_z} \Omega_{i,j}(Y_{i,j} - \Theta_{i,j})^2$ represents a fidelity criterion with respect to the observations,

- $R_{\lambda,q}(\Theta) = \lambda_x \sum_{j = 1}^{n_z}\sum_{i = 1}^{n_x - q_x} (\Delta^{q_x}\Theta_{\bullet,j})_i^2 + \lambda_z \sum_{i = 1}^{n_x}\sum_{j = 1}^{n_z - q_z} (\Delta^{q_z}\Theta_{i,\bullet})_j^2$ is a smoothness criterion with $\lambda = (\lambda_x,\lambda_z)$.

This latter criterion adds row-wise and column regularization criteria to $\Theta$, with respective orders $q_x$ and $q_z$, weighted by non-negative smoothing parameters $\lambda_x$ and $\lambda_z$. In matrix notation, let $\mathbf{y} = \textbf{vec}(Y)$, $\mathbf{w} = \textbf{vec}(\Omega)$, and $\boldsymbol{\theta} = \textbf{vec}(\Theta)$ as the vectors obtained by stacking the columns of the matrices $Y$, $\Omega$, and $\Theta$, respectively. Additionally, denote $W = \text{Diag}(\mathbf{w})$ and $n = n_x \times n_z$. The fidelity and smoothness criteria become:

$$
\begin{aligned}
F(\mathbf{y},\mathbf{w}, \boldsymbol{\theta}) &= (\mathbf{y} - \boldsymbol{\theta})^TW(\mathbf{y} - \boldsymbol{\theta}) \\
R_{\lambda,q}(\boldsymbol{\theta}) &= \boldsymbol{\theta}^{T}(\lambda_x I_{n_z} \otimes D_{n_x,q_x}^{T}D_{n_x,q_x} + \lambda_z D_{n_z,q_z}^{T}D_{n_z,q_z} \otimes I_{n_x}) \boldsymbol{\theta}.
\end{aligned}
$$

and the associated estimator takes the same form as in the one-dimensional case except 

$$P_{\lambda} = \lambda_x I_{n_z} \otimes D_{n_x,q_x}^{T}D_{n_x,q_x} + \lambda_z D_{n_z,q_z}^{T}D_{n_z,q_z} \otimes I_{n_x}.$$

### An explicit solution

If $W + P_{\lambda}$ is invertible, the WH smoothing equation admits the closed-form solution:

$$\hat{\mathbf{y}} = (W + P_{\lambda})^{-1}W\mathbf{y}.$$

Indeed, as a minimum, $\hat{\mathbf{y}}$ satisfies:

$$0 = \left.\frac{\partial}{\partial \boldsymbol{\theta}}\right|_{\hat{\mathbf{y}}}\left\lbrace(\mathbf{y} - \boldsymbol{\theta})^{T}W(\mathbf{y} - \boldsymbol{\theta}) + \boldsymbol{\theta}^{T}P_{\lambda}\boldsymbol{\theta}\right\rbrace = - 2 W(y - \hat{\mathbf{y}}) +2P_{\lambda} \hat{\mathbf{y}}.$$

It follows that $(W + P_{\lambda})\hat{\mathbf{y}} = W\mathbf{y}$, proving the above result. If $\lambda \neq 0$, $W + P_{\lambda}$ is invertible as long as $\mathbf{w}$ has $q$ non-zero elements in the one-dimensional case, and $\Omega$ has at least $q_x \times q_z$ non-zero elements spread across $q_x$ different rows and $q_z$ different columns in the two-dimensional case. These conditions are always met in real datasets.


## How to use the package?


The `WH` package features a unique main function `WH`. Two arguments are mandatory for this function:

* The vector (or matrix in the two-dimension case) `d` corresponding to the number of observed events of interest by age (or by age and duration in the two-dimension case). `d` should have named elements (or rows and columns) for the model results to be extrapolated.

* The vector (or matrix in the two-dimension case) `ec` corresponding to the portfolio central exposure by age (or by age and duration in the two-dimension case) whose dimensions should match those of `d`. The contribution of each individual to the portfolio central exposure corresponds to the time the individual was actually observed with corresponding age (and duration in the two-dimension case). It always ranges from 0 to 1 and is affected by individuals leaving the portfolio, no matter the cause, as well as censoring and truncating phenomena.

Additional arguments may be supplied, whose description is given in the documentation of the functions.

The package also embed two fictive agregated datasets to illustrate how to use it:

* `portfolio_mortality` contains the agregated number of deaths and associated central exposure by age for an annuity portfolio.

* `portfolio_LTC` contains the agregated number of deaths and associated central exposure by age and duration (in years) since the onset of LTC for the annuitant database of a long-term care portfolio.

```{r fit-1d}
# One-dimensional case
WH_1d_fit <- WH(portfolio_mort$d, portfolio_mort$ec)
```

```{r fit-2d}
# Two-dimensional case
WH_2d_fit <- WH(portfolio_LTC$d, portfolio_LTC$ec)
```

Function `WH` outputs objects of class `"WH_1d"` and `"WH_2d"` to which additional functions (including generic S3 methods) may be applied:

* The `print` function provides a glimpse of the fitted results

```{r print}
WH_1d_fit
WH_2d_fit
```

* The `plot` function generates rough plots of the model fit, the associated standard deviation, the model residuals or the associated degrees of freedom. See the `plot.WH_1d` and `plot.WH_2d` functions help for more details.

```{r plot}
plot(WH_1d_fit)
plot(WH_1d_fit, "res")
plot(WH_1d_fit, "edf")

plot(WH_2d_fit)
plot(WH_2d_fit, "std_y_hat")
```

* The `predict` function generates an extrapolation of the model. It requires a `newdata` argument, a named list with one or two elements corresponding to the positions of the new observations. In the two-dimension case constraints are used so that the predicted values matches the fitted values for the initial observations [see @carballo2021prediction to understand why this is required].

```{r predict}
WH_1d_fit |> predict(newdata = 40:99) |> plot()
WH_2d_fit |> predict(newdata = list(age = 60:109, duration = 0:19)) |> plot()
```

* The `vcov` may be used to retrieve the variance-covariance matrix of the model if necessary.

* Finally the `output_to_df` function converts an `"WH_1d"` or `"WH_2d"` object into a `data.frame`. Information about the fit is discarded in the process. This function may be useful to produce better visualizations from the data, for example using the ggplot2 package.

```{r}
WH_1d_df <- WH_1d_fit |> output_to_df()
WH_2d_df <- WH_2d_fit |> output_to_df()
```

## Further WH smoothing theory

See the upcoming paper available [here](https://hal.science/hal-04124043)

## References {-}