---
title: "Models for Rater Agreement and Reliability"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Models for Rater Agreement and Reliability}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(ordinalTables)
```


# Models for Rater Agreement and Reliability


## What's Unique about Rater Agreement?

From one point of view, nothing.  Rater agreement studies tend to produce r X r tables, where r is the number of rating categories.  As such, the models that test marginal homogeneity and symmetry are quite relevant, asking does one rater score similarly (or higher) than another.  However, rater agreement data are different in that there are usually a surplus of responses on the diagonal, indicating that the raters agree.  Often it makes sense to explicitly model this excess, and the models in this vignette do that.

Another feature of these models is that they are nearly all regular log-linear models.  As such they are less interesting, in that they can be fit with standard log-linear estimation proceudres, including glm() with Poisson family and "log" as the link function.  Where things can be a bit challenging is that one has to explicitly form the design matrix.  For some designs this is a bit of a puzzle, and there are some utility routines to help in the specification.

Much of the modeling in this section comes from chapter 2 of von Eye, A. & Mun, E. Y. (2005), Analyzing rater agreement: Manifest variable methods. Mahwah, NJ: Lawrence Erlbaum.


## The Data

This vignette will use a different source of data from Shoukri (2005), p. 80.  It is agreement study of two clinicians evaluating whether dogs were dehydrated.  The lowest score indicates normal, and the highest score indicates dehydrated (above 10%).

``` {r dogs}
print(dogs)
```


## The basic main effects model

The baseline model for rater modeling is the main effect model, which specifies an intercept and a row (rater 1) effect and a column (rater 2 effect), but no interaction terms.  This is the independence model that underlies Cohen's kappa for nominal items:
             p_a - p_c
    kappa = -----------
              1 - p_c
```{r kappa_dogs}
dogs_kappa <- kappa(dogs)
print(dogs_kappa)
```

p_c is taken as p_i+ times p_+j, which is the model prediction for all cells in the main effects model.  There is also a weighted version of kappa available, weighted_kappa(n, w) where w is the matrix of weights.  The unweighted kappa for the dogs data is `r kappa(dogs)`, indicating good agreement betweeen the two clinicians.

There are two models that specifically include kappa as part of the model.  The first is by Agresti, 1989, and the second is by Schuster, 2001. The Agresti model is described as "symmetry plus quasi-independence."  It assumes marginal homogeneity, as well as assuming that quasi-symmetry holds.

The Agresti kappa model is available as:
```{r agresti_kappa}
result <- Agresti_kappa_agreement(dogs)
```
The result is a list with several elements.  The first is kappa, `r result$kappa`, which is quite similar to the observed value.  Estimates of the marginal parameters (the moodel assumes that these fit to both raters) are also returned, `r result$pi_margin`. The Pearson chi-squared statistic of `r result$chisq` on `result$df` degrees of freedom indicates that the model does not fit well to this data.

The Schuster model also assumes marginal homogeneity. It incorporates an underlying symmetry matrix as well. The model is much harder to fit acceptably, as reflected in the amount of time and number of iterations that the model runs. It does not yield a plausible result for the dogs data (not shown: the log(likelihood) decreased, the estimated kappa was negative and the residuals were remarkably large for the highest category), so we apply it to the vision_data.

```{r schuster_kappa}
result <- Schuster_symmetric_rater_agreement_model(vision_data)
```
The iterations are slow (about 1 sec per 100 iterations). It runs for the full 10,000 iterations specified by default, in spite of using the Newton-Raphson method (it is likely that the performance of this function will be improved in future versions of the package).  At every 100 cycles the function outputs the iteration number, log(likelihood), G^2 and X^2 and the convergence value, whihc is the relative improvement in log(likelihood) value.  For the vision data the observed kappa is `r kappa(vision_data)`.  The model's final estimate is close to this, `r result$kappa`.  It is a bit hard to evaluate the model fit, given that the iterations did not converge, but the G^2 values that were printed out indicate very poor fit to the data (the X^2 values were even more extreme). Running to convergence of 1.0e-9 requires 244,503 iterations.  The final kappa estimate of 0.5953344 is very close to the observed kappa of `r kappa(vision_data)`. The final model yields a G^2 of 5409.781 on 9 degrees of freedom.

We note that the Agresti model:

```{r agresti_kappa_vision}
result <- Agresti_kappa_agreement(vision_data)
```
runs substantially more quickly and fits this data set substantially better, with G^2 of `r result$g_squared` on `r result$df` degrees of freedom, although it does not fit in an absolute sense.

To show that the kappa-based models can fit, we look at the budget_actual data.  Fitting this model:

```{r schuster_kappa_budget}
s_result <- Schuster_symmetric_rater_agreement_model(budget_actual)
```

yields acceptable fit (G^2 = `r s_result$g_squared` on `r s_result$df` degrees of freedom), as does fitting the Agresti kappa model:

```{r agresti_kappa_budget}
a_result <- Agresti_kappa_agreement(budget_actual)
```

which gives a G^2 of `r a_result$g_squared` on `r a_result$df` degrees of freedom.  Again, the two kappa estimates, `r s_result$kappa` and `r a_result$kappa` are not that different than the observed kappa of `r kappa(budget_actual)`. 

It is worth noting that symmetry (which is assumed by both kappa models) fits the budget_actual data: 

``` {r stuart_budget}
stuart_result <- Stuart_marginal_homogeneity(budget_actual)
```

yields a X^2 of `r stuart_result$chisq` on `r stuart_result$df` degrees of freedom, indicating that symmetry is plausible for this data.


In the final analysis, neither of these specialized kappa models does a good job of describing the vision data.  Both models assume marginal homogeneity, and we know from the vignette "Checking Whether Margins are (Stochastically) Ordered" that marginal homogeneity can be rejected for the vision data.  When marginal homogeneity was satisfied in the budget_actual data, both models yielded acceptable fit. It appears that for these models to fit, marginal homogeneity must be satisfied.  Conversely, to fit a wide variety of rater data sets acceptably, it appears that a (kappa-based) model must allow for heterogeneous margins.


# Regular Log-linear Models

For the fitting of general log-linear models, the function fit_log_Linear(data, design_matrix) is provided.  It uses the glm() function, with family set to poisson(link="log"),  This generates results that match various published results of specific log-linear models in the
literature.

To aid in fitting these models, several design matrices are available with names beginning "log_linear".  There are methods:

* create a basic design matrix for the independent rows and columns model (main_effects_design)
* create a matrix with a single diagonal parameter delta (equal_weight_agreement_design)
* add all diagonals terms for a model with a separate kappa per category
* create a linear-by-linear term 
* append a new column (such as the linear-by-linear) to a design matrix
* remove a column from a design matrix
* convert a matrix of observed counts into a vector that aligns with the design matrices

These features are intended to aid in fitting additional models not (yet) implemented in this package.

## Main Effects Model

```{r main_effects}
result <- von_Eye_main_effect(dogs)
```

This line creates the design matrix using model.matrix and specifying contrast coding.  The result is a list.  The first element is beta `r result$beta`, the vector of regression parameters. Standard errors of beta are given in `r result$se`. The expected counts are returned as `r result$expected` is a vector containing expected counts.  The X^2 is returned `r result$chisq`, as well as G^2 `r result$g_squared` and the degrees of freedom `r result$df`.  Finally, the covariance matrix for the regression parameters is returned as the cov member of the list.  The model obviously does not fit these data.  Examining the residuals `r dogs - result$expected` reveals very large, positive residuals where the diagonal is being under-predicted, and negative entries elsewhere, where the model over-predicts.


## The Weight by Response Category model

This model adds to the main effect model by placing specific elements for each of the diagonal terms, delta_ii.  The weights applied to each diagonal cell are equal, and only a single parameter is fit for the table.

```{r weight_response}
result2 <- von_Eye_equal_weighted_diagonal(dogs)
```

While it still does not fit, this model is a noticeable improvement over the previous one.  The reduction in G^2 is `r result$g_squared - result2$g_squared` on 1 degree of freedom.

From a substantive point of view, it may be that certain agreements are more important than others.  For example, in a medical setting it may be that the extreme two categories with scores of 1 and 4 are more important than disagreement between the middle two categories.  So, one might assign weights of 1 to the 1,1 cell and the 4,4 cell, and weights of 3 to the middle 2,2 and 3,3 cells.  To fit this model, the first step is to create the vector of weights, and then modify the design matrix to apply the weights to the cells.

```{r weights}
w <- c(3, 1, 1, 3)
x <- log_linear_main_effect_design(dogs)
x_prime <- log_linear_add_all_diagonals(dogs, x)
x_prime_prime <- von_Eye_weight_by_response_category_design(dogs, x_prime, w)
result3 <- log_linear_fit(dogs, x_prime_prime)
```

Because the weights are specified a priori, the degrees of freedom for this model are the same as for the equal weight agreement design.  The G^2 is improved, decreasing by `r result3$g_squared - result2$g_squared`.  The overall fit, is marginal, with a G^2 of `r result3$g_squared` on `r result$df` degrees of freedom, with a p-value of `r pchisq(result3$g_squared, result3$df, lower.tail=FALSE)`.


## Unequal Weights on the Diagonal

A less constrained model allows the parameters for the diagonal to be estimated.

``` {r all_diagonals}
result4 <- von_Eye_diagonal(dogs)
```

This fits better than the weight_by response model, with a G^2 of `r result4$g_squared` on `r result4$df` degrees of freedom.  However, it still demonstrates a significant lack of fit.


## Agresti's Model for Ordinal Agreeement

Agresti (1988) suggested a modification to the last model that includes a linear-by-linear component.  This can be generated using log_linear_create_linear_by_linear(dogs). This creates a linear-by-linear predictor to add to the model, using the integer scores. The function also takes an optional parameter "centered", which indicates whether the linear terms should be centered by having the mean subtracted from each entry before they are multiplied together.  It returns a vector containing the new predictor.  This can be added to an existing design matrix using log_linear_append_column().  This takes the existing design matrix and the vector x_new to be added, and returns a new design matrix.  There is an optional third parameter that specifies what column x_new should reside in the new design matrix.  This defaults to placing the column on the far right.

``` {r add linear_by_linear}
linear <- log_linear_create_linear_by_linear(dogs, centered=TRUE)
print(linear)
x <- log_linear_main_effect_design(dogs)
x_new <- log_linear_append_column(x, linear)
print(x_new)
result5 <- log_linear_fit(dogs, x_new)
```

This model can also be specified directly:
``` {r von_Eye_linear_by_linear}
result5b <- von_Eye_linear_by_linear(dogs)
```

Just including the linear-by-linear term does not fit, although the decrease in G^2 is large, `r result5$g_squared -result2$g_squared`.  Adding the diagonal terms:

```{r agresti_rating_model}
result6 <- von_Eye_diagonal_linear_by_linear(dogs)
```

provides a large improvement in fit, `r result5$g_squared - result6$g_squared` on `r result5$df - result6$df`.  This model's fit, `r result6$g_squared` on `r result6$df` degrees of freedom is non-significant, indicating an acceptable fit.

One question is whether the separate diagonal terms are needed, or whether they can be constrained to be equal.  Fitting this model is straightforward with the tools in this vignette.

``` {r final_model}
x <- log_linear_equal_weight_agreement_design(dogs)
linear <- log_linear_create_linear_by_linear(dogs, centered=TRUE)
x_final <- log_linear_append_column(x, linear)
result8 <- log_linear_fit(dogs, x_final)
```

The final result has a G^2 of `r result8$g_squared` on `r result8$df` indicating acceptable fit.  This model is more parsimonious than the previous one, while still demonstrating acceptable fit.

## References
Agresti, A. (1988). A model for agreement between ratings on an ordinal scale.  Biometrics, 44(2), 539-548.
Agresti, A. (1989). An agreement model with kappa as parameter.  Statistics and Probability Letters, 7, 271-273.
Schuster, C. (2001). Kappa as a parameter of a symmetry model for rater agreement.  Journal of Educational and Behavioral Statistics, 26(3), 331-342.
Shoukri, M. M. (2005). Measures of interobserver agreement. New York: Chapman & Hall.
von Eye, A. & Mun, K. Y. (2005). Analyzing rater agreement: Manifest variable methods.  Mahwah, NJ: Lawrence Erlbaum.