---
title: "tidyCat: Tidy Methods For Categorical Data Analysis"
author: "Michael Friendly"
date: "`r Sys.Date()`"
package: vcdExtra
output:
rmarkdown::html_vignette:
fig_caption: yes
bibliography: ["vcd.bib", "vcdExtra.bib"]
csl: apa.csl
vignette: >
%\VignetteIndexEntry{tidyCat: Tidy Methods For Categorical Data Analysis}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
message = FALSE,
warning = FALSE,
fig.height = 6,
fig.width = 7,
# fig.path = "fig/tidycat-",
dev = "png",
comment = "##"
)
```
> Frequency tables need some Tidy Love ❤️
Tidy methods for quantitative data & models have advanced considerably, but there hasn't been much
development of similar ideas for "categorical data", by which I mean data that is often compactly represented
as $n$-way frequency tables, cross classified by one or more discrete factors.
What would it take to implement a tidy framework for such data? These notes are, in effect, a call for
participation in developing a `tidyCat` package for this purpose. Other possible names for this:
`tidyCDA`, `tidyfreq` but `tidyCat` makes for a nice logo!
```{r load-pkgs}
library(MASS)
library(vcdExtra)
```
I see three areas that could be developed here:
## Constructing categorical data sets
Current non-tidy data forms and operations, following [@FriendlyMeyer:2016:DDAR] are described in
the vignette [Creating and manipulating frequency tables](https://friendly.github.io/vcdExtra/articles/a1-creating.html)
It seems clear that the most flexible and general form, and one that most closely matches a tidy
data frame is **case form**, because this allows for numeric variables as well. Thus:
> Among these, tidy categorical data should best be represented as a `tibble` in **case form**. A tibble is convenient for its' printing.
### Manipulating categorical data sets
The methods `xtabs()`, `table()` and `expand.dft()` described in that vignette allow conversion from one form to another.
Tidy equivalents might be:
* `as_table()`, `as_matrix()`, `as_array()` to convert from any form to table/array form
* Similarly, perhaps `as_caseform()`, `as_freqform()` to convert to those. There is already `as.data.frame(table)` to convert to frequency form, and `expand.dft()` converts that to case form
```{r}
data("HairEyeColor")
hec.df <- as.data.frame(HairEyeColor)
head(hec.df)
# expand to case form
expand.dft(hec.df) |> head()
```
* `vcd::structable()` (and `stats::ftable()`) produce a 'flat' representation of a high-dimensional contingency table constructed by recursive splits (similar to the construction of mosaic displays). One can be constructed from a table or from a data frame with
a formula method,
```{r}
structable(Titanic)
structable(Sex + Class ~ Survived + Age, data = Titanic)
```
and there are a suite of methods for indexing and selecting parts of an $n$-way table.
* The methods in the `plyr` package (now retired) provided a coherent set of tools for a split-apply-combine
strategy that works nicely with multidimensional arrays. Perhaps there are some useful ideas for frequency tables
that could be resurrected here.
* There is also a role for `purrr` methods and thinking here: $n$-way tables as nested lists/arrays?
The ideas of mapping over these?
## Manipulating factor levels
Also needed:
* methods for **recoding and collapsing** the levels of a factor: `forcats::fct_recode()`, `forcats::fct_collapse()`, `forcats::fct_lump_min()` are useful here.
* methods for **reordering the levels** of a factor, either manually or for some analysis purpose. For example, Data from @Glass:54 gave this 5 x 5 table on the occupations of 3500 British fathers and their sons, where the occupational categories are listed in alphabetic order.
```{r glass}
data(Glass, package="vcdExtra")
str(Glass)
(glass.tab <- xtabs(Freq ~ father + son, data=Glass))
```
This can be reordered manually by indexing, to arrange the categories by **status**, giving an order `Professional` down to `Unskilled`:
```{r glass-order}
# reorder by status
ord <- c(2, 1, 4, 3, 5)
glass.tab[ord, ord]
```
A more general method is to permute the row and column categories in the order implied by correspondence analysis dimensions. This is implemented in the [`seriation` package](https://cran.r-project.org/package=seriation) using the `CA` method of `seriation::seriate()` and applying `permute()` to the result.
```{r housetasks-seriation}
library(seriation)
order <- seriate(glass.tab, method = "CA")
# the permuted row and column labels
rownames(glass.tab)[order[[1]]]
# reorder rows and columns
permute(glass.tab, order)
```
What are tidy ways to do these things?
## Models
The standard analysis of frequency data is in the form of loglinear models fit by `MASS::loglm()`
or with more flexible versions fit with `glm()` or `gnm::gnm()`. These are essentially linear models for the log
of frequency. In `vcdExtra`, there are several methods for a list of such models, of class `"glmlist"` and `"loglmlist"`,
and these should be accommodated in tidy methods.
What is needed are `broom` methods for `loglm` models. The information required is accessible from standard functions, but not in a tidy form.
* `glance.loglm()` -- **model level** statistics. These are given in the output of the `print()` method, and available from the `print()` method. A complication is both LR and Pearson $\chi^2$ are reported, so these would need to be made to appear in separate columns. There are also related `LRstats()` functions in `vcdExtra`, which report `AIC` and `BIC`.
```{r}
hec.indep <- loglm(~Hair+Eye+Sex, data=HairEyeColor)
hec.indep
# extract test statistics
summary(hec.indep)$tests
LRstats(hec.indep)
```
* `tidy.loglm()` --- **coefficient level** statistics. These are available from `coef.loglm()`. They would need to assembled into a long format. Standard errors & p-values might be a problem.
```{r}
coef(hec.indep)
```
* `augment.loglm()` --- should give **case level** statistics: fitted values, residuals, ...
```{r}
fitted(hec.indep)
residuals(hec.indep)
```
What about `hatvalues`? Not implemented, but shouldn't be too hard.
```{r error=TRUE}
hatvalues(hec.indep)
```
## Graphical methods
The most common graphical methods are those implemented in `vcd`: `mosaic()` association plots (`assoc()`), ..., which
rely on `vcd::strucplot()` described in [@vcd:Meyer+Zeileis+Hornik:2006b].
Is there a tidy analog that might work with `ggplot2`? The [`ggmosaic` package](https://github.com/haleyjeppson/ggmosaic) implements basic marimeko-style mosaic plots. They are not very general, in that they cannot do residual-based shading to show the patterns of association.
However, they are based on a [productplots](https://github.com/hadley/productplots) package by Hadley, which seems to
provide some basic structure for constructing such displays of nested rectangles.
## Some tidy stuff
* The [infer package](https://infer.netlify.app/) has some interesting stuff for CDA.
The vignette [Tidy Chi-Squared Tests with infer](https://infer.netlify.app/articles/chi_squared)
expresses do a Chisq test of independences as:
```
observed_indep_statistic <- gss |>
specify(college ~ finrela) |>
hypothesize(null = "independence") |>
calculate(stat = "Chisq")
```
You can do simulation tests like this:
```
null_dist_sim <- gss |>
specify(college ~ finrela) |>
hypothesize(null = "independence") |>
generate(reps = 1000, type = "permute") |>
calculate(stat = "Chisq")
```
## References