---
title: "Demonstrating Categorical Predictors in Discordant-Kinship Regressions"
author: Yoo Ri Hwang and S. Mason Garrison
bibliography: references.bib
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Demonstrating Categorical Predictors in Discordant-Kinship Regressions}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(rmarkdown.html_vignette.check_title = FALSE)
```

# Introduction

## Purpose

This vignette demonstrates how to incorporate categorical predictors in discordant-kinship regression analyses using the `discord` package.  Building on dyadic modeling strategies from Kenny et al. [@kenny2006] and extensions to kinship models in our prior work [@Hwang2022], we present several approaches for incorporating categorical variables into these specialized regression models.

We focus on:

1. How to code and interpret categorical variables at different levels (between-dyads vs. mixed)
2. Methods for treating categorical predictors (binary match vs. multi-match)
3. Practical implementation using the `discord` package

To illustrate these concepts, we examine whether sex and race predict socioeconomic status (SES) at age 40, using different categorical coding schemes applied to data from the 1979 National Longitudinal Survey of Youth (NLSY79).


## Understanding Variable Types in Dyadic Analysis

In dyadic analysis, categorical predictors can operate at different levels:

- **Between-dyad variables**: Variables that are constant within dyads.
  - Example: For full siblings, race is often a between-dyad variable.
- **Within-dyad variables**: Variables that vary within dyads but remain constant across dyads (rare for categorical variables).
  - Example: Division of chores between roommates (must sum to 100%)
- **Mixed variables**: Vary both within and across dyads
  - Example: Biological sex in non-MZ twin siblings; age; personality traits


We summarize these distinctions below:


| Variable Type | Definition | Examples | Analytic Implications |
|---------------|------------|----------|----------------------|
| Between-dyads | Members of the same pair have identical values | Race in same-race siblings; Length of marriage in couples | Simplifies analysis; Functions like individual-level variable |
| Within-dyads | Varies within pairs but constant across pairs | Division of chores between roommates (must sum to 100%) | Rare for categorical variables; Requires specialized handling |
| Mixed | Varies both within and across pairs | Sex in sibling pairs; Age; Personality traits | Most complex; Requires transformation to dyad-level variables |

For continuous predictors, researchers typically compute mean and difference scores. However, as noted by Kenny et al. [@kenny2006], these calculations are not meaningful for categorical variables and should instead be replaced with appropriate coding strategies.


## Coding Approaches for Categorical Variables

The `discord` package implements two primary coding strategies for categorical predictors:

1. **Binary Match Coding**: Creates a binary indicator of whether pairs match (1) or differ (0) on the categorical variable.


2. **Multi-Match Coding**: Retains specific category combinations

### Binary Match Coding

Binary match coding creates a simple indicator of whether pairs match (1) or differ (0) on a categorical variable:
- Same-sex pairs (male-male or female-female) → 1
- Mixed-sex pairs (male-female) → 0

**Use case**: When the research question focuses on similarity versus difference, rather than effects of specific categories.

### Multi-Match Coding

Multi-match coding retains specific category information:

- Male-male pairs → "MALE" 
- Female-female pairs → "FEMALE"
- Mixed pairs → "mixed"

**Use case**: When distinct effects for specific categories are hypothesized (e.g., male pairs vs. female pairs).

Following Hwang and Garrison [@Hwang2022], the coding approach you select should align with your research questions and theoretical framework.


# Data Preparation


We demonstrate how to handle categorical predictors using sibling data from the NLSY79. We begin by loading the necessary packages and preparing the dataset.


## Package Loading and Data Setup
 
```{r setup, message = FALSE}
# Loading necessary packages and data
# For easy data manipulation
library(dplyr)
# For kinship linkages
library(NlsyLinks)
# For discordant-kinship regression
library(discord)
# pipe
library(magrittr)

# data
data(data_flu_ses)
```

We then filter and clean the dataset, create kinship links, and recode the categorical variables. This example uses full siblings (R = 0.5) from the Gen1 cohort.


```{r set-df-link}
# for reproducibility
set.seed(2023)

link_vars <- c("S00_H40", "RACE", "SEX")

# Specify NLSY database and kin relatedness

link_pairs <- Links79PairExpanded %>%
  filter(RelationshipPath == "Gen1Housemates" & RFull == 0.5)
```

We use `CreatePairLinksSingleEntered()` from the `NlsyLinks` package to merge kinship links with our target variables. This merge creates a sibling-pair dataset in wide format. Suffixes distinguish the two individuals in each pair.


```{r}
df_link <- CreatePairLinksSingleEntered(
  outcomeDataset = data_flu_ses,
  linksPairDataset = link_pairs,
  outcomeNames = link_vars
)
```


To ensure the dependent variable (SES at age 40) is available for both siblings, we remove cases with missing values. We also recode sex and race into factors.

```{r}
# We removed the pair when the Dependent Variable is missing.
df_link <- df_link %>%
  filter(!is.na(S00_H40_S1) & !is.na(S00_H40_S2)) %>%
  mutate(
    SEX_S1 = case_when(
      SEX_S1 == 0 ~ "MALE",
      SEX_S1 == 1 ~ "FEMALE"
    ),
    SEX_S2 = case_when(
      SEX_S2 == 0 ~ "MALE",
      SEX_S2 == 1 ~ "FEMALE"
    ),
    RACE_S1 = case_when(
      RACE_S1 == 0 ~ "NONMINORITY",
      RACE_S1 == 1 ~ "MINORITY"
    ),
    RACE_S2 = case_when(
      RACE_S2 == 0 ~ "NONMINORITY",
      RACE_S2 == 1 ~ "MINORITY"
    )
  )
```

For full siblings, race is a between-dyad variable. In this example, we restrict the analyses to same-race pairs.


```{r}
df_link <- df_link %>%
  dplyr::filter(RACE_S1 == RACE_S2)
```

To avoid violating assumptions of independence, we retain only one sibling pair per household:


```{r}
df_link <- df_link %>%
  group_by(ExtendedID) %>%
  slice_sample() %>%
  ungroup()
```

## Handling Categorical Predictors

### Mixed Variables: Sex as an Example

Sex is a classic example of a mixed variable in sibling studies because it can vary both within and between dyads. Siblings may be of the same or different sexes, and the composition varies across families.

We use the `discord_data()` function to prepare the data for analysis.


```{r}
cat_sex <- discord_data(
  data = df_link,
  outcome = "S00_H40",
  sex = "SEX",
  race = "RACE",
  demographics = "sex",
  predictors = NULL,
  pair_identifiers = c("_S1", "_S2"),
  coding_method = "both"
)
```

In the restructured data, the individual with the higher SES (the dependent variable) is labeled "_1" and the other is labeled "_2". This gives us the following sex compositions:


```{r sex, echo = FALSE, eval = knitr::is_html_output(),error=FALSE}
cat_sex <- cat_sex %>%
  dplyr::mutate(SEX_binarymatch = case_when(
    SEX_binarymatch == 0 ~ "mixed-sex",
    SEX_binarymatch == 1 ~ "same-sex"
  ))

cat_sex %>%
  slice(1:500) %>%
  slice_sample(n = 6) %>%
  kableExtra::kbl("html", align = "c") %>%
  kableExtra::kable_styling(full_width = FALSE)
```

The dataset contains the following sex pairings:


```{r preview-sex, echo = FALSE, eval = knitr::is_html_output()}
cat_sex %>%
  group_by(SEX_1, SEX_2) %>%
  summarize(n(), .groups = "drop") %>%
  kableExtra::kbl("html", align = "c", col.names = c("SEX_1", "SEX_2", "sample_size")) %>%
  kableExtra::kable_styling(full_width = FALSE)
```

By default, the `SEX_1` variable indicates the sex of the individual who has the higher DV within the pair, and the `SEX_2` variable indicates the sex of the other member of the dyad.

<!--
In this example, the dependent variable (DV) is `S00_H40_diff`, the difference score of socio-economic status (SES) of the pair at age 40. Considering the DV (`S00_H40_diff`) is a between-dyads variable, we can make the sex variable a between-dyads variable as well. 


If we put the individual's sex in the pair as a predictor in the regression model, the individual's sex is an individual-level predictor, while DV (`S00_H40_diff`) is not an individual-level variable. This means that variables are not at a comparable level, hindering meaningful interpretation. However, by forcing the sex variable to be a between-dyads variable by using a gender-composition variable, results can be more interpretable and meaningful. To sum up, using the sex composition of the pair as a predictor in the discordant-kinship regression model can yield more meaningful results, rather than using an individual's sex as a predictor in the regressions.

So, how to force the sex variable to be a sex-composition variable? The above table shows all the possible combinations of sex composition: 1) male-male, 2) male-female, 3) female-male, and 4) female-female. However, it is hard to believe that the distinction between "female-male" pairs and the "male-female" is meaningful unless there are strong theoretical reasons behind it because the "_S1" and "_S2" were assigned by the DV value of the participants. Specifically, the "female-male" pair means that the one with a higher DV in the pair is "female" and the other one was "male", and the "male-female" pair means that the one with a higher DV in the pair was "male" and the other one was "female". 

Thus, we can utilize the following sex-composition variables. First, we can use sex-composition variables that have three factors: 1) female-female,2) "male-female" and "female-male", and 3) "male-male".  Or, we can also use sex-composition variables that have two factors: 1) "same-sex", and 2) mixed sex. The `discord_data()` function and `discord_regression()` function utilize these two options; the binary match variable utilizes the former categorizations, and the multi-match variable utilizes the latter categorization. 
-->

As shown, the `discord_data()` function generates both `SEX_binarymatch` and `SEX_multimatch`  variables. This recodes the sex variable—which initially varied within and between dyads—into between-dyad variables:

```{r sex-compositions, echo = FALSE, eval = knitr::is_html_output()}
cat_sex %>%
  group_by(SEX_binarymatch, SEX_multimatch, SEX_1, SEX_2) %>%
  summarize(n(), .groups = "drop") %>%
  kableExtra::kbl("html", align = "c", col.names = c("binary", "multi", "SEX_1", "SEX_2", "sample_size")) %>%
  kableExtra::kable_styling(full_width = FALSE)
```
 
Researchers can choose between these options depending on their research question:
- Use `SEX_binarymatch`  when comparing same-sex and mixed-sex pairs.
- Use `SEX_multimatch` when comparing male-male, female-female, and mixed-sex pairs.

### Between-dyad Variable: Race as an Example

For demonstration purposes, we have already restricted the dataset to include only same-race pairs, making race a between-dyad variable. We now prepare the data specifically for testing whether there are racial differences in SES discordance.


```{r}
set.seed(2023) # for reproducibility

# Prepare data with race as demographic variable
cat_race <- discord_data(
  data = df_link,
  outcome = "S00_H40",
  predictors = NULL,
  sex = "SEX",
  race = "RACE",
  demographics = "race",
  pair_identifiers = c("_S1", "_S2"),
  coding_method = "both"
)
```

The race compositions in the dataset are as follows:


```{r preview-race, echo = FALSE, eval = knitr::is_html_output()}
cat_race <- cat_race %>%
  dplyr::mutate(RACE_binarymatch = case_when(
    RACE_binarymatch == 0 ~ "mixed-race",
    RACE_binarymatch == 1 ~ "same-race"
  ))

cat_race %>%
  group_by(RACE_binarymatch, RACE_multimatch, RACE_1, RACE_2) %>%
  summarize(n(), .groups = "drop") %>%
  kableExtra::kbl("html",
    align = "c",
    col.names = c("RACE_binarymatch", "RACE_multimatch", "RACE_1", "RACE_2", "sample_size")
  ) %>%
  kableExtra::kable_styling(full_width = FALSE)
```

Because we filtered for same-race pairs, all pairs have RACE_binarymatch = "same-race." When using NLSY data, the RACE_multimatch variable distinguishes between the three categories used by the Bureau of Labor Statistics (Black, Hispanic, and Non-Black, Non-Hispanic).

The `RACE_binarymatch` variable indicates whether the pair is same-race or mixed-race. As all pairs are same-race in this sample, this variable does not vary. The `RACE_multimatch` variable classifies pairs into one of three categories:
The RACE_binarymatch variable indicates whether the pair is same-race or mixed-race. As all pairs are same-race in this sample, this variable does not vary. The RACE_multimatch variable classifies pairs into one of three categories:
Since we filtered for same-race pairs only, all pairs have RACE_binarymatch = "same-race". When using NLSY data, the RACE_multimatch variable distinguishes between the three groupings that the bureau of labor statistics uses (Black, Hispanic, and Non-Black, Non-Hispanic).

The `RACE_binarymatch` variable indicates whether the pair is the same-race pair or mixed-race pair .As all pairs are same-race in this sample, this variable does not vary within dyad. The `RACE_multimatch`  The RACE_multimatch variable classifies pairs into one of three categories:
- Minority-minority,

- Nonminority-nonminority,

- Discordant (unused in this restricted sample).

### Combining Binary and Multi-match Variables

We can also prepare data that includes multiple demographic variables.

```{r}
# for reproducibility

set.seed(2023)

cat_both <- discord_data(
  data = df_link,
  outcome = "S00_H40",
  predictors = NULL,
  sex = "SEX",
  race = "RACE",
  demographics = "both",
  pair_identifiers = c("_S1", "_S2"),
  coding_method = "both"
)
```

The demographic variables can be used in the same regression model. 

```{r processboth}
cat_both <- cat_both %>%
  dplyr::mutate(
    RACE_binarymatch = case_when(
      RACE_binarymatch == 0 ~ "mixed-race",
      RACE_binarymatch == 1 ~ "same-race"
    ),
    SEX_binarymatch = case_when(
      SEX_binarymatch == 0 ~ "mixed-sex",
      SEX_binarymatch == 1 ~ "same-sex"
    )
  )
```

```{r preview-both, echo = FALSE, eval = knitr::is_html_output()}
cat_both %>%
  group_by(RACE_multimatch, RACE_1, RACE_2, SEX_binarymatch, SEX_multimatch, SEX_1, SEX_2) %>%
  summarize(n(), .groups = "drop") %>%
  kableExtra::kbl("html",
    align = "c",
    col.names = c(
      "RACE_multi", "RACE_1", "RACE_2", "SEX_binary", "SEX_multi", "SEX_1", "SEX_2",
      "sample_size"
    )
  ) %>%
  kableExtra::kable_styling(full_width = FALSE)
```

In this table:

- `RACE_multimatch` shows whether pairs are minority-minority or nonminority-nonminority.

- `SEX_binarymatch` distinguishes same-sex and mixed-sex pairs.

- `SEX_multimatch` identifies male-male, female-female, and mixed-sex pairings.

# Results and Interpretation

## Regression Analysis: Sex Variables

### Binary Match Coding for Sex

First, we test whether same-sex versus mixed-sex pairs differ in SES discordance.

The regression model can be conducted as such:

```{r}
discord_sex_binary <- discord_regression(
  data = df_link,
  outcome = "S00_H40",
  sex = "SEX",
  race = "RACE",
  demographics = "sex",
  predictors = NULL,
  pair_identifiers = c("_S1", "_S2"),
  coding_method = "binary"
)
```


```{r, echo = FALSE, eval = knitr::is_html_output()}
discord_sex_binary %>%
  broom::tidy() %>%
  mutate(
    p.value = scales::pvalue(p.value, add_p = TRUE),
    across(.cols = where(is.numeric), ~ round(.x, 3))
  ) %>%
  rename(
    "Standard Error" = std.error,
    "T Statistic" = statistic
  ) %>%
  rename_with(~ snakecase::to_title_case(.x)) %>%
  kableExtra::kbl("html", align = "c") %>%
  kableExtra::kable_styling() %>%
  kableExtra::column_spec(1:5, extra_css = "text-align: center;")
```

#### Interpretation:

We predict sibling differences in SES (`S00_H40_diff`).

- The mean SES score (`S00_H40_mean`)  is a significant control variable (p =`r round(summary(discord_sex_binary)[["coefficients"]]["S00_H40_mean", "Pr(>|t|)"],3)`).  `S00_H40_mean` is negatively associated with the difference in SES score between siblings at age 40, controlling for another variable (in this case, `SEX_binarymatch`). For one unit increase of `S00_H40_mean`, `S00_H40_diff`is expected to decrease approximately `r round(discord_sex_binary[["coefficients"]][["S00_H40_mean"]],3)`.

- The binary sex match variable `SEX_binarymatch` is not a significant predictor (p = `r round(summary(discord_sex_binary)[["coefficients"]]["SEX_binarymatch", "Pr(>|t|)"],3)`), when controlling for `S00_H40_mean`. There is no significant differences between same-sex pairs and mixed-sex pairs in `S00_H40_diff`. This means that the difference between same-sex pairs and mixed-sex pairs does not significantly predict the `S00_H40_diff` in the pair when controlling for `S00_H40_mean`. <!--It is estimated that, compared to the reference group (the mixed-sex pairs), the same-sex pairs would have approximately `r round(discord_sex_binary[["coefficients"]][["SEX_binarymatch"]],3)` higher difference score of `S00_H40_diff`, when controlling for `S00_H40_mean`. However, this coefficient is not significant, so it is not advisable to interpret the coefficient. -->


### Multi-Match Coding for Sex

Next, we examine whether male-male, female-female, and mixed-sex pairs differ in SES discordance.

The regression model can be conducted as such:

```{r }
discord_sex_multi <- discord_regression(
  data = df_link,
  outcome = "S00_H40",
  sex = "SEX",
  race = "RACE",
  predictors = NULL,
  pair_identifiers = c("_S1", "_S2"),
  coding_method = "multi"
)
```

```{r, echo = FALSE, eval = knitr::is_html_output()}
discord_sex_multi %>%
  broom::tidy() %>%
  mutate(
    p.value = scales::pvalue(p.value, add_p = TRUE),
    across(.cols = where(is.numeric), ~ round(.x, 3))
  ) %>%
  rename(
    "Standard Error" = std.error,
    "T Statistic" = statistic
  ) %>%
  rename_with(~ snakecase::to_title_case(.x)) %>%
  kableExtra::kbl("html", align = "c") %>%
  kableExtra::kable_styling() %>%
  kableExtra::column_spec(1:5, extra_css = "text-align: center;")
```

#### Interpretation:


- The term `S00_H40_mean`  was a significant control variable (p = `r round(summary(discord_sex_multi)[["coefficients"]]["S00_H40_mean","Pr(>|t|)"],3)`). This means that the mean SES score for the sibling pairs (`S00_H40_mean`) is negatively associated with the difference in SES between siblings (`S00_H40_diff`), controlling for other variables (in this case, `SEX_multimatch`). It is estimated that for one unit increase of `S00_H40_mean`, `S00_H40_diff` is expected to decrease approximately `r abs(round(discord_sex_multi[["coefficients"]][["S00_H40_mean"]],3))`. 

- There was no significant difference between female-female pairs and male-male pairs (p=`r round(summary(discord_sex_multi)[["coefficients"]]["SEX_multimatchMALE", "Pr(>|t|)"],3)`) to predict `S00_H40_diff`. Similarly, there were no significant differences between mixed-sex pairs and female-female pairs (p = `r round(summary(discord_sex_multi)[["coefficients"]]["SEX_multimatchmixed", "Pr(>|t|)"],3)`).

- The coefficient `r abs(round(discord_sex_multi[["coefficients"]][["SEX_multimatchMALE"]],3))` is the difference between the expected `S00_H40_diff` for the reference group (in this case, the female-female pairs) and the male-male pairs. 

- The coefficient `r abs(round(discord_sex_multi[["coefficients"]][["SEX_multimatchmixed"]],3))` is the difference between the expected `S00_H40_diff` for the reference group (in this case, the female-female pairs) and the mixed-sex pairs. However, these coefficients are not significant, so it is not advisable to interpret the coefficients. 

### Mean SES Model with Sex

We can also examine whether sex composition predicts mean SES levels (rather than SES differences):

```{r}
discord_cat_mean <- lm(S00_H40_mean ~ SEX_binarymatch,
  data = cat_sex
)
```
```{r, echo = FALSE, eval = knitr::is_html_output()}
discord_cat_mean %>%
  # for nicer regression output
  broom::tidy() %>%
  mutate(
    p.value = scales::pvalue(p.value, add_p = TRUE),
    across(.cols = where(is.numeric), ~ round(.x, 3))
  ) %>%
  rename(
    "Standard Error" = std.error,
    "T Statistic" = statistic
  ) %>%
  rename_with(~ snakecase::to_title_case(.x)) %>%
  kableExtra::kbl("html", align = "c") %>%
  kableExtra::kable_styling() %>%
  kableExtra::column_spec(1:5, extra_css = "text-align: center;")
```

#### Interpretation:

In this regression model, the mean SES score for the siblings (`S00_H40_mean`) was regressed on the SEX-composition variable (`SEX_binarymatch`). 

There is no significant difference between same-sex pairs and mixed-sex pairs in the mean SES score for the siblings (p=`r round(summary(discord_cat_mean)[["coefficients"]]["SEX_binarymatchsame-sex", "Pr(>|t|)"],3)`)

It is estimated that compared to the mixed-sex pairs, the same-sex pairs would have approximately `r abs(round(discord_cat_mean[["coefficients"]][["SEX_binarymatchsame-sex"]],3))` higher `S00_H40_mean`. However, this coefficient is not significant, so it is not advisable to interpret the coefficient. 

### Multi-Match Coding for Sex

```{r}
discord_cat_mean2 <- lm(S00_H40_mean ~ SEX_multimatch,
  data = cat_sex
)
```
```{r echo = FALSE, eval = knitr::is_html_output()}
discord_cat_mean2 %>%
  # for nicer regression output
  broom::tidy() %>%
  mutate(
    p.value = scales::pvalue(p.value, add_p = TRUE),
    across(.cols = where(is.numeric), ~ round(.x, 3))
  ) %>%
  rename(
    "Standard Error" = std.error,
    "T Statistic" = statistic
  ) %>%
  rename_with(~ snakecase::to_title_case(.x)) %>%
  kableExtra::kbl("html", align = "c") %>%
  kableExtra::kable_styling() %>%
  kableExtra::column_spec(1:5, extra_css = "text-align: center;")
```

#### Interpretation:

There is a significant difference between female-female pairs and male-male pairs (`r round(summary(discord_cat_mean2)[["coefficients"]]["SEX_multimatchMALE", "Pr(>|t|)"],3)`) to predict the `S00_H40_mean`. However, there is  no significant difference between mixed-sex pairs and female-female pairs (p = `r round(summary(discord_cat_mean2)[["coefficients"]]["SEX_multimatchmixed", "Pr(>|t|)"],3)`).

The coefficient `r abs(round(discord_cat_mean2[["coefficients"]]["SEX_multimatchMALE"],3))` is the difference between the expected `S00_H40_mean` (the mean SES score for the siblings) for the reference group (in this case, the female-female pairs) and the male-male pairs. It can be concluded that male-male pairs and female-female pair has significant differences in `S00_H40_mean`.

The coefficient `r abs(round(discord_cat_mean2[["coefficients"]]["SEX_multimatchmixed"],3))` is the difference between the expected `S00_H40_mean` for the reference group (in this case, the female-female pairs) and the mixed-sex pairs. However, these coefficients are not significant, so it is not advisable to interpret the coefficients. 

## Regression Analysis: Race Variables

For race variables, we use the multi-match coding to examine differences between minority and non-minority pairs:

### Multimatch

The regression model with a multi-match race variable as a  predictor can be conducted as such:

```{r}
# perform kinship regressions
cat_race_reg <- discord_regression(
  data = df_link,
  outcome = "S00_H40",
  sex = "SEX",
  race = "RACE",
  demographics = "race",
  predictors = NULL,
  pair_identifiers = c("_S1", "_S2"),
  coding_method = "multi"
)
```


```{r echo = FALSE, eval = knitr::is_html_output()}
cat_race_reg %>%
  # for nicer regression output
  broom::tidy() %>%
  mutate(
    p.value = scales::pvalue(p.value, add_p = TRUE),
    across(.cols = where(is.numeric), ~ round(.x, 3))
  ) %>%
  rename(
    "Standard Error" = std.error,
    "T Statistic" = statistic
  ) %>%
  rename_with(~ snakecase::to_title_case(.x)) %>%
  kableExtra::kbl("html", align = "c") %>%
  kableExtra::kable_styling() %>%
  kableExtra::column_spec(1:5, extra_css = "text-align: center;")
```

#### Interpretation:

The mean SES score for the siblings (`S00_H40_mean`) is a significant control variable (p =`r round(summary(cat_race_reg)[["coefficients"]]["S00_H40_mean","Pr(>|t|)"],3)`. The term `S00_H40_mean` is negatively associated with the difference score of SES between siblings (`S00_H40_diff`), controlling for another variable (in this case, `RACE_multimatchNONMINORITY`). It is estimated that for one unit increase of `S00_H40_mean`, the DV (`S00_H40_diff`) is expected to decrease by approximately 
`r abs(round(cat_race_reg[["coefficients"]][["S00_H40_mean"]],3))`.

The term `RACE_multimatchNONMINORITY` was a significant predictor of `S00_H40_diff` (p = `r round(summary(cat_race_reg)[["coefficients"]]["RACE_multimatchNONMINORITY","Pr(>|t|)"],3)`) after controlling for `S00_H40_mean`. This means that the difference between the "Minority-minority" sibling pairs and "nonminority-non-minority" sibling pairs significantly predicts `S00_H40_diff`. Specifically, compared to the reference group (the "minority" pairs), "nonminority" pairs are expected to have approximately `r abs(round(cat_race_reg[["coefficients"]][["RACE_multimatchNONMINORITY"]],3))` lower `S00_H40_diff`.  


### Mean SES Model with Race

We can also examine whether race composition predicts mean SES levels:

```{r}
discord_cat_mean <- lm(S00_H40_mean ~ RACE_multimatch,
  data = cat_race
)
```
```{r echo = FALSE, eval = knitr::is_html_output()}
discord_cat_mean %>%
  # for nicer regression output
  broom::tidy() %>%
  mutate(
    p.value = scales::pvalue(p.value, add_p = TRUE),
    across(.cols = where(is.numeric), ~ round(.x, 3))
  ) %>%
  rename(
    "Standard Error" = std.error,
    "T Statistic" = statistic
  ) %>%
  rename_with(~ snakecase::to_title_case(.x)) %>%
  kableExtra::kbl("html", align = "c") %>%
  kableExtra::kable_styling() %>%
  kableExtra::column_spec(1:5, extra_css = "text-align: center;")
``` 

#### Interpretation:


There is significant difference between "minority" pairs and "nonminority" pairs in `S00_H40_mean` (p =`r round(summary(discord_cat_mean)[["coefficients"]]["RACE_multimatchNONMINORITY","Pr(>|t|)"],3)`). 
It is estimated that, compared to the reference group (minority pairs), the nonminority pairs would have approximately 
`r abs(round(discord_cat_mean[["coefficients"]][["RACE_multimatchNONMINORITY"]],3))` lower `S00_H40_mean`.  


## Regression Analysis: Combined Sex and Race Variables

We can include both sex and race as predictors in the same model:

Like before we restructure the data for the kinship-discordant regression, but this time we using the `discord_regression()` function, which calls the `discord_data()` function internally. 

### Multimatch

```{r}
both_multi <- discord_regression(
  data = df_link,
  outcome = "S00_H40",
  sex = "SEX",
  race = "RACE",
  demographics = "both",
  predictors = NULL,
  pair_identifiers = c("_S1", "_S2"),
  coding_method = "multi"
)
```
```{r echo = FALSE, eval = knitr::is_html_output()}
both_multi %>%
  # for nicer regression output
  broom::tidy() %>%
  mutate(
    p.value = scales::pvalue(p.value, add_p = TRUE),
    across(.cols = where(is.numeric), ~ round(.x, 3))
  ) %>%
  rename(
    "Standard Error" = std.error,
    "T Statistic" = statistic
  ) %>%
  rename_with(~ snakecase::to_title_case(.x)) %>%
  kableExtra::kbl("html", align = "c") %>%
  kableExtra::kable_styling() %>%
  kableExtra::column_spec(1:5, extra_css = "text-align: center;")
```

#### Interpretation

The mean SES score for the siblings (`S00_H40_mean`) is a significant control variable (p = `r round(summary(both_multi)[["coefficients"]]["S00_H40_mean","Pr(>|t|)"],3)` ). `S00_H40_mean` is negatively associated with the difference score of SES between the siblings (`S00_H40_diff`), controlling for other variables (in this case, the `SEX_multimatchMALE`, `SEX_multimatchmixed` and `RACE_multimatchNONMINORITY`). It is estimated that for one unit increase of the mean SES score for the sibling pairs( `S00_H40_mean`), the difference score of SES between siblings(`S00_H40_diff`) is expected to decrease approximately `r round(both_multi[["coefficients"]][["S00_H40_mean"]],3)`.

The `SEX_multimatchmixed` and `SEX_multimatchMALE` are not significant predictors when controlling for other variables (i.e., `S00_H40_mean` and `RACE_multimatchNONMINORITY`). The coefficient `r round(both_multi[["coefficients"]][["SEX_multimatchMALE"]],3)` is the difference between the expected DV (`S00_H40_diff`) for the reference group (in this case, the "female-female" pairs) and the "male-male" pairs. The coefficient `r round(both_multi[["coefficients"]][["SEX_multimatchmixed"]],3)` is the difference between the expected DV (`S00_H40_diff`) for the female-female pairs and the mixed-sex pairs. However, these coefficients are not significant, so it is not advisable to interpret the coefficients.
 
The term `RACE_multimatchNONMINORITY` is a significant predictor (p = `r round(summary(both_multi)[["coefficients"]]["RACE_multimatchNONMINORITY", "Pr(>|t|)"],3)`) when controlling for other variables (i.e., `SEX_multimatchMALE`, `SEX_multimatchmixed`, and `S00_H40_mean`). This means that there is a significant difference between minority race pairs and nonminority race pairs in the difference score of SES between siblings (`S00_H40_diff`) when controlling for the model covariates (i.e., `SEX_multimatchMALE`, `SEX_multimatchmixed`, and `S00_H40_mean`). Specifically, compared to the minority race pairs, the nonminority race pairs were expected to have approximately `r round(both_multi[["coefficients"]][["RACE_multimatchNONMINORITY"]],3)` higher difference score of SES between siblings at age 40.


### Alternative Model: Binary Sex Match and Multi-Match Race

To combine binary and multi-match coding approaches, we can use the standard `lm()` function:

We can perform regression using the binary-match sex variable and multi-match race variable as such:
 
```{r}
discord_cat_diff <- lm(
  S00_H40_diff ~ S00_H40_mean +
    RACE_multimatch + SEX_binarymatch,
  data = cat_both
)
```

```{r echo = FALSE, eval = knitr::is_html_output()}
# for nicer regression output

discord_cat_diff %>%
  broom::tidy() %>%
  mutate(
    p.value = scales::pvalue(p.value, add_p = TRUE),
    across(.cols = where(is.numeric), ~ round(.x, 3))
  ) %>%
  rename(
    "Standard Error" = std.error,
    "T Statistic" = statistic
  ) %>%
  rename_with(~ snakecase::to_title_case(.x)) %>%
  kableExtra::kbl("html", align = "c") %>%
  kableExtra::kable_styling() %>%
  kableExtra::column_spec(1:5, extra_css = "text-align: center;")
```

#### Interpretation:

The mean SES score for the siblings at 40 (`S00_H40_mean`) is a significant control variable (p =
`r round(summary(discord_cat_diff)[["coefficients"]]["S00_H40_mean","Pr(>|t|)"],3)`. The mean SES score for the siblings (`S00_H40_mean`) is negatively associated with the difference score of SES between the siblings (`S00_H40_diff`), controlling for other variables (in this case, `SEX_binarymatchsame-sex` and `RACE_multimatchNONMINORITY`). It is estimated that for one unit increase of `S00_H40_mean`, the DV (`S00_H40_diff`) is expected to decrease approximately  `r abs(round(discord_cat_diff[["coefficients"]][["S00_H40_mean"]],3))`. 

The term `SEX_binarymatchsame-sex` is not a significant predictor (p = `r round(summary(discord_cat_diff)[["coefficients"]][["SEX_binarymatchsame-sex", "Pr(>|t|)"]],3)`) when controlling for other variables (i.e., `S00_H40_mean` and `RACE_multimatchNONMINORITY`). This means that the difference between same-sex pairs and mixed-sex pairs does not significantly predict the difference score of SES between siblings (`S00_H40_diff`) when controlling for the mean SES score for the siblings (`S00_H40_mean`) and race-composition of the pair (`RACE_multimatchNONMINORITY`). Compared to the mixed-sex pairs, it is estimated that the same-sex pairs have approximately `r abs(round(discord_cat_diff[["coefficients"]][["SEX_binarymatchsame-sex"]],3))`higher difference score of SES between siblings (`S00_H40_diff`) when controlling for the mean SES score for the sibling pairs (`S00_H40_mean`) and race-composition of the pairs (`RACE_multimatchNONMINORITY`). However, this coefficient is not statistically significant and should not be interpreted.

The term `RACE_multimatchNONMINORITY`is a significant predictor (p = `r round(summary(discord_cat_diff)[["coefficients"]][["RACE_multimatchNONMINORITY", "Pr(>|t|)"]],3)`). This means that there is a significant difference between minority race pairs and nonminority race pairs to predict the difference score of SES between the siblings (`S00_H40_diff`) when controlling for the model covariates (i.e., `SEX_binarymatchsame-sex` and `S00_H40_mean`). Specifically, compared to the minority race pairs, nonminority race pairs were expected to have approximately `r abs(round(discord_cat_diff[["coefficients"]][["RACE_multimatchNONMINORITY"]],3))` lower difference scores of SES between siblings (`S00_H40_diff`). 



### Mean SES Models with Both Variables

Finally, we examine how sex and race together predict mean SES levels:


```{r}
discord_cat_mean <- lm(
  S00_H40_mean ~ RACE_multimatch +
    SEX_multimatch,
  data = cat_both
)
```
```{r echo = FALSE, eval = knitr::is_html_output()}
discord_cat_mean %>%
  broom::tidy() %>%
  mutate(
    p.value = scales::pvalue(p.value, add_p = TRUE),
    across(.cols = where(is.numeric), ~ round(.x, 3))
  ) %>%
  rename(
    "Standard Error" = std.error,
    "T Statistic" = statistic
  ) %>%
  rename_with(~ snakecase::to_title_case(.x)) %>%
  kableExtra::kbl("html", align = "c") %>%
  kableExtra::kable_styling() %>%
  kableExtra::column_spec(1:5, extra_css = "text-align: center;")
``` 

#### Interpretation:

The term `SEX_multimatchMALE` is a significant predictor (p = `r round(summary(discord_cat_mean)[["coefficients"]]["SEX_multimatchMALE","Pr(>|t|)"],3)`) when controlling for other variables (i.e., `SEX_multimatchmixed`and `RACEe_multimatchNONMINORITY`). This means that the difference between female-female pairs and  male-male pairs  significantly predicted the mean SES score for the siblings  when controlling for race-composition of the pairs. Compared to the female-female pairs, it is estimated that the male-male pairs have approximately 
`r abs(round(discord_cat_mean[["coefficients"]][["SEX_multimatchMALE"]],3))` higher mean SES score for the siblings when controlling for and race-composition of the pairs.  

The term `SEX_multimatchmixed` was not a significant predictor (p = `r round(summary(discord_cat_mean)[["coefficients"]]["SEX_multimatchmixed","Pr(>|t|)"],3)`) when controlling for other variables (i.e., `SEX_multimatchMALE`and `RACE_multimatchNONMINORITY`). This means that the difference between female-female pairs and mixed-sex pairs does not significantly predict the mean SES score for the siblings when controlling for race-composition of the pairs. Compared to the female-female pairs, it is estimated that the mixed-sex pairs have approximately 
`r abs(round(discord_cat_mean[["coefficients"]][["SEX_multimatchmixed"]],3))` higher mean SES score for the sibling pairs when controlling for and race-composition of the pairs. However, this variable is not significant, so it is not advisable to interpret the coefficient.   


The term `RACE_multimatchNONMINORITY` is a significant predictor (p = `r round(summary(discord_cat_mean)[["coefficients"]]["RACE_multimatchNONMINORITY","Pr(>|t|)"],3)`). This means that there is a significant difference between minority race pairs and nonminority race pairs in the mean SES score for the sibling pairs (`S00_H40_mean`) when controlling for the other variables (i.e., `SEX_multimatchmixed` and `SEX_multimatchMALE`). Specifically, compared to the minority race pairs, nonminority race pairs were expected to have approximately `r abs(round(discord_cat_mean[["coefficients"]][["RACE_multimatchNONMINORITY"]],3))` higher mean SES score for siblings 

### Mean SES Models with Combining multimatch and binarymatch


```{r}
discord_cat_mean2 <- lm(S00_H40_mean ~ RACE_multimatch + SEX_binarymatch,
  data = cat_both
)
```

```{r echo = FALSE, eval = knitr::is_html_output()}
discord_cat_mean2 %>%
  broom::tidy() %>%
  mutate(
    p.value = scales::pvalue(p.value, add_p = TRUE),
    across(.cols = where(is.numeric), ~ round(.x, 3))
  ) %>%
  rename(
    "Standard Error" = std.error,
    "T Statistic" = statistic
  ) %>%
  rename_with(~ snakecase::to_title_case(.x)) %>%
  kableExtra::kbl("html", align = "c") %>%
  kableExtra::kable_styling() %>%
  kableExtra::column_spec(1:5, extra_css = "text-align: center;")
``` 


#### Interpretation:

The term `SEX_binarymatchsame-sex` is not a significant predictor (p = `r round(summary(discord_cat_mean2)[["coefficients"]]["SEX_binarymatchsame-sex","Pr(>|t|)"],3)`) when controlling for the race-composition variable (i.e., `RACE_multimatchNONMINORITY`). This means that the difference between mixed-sex pairs and same sex pairs does not significantly predict the mean SES score for the siblings  when controlling for race-composition of the pairs. Compared to the mixed-sex pairs, it is estimated that the same-sex pairs have approximately 
`r abs(round(discord_cat_mean2[["coefficients"]][["SEX_binarymatchsame-sex"]],3))` higher mean SES score for the sibling pairs (`S00_H40_mean`) when controlling for and race-composition of the pairs (`RACE_multimatchNONMINORITY`). However, this variable is not significant, so it is not advisable to interpret the coefficient.    


The term `RACE_multimatchNONMINORITY` is a significant predictor (p = `r round(summary(discord_cat_mean2)[["coefficients"]]["RACE_multimatchNONMINORITY","Pr(>|t|)"],3)`). This means that there is a significant difference between minority race pairs and nonminority race pairs in the the mean SES score for the siblings when controlling for the sex-composition variable (i.e., `SEX_binarymatchsame-sex`). Specifically, compared to the minority race pairs, nonminority race pairs were expected to have approximately `r abs(round(discord_cat_mean2[["coefficients"]][["RACE_multimatchNONMINORITY"]],3))` higher mean SES score for siblings 

# Conclusion

This vignette has demonstrated how to incorporate categorical predictors in discordant-kinship regression analyses using the `discord` package. 

Key findings and recommendations include:

- Variable Type Matters: Categorical variables must be handled differently depending on whether they are between-dyads variables (like race in our filtered sample) or mixed variables (like sex in sibling pairs).
- Coding Approaches Offer Different Insights:
  - Binary match coding examines whether similarity/difference matters
  - Multi-match coding allows for more detailed examination of specific category effects

## Implementation Recommendations:
For implementation in your own research, we recommend:
- Consider the theoretical nature of your categorical predictors
- Use `discord_data()` to prepare categorical variables appropriately
- Choose coding schemes based on your specific research questions
- Carefully interpret results in light of variable coding decisions


# References