---
title: "hyreg2 basics"
output:
html_document:
toc: true
toc_depth: 3
date: "2025-12-12"
editor_options:
markdown:
wrap: sentence
vignette: >
%\VignetteIndexEntry{hyreg2 basics}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
bibliography: references.bib
---
# The package in general
Our R package `hyreg2` allows users to estimate models on a combination of continuous and dichotomous data using a combined likelihood.
Latent classes can be detected using the EM-algorithm[@dempster1977].
The package was originally written to address preference heterogeneity or use a data-driven subgrouping of participants while estimating EQ-5D value sets[@devlin2022].
However, it can be applied to any situation in which a combined likelihood of continuous and dichotomous data should be maximized, e.g. as usual in environmental economics [@cameron1987].
**Functions**
The code for estimating models is based on the R package `flexmix`[@flexmix], which offers the option to write custom likelihood functions (M-step-drivers) for use during the M-step[@leisch2004][@grün2008].
For the `hyreg2` package, we wrote M-step-drivers that implement the hybrid model likelihood as described by Ramos-Goni [@ramos-goñi2017].
Additional wrapper functions `hyreg2` and `hyreg2_het` are included in the package, such that user do not have to input the drivers to `flexmix` themselves.

The package supports both linear and non-linear functions while using uncensored or censored data.
Furthermore, heteroscedasticity can be taken into account for linear models using the `hyreg2_het` function.
It is also possible to control for different covariates on the continuous and dichotomous parts of the data.
Start values to be used can differ between the expected latent classes.
Additionally, `hyreg2` offers the option to estimate the latent classes either on both types of data (continuous and dichotomous) or only on one type.
Once a model has been estimated, the functions `summary_hyreg2`, `give_class`, and `plot_hyreg2` can be used to evaluate the results.
# Basic functionalities
When using `hyreg2` to estimate latent class models, it is essential to account for groups of observations.
Controlling for such dependencies is essential to ensure valid parameter estimation and inference.
To demonstrate the implications of neglecting to control for groups of observations, we first simulate a dataset with two latent classes.
Afterwards we estimate a model without controlling for groups and explain where the model fails.
At the end of this chapter, we will demonstrate how a good model fit can be achieved.
## Simulate Dataset
We demonstrate the basic functionalities using a part of the the simulated dataset `simulated_data_norm`, which is included in the `hyreg2` package.
The dataset consists of 7 variables, with 200 continuous and 100 dichotomous observations in each of the two latent classes, resulting in a total number of 600 observations.
The variables `x1`, `x2` and `x3` contain random draws from a standard normal distribution.
Based on these, outcome values `y` are generated according to the linear predictor
$x_1 * \beta_{x1} +x_2 * \beta_{x2}+ x_3 * \beta_{x3} = Xb$.
Additionally, the variable `y_non` is computed according to the non-linear model $\frac{1}{\exp(x_1 * \beta_{x1} + x_2 * \beta_{x2})} = Xb_{non}$.
For the continuous data, `y` and `y_non` values were generated directly using the function rnorm with `Xb` or `Xb_non` as the mean.
For the dichotomous data, the values of `Xb` or `Xb_non` are multiplied by a scaling parameter `theta` and subsequently simulated using the function `rbinom`.
The code for the simulation can be seen below.
```{r chunk-1, message=FALSE, warning=FALSE}
# simulate dataframe called simulated_data_norm
library(hyreg2)
library(flexmix)
### CLASS 1 ###
# Set parameters
set.seed(42)
n_samples_tto1 <- 200
n_samples_dce1 <- 100
sigma1 <- 1.0
theta1 <- 5
beta1 <- c(0.5, -0.3, 0.8) # true values for beta
# simulate matrix
x_tto1 <- matrix(rnorm(n_samples_tto1 * length(beta1)), ncol = length(beta1))
x_dce1 <- matrix(rnorm(n_samples_dce1 * length(beta1)), ncol = length(beta1))
#Xb (linear predictor following formula xb = x1*beta1 + x2*beta2 + x3*beta3)
Xb_tto1 <- x_tto1 %*% beta1
Xb_dce1 <- (x_dce1 %*% beta1) * theta1
# generate y
y_tto1 <- rnorm(n_samples_tto1, mean = Xb_tto1, sd = sigma1) # continuous outcomes
logistic_tmp1 <- 0.5 + 0.5 * tanh(Xb_dce1 / 2)
y_dce1 <- rbinom(n_samples_dce1, size = 1, prob = logistic_tmp1) # binary outcomes
### dataset ###
data_tto1 <- data.frame(
type = rep("TTO", n_samples_tto1),
y = y_tto1,
x1 = x_tto1[, 1],
x2 = x_tto1[, 2],
x3 = x_tto1[, 3]
)
data_dce1 <- data.frame(
type = rep("DCE_A", n_samples_dce1),
y = y_dce1,
x1 = x_dce1[, 1],
x2 = x_dce1[, 2],
x3 = x_dce1[, 3]
)
simulated_data1 <- rbind(data_tto1, data_dce1)
### CLASS 2 ###
# Set parameters
set.seed(37)
n_samples_tto2 <- 200
n_samples_dce2 <- 100
sigma2 <- 0.5
theta2 <- 2
beta2 <- c(1.4, 2.3, -0.2) # true values for beta
# simulate matrix
x_tto2 <- matrix(rnorm(n_samples_tto2 * length(beta2)), ncol = length(beta2))
x_dce2 <- matrix(rnorm(n_samples_dce2 * length(beta2)), ncol = length(beta2))
# Xb (linear predictor following formula xb = x1*beta1 + x2*beta2 + x3*beta3)
Xb_tto2 <- x_tto2 %*% beta2
Xb_dce2 <- (x_dce2 %*% beta2) * theta2
# generate y
y_tto2 <- rnorm(n_samples_tto2, mean = Xb_tto2, sd = sigma2) # continuous outcomes
logistic_tmp2 <- 0.5 + 0.5 * tanh(Xb_dce2 / 2)
y_dce2 <- rbinom(n_samples_dce2, size = 1, prob = logistic_tmp2) # binary outcomes
### dataset ###
data_tto2 <- data.frame(
type = rep("TTO", n_samples_tto2),
y = y_tto2,
x1 = x_tto2[, 1],
x2 = x_tto2[, 2],
x3 = x_tto2[, 3]
)
data_dce2 <- data.frame(
type = rep("DCE_A", n_samples_dce2),
y = y_dce2,
x1 = x_dce2[, 1],
x2 = x_dce2[, 2],
x3 = x_dce2[, 3]
)
simulated_data2 <- rbind(data_tto2, data_dce2)
### DATASET with two classes ###
simulated_data1$class <- 1
simulated_data2$class <- 2
# set id numerbs
simulated_data1$id <- c(1:100, 1:100, 1:100)
simulated_data2$id <- c(101:200, 101:200, 101:200)
simulated_data_norm<- rbind(simulated_data1,simulated_data2)
# clean environment
rm(list = setdiff(ls(), "simulated_data_norm"))
```
The column `type` indicates whether a data point was generated according to a normal distribution (continuous data) or a binomial distribution (dichotomous data).
The column `id` specifies which continuous observations are linked with which dichotomous observations, and hence is a grouping variable.
For this dataset, each `id` is associated with two continuous observations and one dichotomous observation.
The column `class` contains the true underlying class memberships of the data.
This is visualized in figure 1 using the function `plot_hyreg2`.
```{r chunk-2, message=FALSE, warning=FALSE}
# visualize true classes from simulated_data_norm
p <- plot_hyreg2(data = simulated_data_norm,
x = "id",
y = "y",
id_col = "id",
class_df_model = simulated_data_norm[,c("id","class")],
colors = c("#F8766D","#00BFC4"))
p <- p + ggplot2::labs(title = "Figure 1")
p
```
The function `plot_hyreg2` expects a dataset (`data`) and two column names of this dataset as characters (`x` and `y`).
Additionally, the inputs `id_col` and `class_df_model` must be specified.
`id_col` must be the name of the grouping variable (as a character string) while `class_df_model` must be a dataframe containing two columns: the grouping variable (which must be named according to `id_col`) and the associated classes.
In this case, we already know which id is in which class, since we have simulated the dataset ourselves.
In real-world applications, users should instead use the `give_class` function to match each id to the class estimated by the model.
This is explained in further detail below.
`plot_hyreg2` returns an `ggplot`-object, hence additional `ggplot` specific parameters can be added, as demonstrated above.
## Estimation without further restrictions
First, we estimate a model without controlling for groups, i.e., we assign latent classes to each observation independently, without accounting for the fact that, in the simulated dataset, two continuous and one dichotomous observation belong together.
Since our goal is to identify two latent classes, we specify `k = 2`.
As we would like to estimate the mathematical model $x_1 * \beta_{x1} +x_2 * \beta_{x2}+ x_3 * \beta_{x3} = y$, we specify `formula` in R as `y ~ -1 + x1 + x2 + x3`.
Starting values must be provided for all variables in formula, hence for `x1, x2` and `x3`.
In addition, starting values for the scaling parameter `theta` and the parameter `sigma` (from the continuous data) must be supplied.
These values must necessarily be named `“theta”` and `“sigma”`, otherwise the function will not recognize them.
The above formula does not include an intercept.
If the model to be estimated does contain an intercept or interaction terms, start values must be provided for these as well.
They must be named according to the column names of model.matrix(formula,data), e.g. "(Intercept)".
Now, we specify a named vector `stv`, which contains the assumed start values for `x1, x2, x3` and for `theta`and `sigma`.Furthermore, in `hyreg2` the input argument `type` must be specified.
In our dataset, the column indicating which datapoint was simulated from a normal distribution and which from a binomial distribution is simply called type with the entries `“TTO”` and `“DCE_A”`.
In this first example, we set `latent = “both”`, which implies that the latent classes are assumed to manifest in both types of data.
```{r chunk-3, message=FALSE, warning=FALSE, results='hide'}
set.seed(277)
formula <- y ~ -1 + x1 + x2 + x3
k <- 2
stv <- setNames(c(0.2,0,1,1,1),c("x1","x2","x3","sigma","theta"))
control <- list(iter.max = 1000, verbose = 4)
mod <- hyreg2(formula = formula,
data = simulated_data_norm,
type = simulated_data_norm$type,
stv = stv,
k = k,
type_cont = "TTO",
type_dich = "DCE_A",
opt_method = "L-BFGS-B",
control = control,
latent = "both"
)
```
The function `summary_hyreg2` can be used to display the estimated parameters for both model classes.
Please note, that the outputs for `sigma` and `theta` are on a log-scale and have to be transformed using `exp()`to get the correct estimated values.
```{r chunk-4, message=FALSE, warning=FALSE}
summary_hyreg2(mod)
```
We visualize the estimated values using the function `plot_hyreg2`.
The argument `id_col` refers to the grouping column.
If the model was estimated without grouping (as in this case), the row names/numbers should simply be provided.
The input `class_df_model` requires a data frame with two columns, indicating which group (or observation) from the data has been assigned to which class by the model.
Such a data frame can easily be generated using the function `give_class`.
Again, if the model was estimated without controlling for groups, the row names of the dataset should be used as the input for `id_col` in the function `give_class`.
```{r chunk-5, message=FALSE, warning=FALSE}
# if model was estimated without grouping variable
simulated_data_norm$observation <- rownames(simulated_data_norm)
p <- plot_hyreg2(data = simulated_data_norm,
x = "id",
y = "y",
id_col = "observation",
class_df_model = give_class(data = simulated_data_norm,
model = mod,
id_col = "observation"))
p <- p + ggplot2::labs(title = "Figure 2")
p
```
This plot shows both types of observations (continuous and dichotomous).
There seem to be two lines of mostly blue (class 2) datapoints.
These are the datapoints corresponding to the dichotomous data.
It seems that the model (erroneously!) classified the data by datatypes, rather than by latent classes.
Using `plot_hyreg2`, it is also possible to display only the continuous data or only the dichotomous data.
For this purpose, the argument `type_to_plot` is used.
This argument expects a `list` with two elements: the first element contains the name of the column used to differentiate between the data types (as a character string), and the second element specifies the value associated with the desired type of data to be included in the plot.
This works analogously to `type` and `type_cont` / `type_dich` in the `hyreg2` input.
```{r chunk-6, message=FALSE, warning=FALSE}
# only continuous datapoints
p <- plot_hyreg2(data = simulated_data_norm,
x = "id",
y = "y",
id_col = "observation",
class_df_model = give_class(data = simulated_data_norm,
model = mod,
id_col = "observation"),
type_to_plot = list("type","TTO"))
p <- p + ggplot2::labs(title = "Figure 3")
p
```
```{r chunk-7, message=FALSE, warning=FALSE}
# only dichotomous datapoints
p <- plot_hyreg2(data = simulated_data_norm,
x = "id",
y = "y",
id_col = "observation",
class_df_model = give_class(data = simulated_data_norm,
model = mod,
id_col = "observation"),
type_to_plot = list("type","DCE_A"))
p <- p + ggplot2::labs(title = "Figure 4")
p
```
Figure 3 shows only the continuous datapoints while figure 4 shows only the dichotomous ones.
It can be seen that the dichotomous observations of the lower id numbers were often classified to class 2 while the continuous observations of the lower id numbers often were classified to class 1.
Additionally, the visualizations suggest that a considerable proportion of the dichotomous data is concentrated in class 2, whereas the majority of continuous data are assigned to class 1.
Even though the estimates of the model parameters for class 1 come quite close to the true parameters, the classification overall is far away from the true underlying classes, since only 58 % of the datapoints were classified correctly.
Moreover, it can be useful to have a look at the distribution of data types across the two latent classes.
```{r chunk-8, message=FALSE, warning=FALSE}
# ratio of correct classified datapoints
(sum(mod@cluster == simulated_data_norm$class))/dim(simulated_data_norm)[1]
# cross table of observation types and assigned classes
proof <- merge(simulated_data_norm,
give_class(simulated_data_norm,mod,"observation"), by = "observation")
table(proof$type,proof$mod_comp)
```
The model classifies more than 75% of the dichotomous data into class 2 and less than 25% into class 1.
By contrast, more than 60% of the continuous data points are assigned to class 1.
Overall, only 14% of the observations in class 1 are dichotomous, whereas in class 2 their proportion exceeds 50%.
All in all, this shows that it is necessary to control for groups of observations during the estimation process.
Otherwise, the model tends to differentiate between dichotomous and continuous data.
However, this represents an undesired behavior, as we would expect both classes to exhibit the same proportion of continuous and dichotomous data.
## Classifying individuals not observations
Including a grouping variable in the formula ensures that the class probabilities in the E-step of the EM-algorithm (via `flexmix`) are estimated per group rather than per observation.
This can be done by adding `"| group_variable_name"` to `formula`.
Further details can be found in Gruen and Leisch.
This specification has no impact on the likelihood function used in the M-step.
The model estimates change solely because groups, rather than each observation, are classified.
```{r chunk-9, message=FALSE, warning=FALSE, results='hide'}
set.seed(227)
formula <- y ~ -1 + x1 + x2 + x3 | id
k <- 2
stv <- setNames(c(0.2,0,1,1,1),c("x1","x2","x3","sigma","theta"))
control <- list(iter.max = 1000, verbose = 4)
mod_id <- hyreg2(formula = formula,
data = simulated_data_norm,
type = simulated_data_norm$type,
stv = stv,
k = k,
type_cont = "TTO",
type_dich = "DCE_A",
opt_method = "L-BFGS-B",
control = control,
latent = "both"
)
```
```{r chunk-9a, message=FALSE, warning=FALSE}
summary_hyreg2(mod_id)
```
The model estimates look worse than in the model without controlling for groups.
Nevertheless, we will examine the classification achieved by the model in the following plot.
```{r chunk-10, message=FALSE, warning=FALSE}
p <- plot_hyreg2(data = simulated_data_norm,
x ="id",
y = "y",
id_col ="id",
class_df_model = unique(give_class(data = simulated_data_norm,
model = mod_id,
id_col = "id")))
p <- p + ggplot2::labs(title = "Figure 5")
p
```
It can be seen that most of the dichotomous and continuous data points with lower id numbers were assigned to class 1.
For higher id numbers, an assignment to class 2 would be expected.
Although this assignment occurred in many cases, several data points with higher id numbers were nevertheless incorrectly classified to class 1.
An evaluation of the classification indicates that 74% of the data points were correctly assigned, representing a clear improvement compared to the model without control for grouping.
An examination of the distribution further reveals that in both classes, approximately 33% of the data are dichotomous and about 67% are continuous.
This pattern corresponds to the data-generating process used in the simulation.
However, the resulting classes differ in total size.
Overall, too many datapoints have been assigned to class 1.
```{r chunk-13, message=FALSE, warning=FALSE}
# ratio of correct classified datapoints
(sum(mod_id@cluster == simulated_data_norm$class))/dim(simulated_data_norm)[1]
# cross table of observation types and assigned classes
proof <- merge(simulated_data_norm,
unique(give_class(simulated_data_norm,mod_id,"id")), by = "id")
table(proof$type,proof$mod_comp)
```
In the dichotomous data, only the values 0 and 1 can be distinguished and the simulated dataset contains only one dichotomous datapoint for each id number.
Hence, no patterns can be detected based on these dataponints.
Therefore, it can be concluded, that the data can primarily be distinguished based on the continuous data points.
Lower id numbers in general exhibit less variability compared to higher id numbers.
This outcome is consistent with our expectations, as the data were simulated accordingly: The latent classes were determine only by the continuous data, where the beta coefficients were explicitly defined.
The subsequent combination of these beta values with theta then produced the dichotomous data points.
## Estimating latent classes only on continuous data
If it is assumed (as in this simulation) that the latent classes are determined only by one of the two data types, the input variable `latent` allows this to be taken into account during estimation.
It is possible to estimate latent classes only on continuous data (`latent = “cont”`) or only on dichotomous data (`latent = “dich”`).
If one of these options is chosen, it is not possible to estimate models without controlling for groups.
Additionally, the input variable `id_col` must be specified; it gives the name of the grouping variable as character.
Latent classes are then determined in a first step based only on one type of data.
In a second step, these classes are used to estimate the model parameters on both types of data.
Here, we want to account for preference heterogeneity in the continuous data part, and hence we set `latent = “cont”`.
```{r chunk-14, message=FALSE, warning=FALSE, results='hide'}
set.seed(227)
formula <- y ~ -1 + x1 + x2 + x3 | id
k <- 2
stv <- setNames(c(0.2,0,1,1,1),c("x1","x2","x3","sigma","theta"))
control <- list(iter.max = 1000, verbose = 4)
mod_cont <- hyreg2(formula = formula,
data = simulated_data_norm,
type = simulated_data_norm$type,
stv = stv,
k = k,
type_cont = "TTO",
type_dich = "DCE_A",
opt_method = "L-BFGS-B",
control = control,
latent = "cont",
id_col = "id"
)
```
The output object of `hyreg2` in this case is a list of length k+1 (k models plus one data frame called "id_classes", which contains the estimated group memberships from the first estimation step).
The functions `summary_hyreg2` and `plot_hyreg2` can still be applied without restrictions, as they automatically recognize whether the input is a single model or a list of models.
```{r chunk-14a, message=FALSE, warning=FALSE}
summary_hyreg2(mod_cont)
```
The estimates of this model nearly capture the true parameter values, especially for `x1` in both classes.
`x2` and `x3` are less precise, although the results are still acceptable.
Again, we have a look at the overall classification.
```{r chunk-15, message=FALSE, warning=FALSE}
p <- plot_hyreg2(data = simulated_data_norm,
x ="id",
y = "y",
id_col ="id",
class_df_model = give_class(data = simulated_data_norm,
model = mod_cont,
id_col = "id"))
p <- p + ggplot2::labs(title = "Figure 6")
p
```
```{r chunk-16, message=FALSE, warning=FALSE}
# ratio of correct classified datapoints
proof <- merge(unique(simulated_data_norm[,c("id","class")]),mod_cont[["id_classes"]], by = "id")
sum((proof$class == proof$mod_comp)/dim(proof)[1])
# cross table of observation types and assigned classes
proof <- merge(simulated_data_norm,
mod_cont[[k+1]], by = "id")
table(proof$type,proof$mod_comp)
```
Assessing the ratio of correct classification reveals that 95 % of the observations are now classified to the correct class.
In addition, the data are evenly distributed across classes and data types, as it was generated during the simulation.
Hence, using the input parameter `latent = “cont”` under control for the grouping variable id leads to very good results, since the true underlying classes could be identified and the estimated parameters come close to the true beta values from the data simulation.
# Advanced functionalities
In this section, we present the additional functionalities of the package.
We demonstrate how different types of input must be specified in order to estimate well-performing models.
At this point, we refrain from providing detailed interpretations of the obtained results.
## Estimating models using censored data
The function `hyreg2` can be applied to censored data by specifying the parameters `lower` and/or `upper`.
We simulate a new column `y_cens` in our dataset simulated_data_norm with censoring at `y = 3`.
This censoring can be incorporated in the model estimation by setting the parameter `upper = 3`.
Equivalent parameter `lower` can be set to a number, where data have lower bounds.
```{r chunk-18, message=FALSE, warning=FALSE, results='hide'}
# upper bound for simulated_data_norm$y at 3
# new variable y_cens
for(i in 1:length(simulated_data_norm$y)){
if(simulated_data_norm$y[i] <= 3){
simulated_data_norm$y_cens[i] <- simulated_data_norm$y[i]
}else{
simulated_data_norm$y_cens[i] <- 3
}
}
set.seed(227)
formula <- y_cens ~ -1 + x1 + x2 + x3 | id
k <- 2
stv <- setNames(c(0.2,0,1,1,1),c("x1","x2","x3","sigma","theta"))
control <- list(iter.max = 1000, verbose = 4)
mod_cens <- hyreg2(formula = formula,
data = simulated_data_norm,
type = simulated_data_norm$type,
stv = stv,
k = k,
type_cont = "TTO",
type_dich = "DCE_A",
opt_method = "L-BFGS-B",
control = control,
upper = 3,
latent = "cont",
id_col = "id"
)
```
We can inspect the results using `summary_hyreg2` and `plot_hyreg2`.
```{r chunk-19, message=FALSE, warning=FALSE, results='hide', fig.show='hide'}
summary_hyreg2(mod_cens)
plot_hyreg2(data = simulated_data_norm,
x ="id",
y = "y_cens",
id_col = "id",
class_df_model = give_class(data = simulated_data_norm,
model = mod_cens,
id_col = "id"))
```
## Providing start values as a list
In some situations, it may be useful to specify different starting values for the different latent classes.
This can potentially shorten or improve the estimation process, for instance, when local rather than global maxima are found during maximum likelihood estimation.
For this purpose, the input parameter `stv` can simply be defined as a `list` consisting of k named vectors.
```{r chunk-22, message=FALSE, warning=FALSE, results='hide'}
set.seed(227)
formula <- y ~ -1 + x1 + x2 + x3 | id
k <- 2
stv1 <- setNames(c(0.2,0,1,1,1),c("x1","x2","x3","sigma","theta"))
stv2 <- setNames(c(1,2,0.5,1,1),c("x1","x2","x3","sigma","theta"))
control <- list(iter.max = 1000, verbose = 4)
mod_stvl<- hyreg2(formula = formula,
data = simulated_data_norm,
type = simulated_data_norm$type,
stv = list(stv1,stv2),
k = k,
type_cont = "TTO",
type_dich = "DCE_A",
opt_method = "L-BFGS-B",
control = control,
latent = "cont",
id_col = "id"
)
summary_hyreg2(mod_stvl)
```
## Accounting for heteroscedasticity using start values from previous model estimates
**Function get_stv**
In some cases, it can be useful to use the estimated values from a simpler model as starting values for a subsequent model.
For this purpose, the function `get_stv` can be used.
Its input must be a previously estimated model.
If this model was estimated with `latent = "both"`, it is sufficient to provide the model name as input.
Additionally, the parameter `comp` can be used to specify from which class the parameters should be extracted, for example: `get_stv(mod = mod_id, comp = "Comp.1")`.
By default, the first class/component is used.
If the model to be used was estimated with `latent = "cont"` or `latent = "dich"`, the output of `hyreg2`is a list.
In this case, one of the k models from the output must be explicitly selected as input for `get_stv`, for example: `get_stv(mod = mod_cont[[1]])`.
Additionally, in this case, the input parameter `comp` must always be `"Comp.1"`, since the model component to be used is selected by specifying the list element.
**Accounting for heteroscedasticity**
The `hyreg2` package includes an additional function and an M-step driver that allow controlling for heteroscedasticity by estimating a separate model for sigma.
The corresponding function is called `hyreg2_het`.
This function supports all functionalities of `hyreg2` (except `formula_type_classic = FALSE`) and can therefore be used in an equivalent manner.
To account for heteroscedasticity in the data, an additional formula called `formula_sigma` and an additional vector of starting values for this formula (`stv_sigma`) can be specified.
The provided `formula_sigma` must be linear and the vector `stv_sigma` must contain start values for all parameters used in `formula_sigma`.
It is important to note that, when using `hyreg2_het`, neither `stv` nor `stv_sigma` are allowed to include a start value called `"sigma"`, because sigma is estimated with its own formula (in contrast to `hyreg2`, where `"sigma"` must always be specified in `stv`).
```{r chunk-24, message=FALSE, warning=FALSE, results='hide'}
set.seed(227)
formula <- y ~ -1 + x1 + x2 + x3 | id
formula_sigma <- y ~ x2 + x3
k <- 2
stv1 <- get_stv(mod_cont[[1]])[-4] # -4 since "sigma" is on fourth place in return vector
stv2 <- get_stv(mod_cont[[2]])[-4] # -4 since "sigma" is on fourth place in return vector
stv_sigma <- setNames(rep(0.1,3), c("x2","x3","(Intercept)"))
control <- list(iter.max = 1000, verbose = 4)
mod_het <- hyreg2_het(formula = formula,
formula_sigma = formula_sigma,
data = simulated_data_norm,
type = simulated_data_norm$type,
stv = list(stv1,stv2),
stv_sigma = stv_sigma,
k = k,
type_cont = "TTO",
type_dich = "DCE_A",
opt_method = "L-BFGS-B",
control = control,
latent = "cont",
id_col = "id"
)
```
```{r chunk-24a, message=FALSE, warning=FALSE}
summary_hyreg2(mod_het)
```
The estimates for sigma can be identified in the model output by the ending `"_h"`.
## Estimating models using partial coefficients
It is possible to specify partial coefficients, which are used only on continuous or only on dichotomous data.
*Example:*
Suppose different models should be specified for continuous and dichotomous data.
For example:
- Continuous data: `y ~ x1 + x3`
- Dichotomous data: `y ~ x1 + x2`
The `formula` input to `hyreg2` must then include all parameters that occur in either (i.e., in at least one) model:
`y ~ x1 + x2 + x3`
The assignment of parameters to data types is then achieved via the input arguments:
`variables_both = “x1”`, `variables_cont = “x3”` and `variables_dich = “x2”`.
Every variable included in the provided `formula` (except the grouping variable ) must appear in exactly one of these vectors.
One of the variables\_ vectors can also be `NULL`, if no variables should be used exclusively on this type of the data.
```{r chunk-23, message=FALSE, warning=FALSE, results='hide'}
set.seed(227)
formula <- y ~ -1 + x1 + x2 + x3 | id
k <- 2
stv <- setNames(c(0.2,0,1,1,1),c(colnames(simulated_data_norm)[3:5],c("sigma","theta")))
control <- list(iter.max = 1000, verbose = 4)
mod_partialc <- hyreg2(formula = formula,
data = simulated_data_norm,
type = simulated_data_norm$type,
stv = stv,
k = k,
type_cont = "TTO",
type_dich = "DCE_A",
opt_method = "L-BFGS-B",
control = control,
latent = "cont",
id_col = "id",
variables_both = "x1",
variables_cont = "x3",
variables_dich = "x2"
)
summary_hyreg2(mod_partialc)
```
## Estimating models using non-classic formulas
All inputs and functionalities that have been considered so far are used with classic R formulas and the default setting (`TRUE`) for the input `formula_type_classic`.
However, with this setting, it is not possible to estimate non-linear formulas or the 8-parameter model for EQ-5D data[@devlin2022] .
Therefore, in `hyreg2`, it is possible to specify model formulas with variables and parameters (non-classic formulas).
For example, if a model should be estimated for the mathematical equation $\frac{1}{e^{x_1 * \beta_{1} + x_2 * \beta_{2}}} = y$, this must be specified as a formula with additional parameters provided, which are not column names in the dataset: `formula = y ~ 1/exp(x1 * beta1 + x2 * beta2)`, where the `beta`s represent the parameters to be estimated and the`x`s are column names from the dataset.
Additionally, `formula_type_classic` must be set to `FALSE`.
When estimating an intercept, the formula must explicitly include a parameter named `"INTERCEPT"`(without a corresponding variable from the dataset), e.g. `formula = y ~ 1/exp(INTERCEPT + x1 * beta1 + x2 * beta2)`
We simulate a new column called `y_non` for the dataset simulated_data_norm and show how to estimate a model using non-classic formulas.
.
```{r chunk-26, message=FALSE, warning=FALSE}
# simulate vector y_non and add this to simulated_data_norm
#library(hyreg2)
### CLASS 1 ###
# Set parameters (same as in simulation above)
set.seed(42)
n_samples_tto1 <- 200
n_samples_dce1 <- 100
sigma1 <- 1.0
theta1 <- 5
beta_non1 <- c(0.5, -0.3, 0.8) # true values for beta
# simulate matrix
x_tto1 <- matrix(rnorm(n_samples_tto1 * length(beta_non1)), ncol = length(beta_non1))
x_dce1 <- matrix(rnorm(n_samples_dce1 * length(beta_non1)), ncol = length(beta_non1))
### NON-classic ###
# x1,x2 and x3 are columnnames of the dataset, beta1, beta2 and beta3 are the parameters to be estimated
formula1 <- y ~ (x1 * beta1 + x2 * beta2 ) * (x1 *beta1 + x3 * beta3)
x_tto_non1 <- data.frame(x1 = x_tto1[,1], x2 = x_tto1[,2], x3 = x_tto1[,3])
x_dce_non1 <- data.frame(x1 = x_dce1[,1], x2 = x_dce1[,2], x3 = x_dce1[,3])
beta_non1 <- setNames(beta_non1[1:3], c("beta1","beta2","beta3"))
Xb_tto_non1 <- hyreg2:::eval_formula_non(formula1, x_tto_non1, beta_non1)
Xb_dce_non1 <- (hyreg2:::eval_formula_non(formula1, x_dce_non1, beta_non1)) * theta1
# generate y_non
y_tto_non1 <- rnorm(n_samples_tto1, mean = Xb_tto_non1, sd = sigma1) # continuous outcomes
logistic_tmp_non1 <- 0.5 + 0.5 * tanh(Xb_dce_non1 / 2)
y_dce_non1 <- rbinom(n_samples_dce1, size = 1, prob = logistic_tmp_non1) # binary outcomes
### CLASS 2 ###
# Set parameters (same as in simulation above)
set.seed(37)
n_samples_tto2 <- 200
n_samples_dce2 <- 100
sigma2 <- 0.5
theta2 <- 2
beta_non2 <- c(1.4, 2.3, -0.2) # true values for beta
# simulate matrix
x_tto2 <- matrix(rnorm(n_samples_tto2 * length(beta_non2)), ncol = length(beta_non2))
x_dce2 <- matrix(rnorm(n_samples_dce2 * length(beta_non2)), ncol = length(beta_non2))
### NON-classic ###
# x1,x2 and x3 are columnnames of the dataset, beta1, beta2 and beta3 are the parameters to be estimated
formula2 <- y ~ (x1 * beta1 + x2 * beta2 ) * (x1 * beta1 + x3 * beta3)
x_tto_non2 <- data.frame(x1 = x_tto2[,1], x2 = x_tto2[,2], x3 = x_tto2[,3])
x_dce_non2 <- data.frame(x1 = x_dce2[,1], x2 = x_dce2[,2], x3 = x_dce2[,3])
beta_non2 <- setNames(beta_non2[1:3], c("beta1","beta2","beta3"))
Xb_tto_non2 <- hyreg2:::eval_formula_non(formula2, x_tto_non2, beta_non2)
Xb_dce_non2 <- (hyreg2:::eval_formula_non(formula2, x_dce_non2, beta_non2)) * theta2
# generate y_non
y_tto_non2 <- rnorm(n_samples_tto2, mean = Xb_tto_non2, sd = sigma2) # continuous outcomes
logistic_tmp_non2 <- 0.5 + 0.5 * tanh(Xb_dce_non2 / 2)
y_dce_non2 <- rbinom(n_samples_dce2, size = 1, prob = logistic_tmp_non2) # binary outcomes
simulated_data_norm$y_non <- c(y_tto_non1,y_dce_non1,y_tto_non2,y_dce_non2)
# clean environment
rm(n_samples_tto1,
n_samples_dce1,
sigma1,
theta1,
beta_non1,
n_samples_tto2,
n_samples_dce2,
sigma2,
theta2,
beta_non2,
logistic_tmp1,
logistic_tmp2,
logistic_tmp_non1,
logistic_tmp_non2,
x_tto_non1,
x_tto_non2,
x_dce_non1,
x_dce_non2,
Xb_tto_non1,
Xb_tto_non2,
Xb_dce_non1,
Xb_dce_non2,
x_dce1,
x_dce2,
x_tto1,
x_tto2,
y_dce_non1,
y_dce_non2,
y_tto_non1,
y_tto_non2,
formula1,
formula2
)
```
If `formula_type_classic = FALSE`, the provided input for `stv` must contain the parameter names (e.g. beta1) and not the variable names (e.g. x1) from the dataset.
All functionalities described above (expect accounting for heteroscedasticity) can be used with variables-and-parameters-formulas as well.
If partial coefficients should be used, the provided variables_vectors must contain the names of the parameters and not the variable names from the dataset.
```{r chunk-25, message=FALSE, warning=FALSE, results='hide'}
set.seed(259)
formula <- y_non ~ (x1 * beta1 + x2 * beta2 ) * (x1 * beta1 + x3 * beta3)| id
k <- 2
stv <- setNames(c(0.2,0,1,1,1),c("beta1","beta2","beta3","sigma","theta"))
control <- list(iter.max = 1000, verbose = 4)
mod_non <- hyreg2(formula = formula,
data = simulated_data_norm,
type = simulated_data_norm$type,
stv = stv,
k = k,
type_cont = "TTO",
type_dich = "DCE_A",
opt_method = "L-BFGS-B",
control = control,
latent = "cont",
id_col = "id",
formula_type_classic = FALSE
)
```
The output can be displayed using `summary_hyreg2` and the classification can be inspected using `plot_hyreg2`.
```{r chunk-21, message=FALSE, warning=FALSE, results='hide'}
summary_hyreg2(mod_non)
p <- plot_hyreg2(data = simulated_data_norm,
x ="id",
y = "y_non",
id_col = "id",
class_df_model = give_class(data = simulated_data_norm,
model = mod_non,
id_col = "id"))
p <- p + ggplot2::labs(title = "Figure 7")
p
```