The package in general

Our R package hyreg2 allows users to estimate models on a combination of continuous and dichotomous data using a combined likelihood. Latent classes can be detected using the EM algorithm (Dempster, Laird, and Rubin 1977). The package was originally written to address preference heterogeneity through a data-driven subgrouping of participants while estimating EQ-5D value sets (Devlin, Finch, and Parkin 2022). However, it can be applied to any situation in which a combined likelihood of continuous and dichotomous data should be maximized, as is common, for example, in environmental economics (Cameron and James 1987).

Functions

The code for estimating models is based on the R package flexmix (Leisch 2004a), which offers the option to write custom likelihood functions (M-step drivers) for use during the M-step (Leisch 2004b; Grün and Leisch 2008). For the hyreg2 package, we wrote M-step drivers that implement the hybrid model likelihood described by Ramos-Goñi et al. (2017). Additional wrapper functions hyreg2 and hyreg2_het are included in the package, so that users do not have to pass the drivers to flexmix themselves.

The package supports both linear and non-linear model specifications, with uncensored or censored data. Furthermore, heteroscedasticity can be taken into account for linear models using the hyreg2_het function. It is also possible to use different covariates for the continuous and the dichotomous part of the data. Start values can differ between the expected latent classes. Additionally, hyreg2 offers the option to estimate the latent classes either on both types of data (continuous and dichotomous) or on only one type. Once a model has been estimated, the functions summary_hyreg2, give_class, and plot_hyreg2 can be used to evaluate the results.

Basic functionalities

When using hyreg2 to estimate latent class models, it is essential to account for groups of observations, i.e., observations that belong together. Controlling for such dependencies ensures valid parameter estimation and inference.

To demonstrate the implications of neglecting to control for groups of observations, we first simulate a dataset with two latent classes. We then estimate a model without controlling for groups and explain where the model fails. At the end of this section, we demonstrate how a good model fit can be achieved.

Simulate Dataset

We demonstrate the basic functionalities using part of the simulated dataset simulated_data_norm, which is included in the hyreg2 package. The dataset consists of 7 variables, with 200 continuous and 100 dichotomous observations in each of the two latent classes, resulting in a total of 600 observations. The variables x1, x2 and x3 contain random draws from a standard normal distribution. Based on these, outcome values y are generated according to the linear predictor

\(x_1 \beta_{x_1} + x_2 \beta_{x_2} + x_3 \beta_{x_3} = Xb\).

Additionally, the variable y_non is computed according to the non-linear model \((x_1 \beta_{x_1} + x_2 \beta_{x_2})(x_1 \beta_{x_1} + x_3 \beta_{x_3}) = Xb_{non}\) (see the simulation code in the section on non-classic formulas below).

For the continuous data, y and y_non values were generated directly using the function rnorm with Xb or Xb_non as the mean. For the dichotomous data, the values of Xb or Xb_non are multiplied by a scaling parameter theta, passed through the logistic function, and the binary outcomes are then drawn using the function rbinom. The code for the simulation is shown below.

# simulate dataframe called simulated_data_norm

library(hyreg2)
library(flexmix)


### CLASS 1 ###

# Set parameters
set.seed(42)
n_samples_tto1 <- 200
n_samples_dce1 <- 100
sigma1 <- 1.0
theta1 <- 5
beta1 <- c(0.5, -0.3, 0.8)  # true values for beta

# simulate matrix
x_tto1 <- matrix(rnorm(n_samples_tto1 * length(beta1)), ncol = length(beta1))
x_dce1 <- matrix(rnorm(n_samples_dce1 * length(beta1)), ncol = length(beta1))



#Xb (linear predictor following formula xb = x1*beta1 + x2*beta2 + x3*beta3)
Xb_tto1 <- x_tto1 %*% beta1
Xb_dce1 <- (x_dce1 %*% beta1) * theta1


# generate y
y_tto1 <- rnorm(n_samples_tto1, mean = Xb_tto1, sd = sigma1)  # continuous outcomes
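# 0.5 + 0.5*tanh(x/2) is the logistic function 1/(1 + exp(-x)), i.e. plogis(x)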
logistic_tmp1 <- 0.5 + 0.5 * tanh(Xb_dce1 / 2)
y_dce1 <- rbinom(n_samples_dce1, size = 1, prob = logistic_tmp1)  # binary outcomes



### dataset ###
data_tto1 <- data.frame(
  type = rep("TTO", n_samples_tto1),
  y = y_tto1,
  x1 = x_tto1[, 1],
  x2 = x_tto1[, 2],
  x3 = x_tto1[, 3]
)

data_dce1 <- data.frame(
  type = rep("DCE_A", n_samples_dce1),
  y = y_dce1,
  x1 = x_dce1[, 1],
  x2 = x_dce1[, 2],
  x3 = x_dce1[, 3]
)

simulated_data1 <- rbind(data_tto1, data_dce1)


### CLASS 2 ###

# Set parameters
set.seed(37)
n_samples_tto2 <- 200
n_samples_dce2 <- 100
sigma2 <- 0.5
theta2 <- 2
beta2 <- c(1.4, 2.3, -0.2)   # true values for beta


# simulate matrix
x_tto2 <- matrix(rnorm(n_samples_tto2 * length(beta2)), ncol = length(beta2))
x_dce2 <- matrix(rnorm(n_samples_dce2 * length(beta2)), ncol = length(beta2))


#  Xb (linear predictor following formula xb = x1*beta1 + x2*beta2 + x3*beta3)
Xb_tto2 <- x_tto2 %*% beta2
Xb_dce2 <- (x_dce2 %*% beta2) * theta2


# generate y
y_tto2 <- rnorm(n_samples_tto2, mean = Xb_tto2, sd = sigma2)  # continuous outcomes
logistic_tmp2 <- 0.5 + 0.5 * tanh(Xb_dce2 / 2)
y_dce2 <- rbinom(n_samples_dce2, size = 1, prob = logistic_tmp2)  # binary outcomes



### dataset ###
data_tto2 <- data.frame(
  type = rep("TTO", n_samples_tto2),
  y = y_tto2,
  x1 = x_tto2[, 1],
  x2 = x_tto2[, 2],
  x3 = x_tto2[, 3]
)

data_dce2 <- data.frame(
  type = rep("DCE_A", n_samples_dce2),
  y = y_dce2,
  x1 = x_dce2[, 1],
  x2 = x_dce2[, 2],
  x3 = x_dce2[, 3]
)

simulated_data2 <- rbind(data_tto2, data_dce2)


### DATASET with two classes ###
simulated_data1$class <- 1
simulated_data2$class <- 2

# set id numbers
simulated_data1$id <- c(1:100, 1:100, 1:100)
simulated_data2$id <- c(101:200, 101:200, 101:200)


simulated_data_norm <- rbind(simulated_data1, simulated_data2)


# clean environment
rm(list = setdiff(ls(), "simulated_data_norm"))
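
As a quick check of this structure, the observations can be cross-tabulated by data type and true class; by construction, each class contains 200 continuous (TTO) and 100 dichotomous (DCE_A) observations.

with(simulated_data_norm, table(type, class))
##        class
## type      1   2
##   DCE_A 100 100
##   TTO   200 200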

The column type indicates whether a data point was generated from a normal distribution (continuous data) or a binomial distribution (dichotomous data). The column id specifies which continuous observations are linked with which dichotomous observations and hence serves as a grouping variable. For this dataset, each id is associated with two continuous observations and one dichotomous observation. The column class contains the true underlying class memberships of the data. These are visualized in figure 1 using the function plot_hyreg2.

# visualize true classes from simulated_data_norm
p <- plot_hyreg2(data = simulated_data_norm,
           x = "id",
           y = "y",
           id_col = "id",
           class_df_model = simulated_data_norm[,c("id","class")],
           colors = c("#F8766D","#00BFC4"))
p <- p +  ggplot2::labs(title = "Figure 1")
p

The function plot_hyreg2 expects a dataset (data) and two column names of this dataset as character strings (x and y). Additionally, the inputs id_col and class_df_model must be specified. id_col must be the name of the grouping variable (as a character string), while class_df_model must be a dataframe containing two columns: the grouping variable (which must be named according to id_col) and the associated classes. In this case, we already know which id is in which class, since we have simulated the dataset ourselves. In real-world applications, users should instead use the give_class function to match each id to the class estimated by the model. This is explained in further detail below. plot_hyreg2 returns a ggplot object, hence additional ggplot2-specific layers can be added, as demonstrated above.

Estimation without further restrictions

First, we estimate a model without controlling for groups, i.e., we assign latent classes to each observation independently, without accounting for the fact that, in the simulated dataset, two continuous and one dichotomous observation belong together.

Since our goal is to identify two latent classes, we specify k = 2. As we would like to estimate the mathematical model \(x_1 \beta_{x_1} + x_2 \beta_{x_2} + x_3 \beta_{x_3} = y\), we specify formula in R as y ~ -1 + x1 + x2 + x3. Starting values must be provided for all variables in formula, hence for x1, x2 and x3. In addition, starting values for the scaling parameter theta and the standard deviation sigma of the continuous part must be supplied. These values must be named “theta” and “sigma”, otherwise the function will not recognize them.

The above formula does not include an intercept. If the model to be estimated does contain an intercept or interaction terms, start values must be provided for these as well. They must be named according to the column names of model.matrix(formula, data), e.g. “(Intercept)”.
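
For example, the required names can be read off the design matrix directly:

colnames(model.matrix(y ~ x1 + x2 + x3, data = simulated_data_norm))
## [1] "(Intercept)" "x1"          "x2"          "x3"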

Now, we specify a named vector stv, which contains the assumed start values for x1, x2, x3 as well as for theta and sigma. Furthermore, in hyreg2 the input argument type must be specified. In our dataset, the column indicating which datapoint was simulated from a normal distribution and which from a binomial distribution is simply called type, with the entries “TTO” and “DCE_A”. In this first example, we set latent = “both”, which implies that the latent classes are assumed to manifest in both types of data.

set.seed(277)
formula <- y ~  -1 + x1 + x2 + x3
k <- 2
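# start values must be named; the names "sigma" and "theta" are mandatory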
stv <- setNames(c(0.2,0,1,1,1),c("x1","x2","x3","sigma","theta"))
control <- list(iter.max = 1000, verbose = 4)

mod <- hyreg2(formula = formula,
                     data =  simulated_data_norm,
                     type =  simulated_data_norm$type,
                     stv = stv,
                     k = k,
                     type_cont = "TTO",
                     type_dich = "DCE_A",
                     opt_method = "L-BFGS-B",
                     control = control,
                     latent = "both"
)

The function summary_hyreg2 can be used to display the estimated parameters for both model classes. Please note that the outputs for sigma and theta are on a log scale and have to be transformed using exp() to obtain the correct estimated values.

summary_hyreg2(mod)
## $Comp.1
## Maximum likelihood estimation
## 
## Call:
## bbmle::mle2(minuslogl = logLik2, start = stv_new, method = opt_method, 
##     optimizer = optimizer, lower = -Inf, upper = Inf)
## 
## Coefficients:
##        Estimate Std. Error z value     Pr(z)    
## x1     0.492032   0.051579  9.5395 < 2.2e-16 ***
## x2    -0.341833   0.040001 -8.5457 < 2.2e-16 ***
## x3     0.777218   0.068013 11.4275 < 2.2e-16 ***
## sigma  0.107463   0.050454  2.1299   0.03318 *  
## theta  3.090415   0.412147  7.4983 6.463e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## -2 log L: 614.2638 
## 
## $Comp.2
## Maximum likelihood estimation
## 
## Call:
## bbmle::mle2(minuslogl = logLik2, start = stv_new, method = opt_method, 
##     optimizer = optimizer, lower = -Inf, upper = Inf)
## 
## Coefficients:
##        Estimate Std. Error  z value     Pr(z)    
## x1     1.495192   0.035832  41.7281 < 2.2e-16 ***
## x2     2.437797   0.034263  71.1496 < 2.2e-16 ***
## x3    -0.121824   0.037550  -3.2444  0.001177 ** 
## sigma -0.643939   0.049627 -12.9756 < 2.2e-16 ***
## theta  0.359478   0.205821   1.7466  0.080715 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## -2 log L: 383.2935
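
For example, the log-scale estimates for class 1 shown above can be back-transformed as follows:

exp(0.107463)   # sigma, class 1: approx. 1.11
exp(3.090415)   # theta, class 1: approx. 21.99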

We visualize the estimated values using the function plot_hyreg2. The argument id_col refers to the grouping column. If the model was estimated without grouping (as in this case), the row names/numbers should simply be provided. The input class_df_model requires a data frame with two columns, indicating which group (or observation) from the data has been assigned to which class by the model. Such a data frame can easily be generated using the function give_class. Again, if the model was estimated without controlling for groups, the row names of the dataset should be used as the input for id_col in the function give_class.

# if model was estimated without grouping variable
simulated_data_norm$observation <- rownames(simulated_data_norm)


p <- plot_hyreg2(data = simulated_data_norm,
           x = "id",
           y = "y",
           id_col = "observation",
           class_df_model = give_class(data = simulated_data_norm,
                                 model = mod,
                                 id_col = "observation"))
p <- p +  ggplot2::labs(title = "Figure 2")
p

This plot shows both types of observations (continuous and dichotomous). There are two bands of mostly blue (class 2) datapoints; these correspond to the dichotomous data. It seems that the model (erroneously!) classified the data by data type rather than by latent class.

Using plot_hyreg2, it is also possible to display only the continuous data or only the dichotomous data. For this purpose, the argument type_to_plot is used. This argument expects a list with two elements: the first element contains the name of the column used to differentiate between the data types (as a character string), and the second element specifies the value associated with the desired type of data to be included in the plot. This works analogously to type and type_cont / type_dich in the hyreg2 input.

# only continuous datapoints 
p <- plot_hyreg2(data = simulated_data_norm,
           x = "id",
           y = "y",
           id_col = "observation",
           class_df_model = give_class(data = simulated_data_norm,
                                 model = mod,
                                 id_col = "observation"),
           type_to_plot = list("type","TTO"))
p <- p +  ggplot2::labs(title = "Figure 3")
p

# only dichotomous datapoints
p <- plot_hyreg2(data = simulated_data_norm,
           x = "id",
           y = "y",
           id_col = "observation",
           class_df_model = give_class(data = simulated_data_norm,
                                 model = mod,
                                 id_col = "observation"),
           type_to_plot = list("type","DCE_A"))
p <- p +  ggplot2::labs(title = "Figure 4")
p

Figure 3 shows only the continuous datapoints, while figure 4 shows only the dichotomous ones. It can be seen that the dichotomous observations with lower id numbers were often assigned to class 2, while the continuous observations with lower id numbers were often assigned to class 1. Additionally, the visualizations suggest that a considerable proportion of the dichotomous data is concentrated in class 2, whereas the majority of the continuous data is assigned to class 1. Even though the estimated model parameters for class 1 come quite close to the true parameters, the classification overall is far from the true underlying classes: only 58% of the datapoints were classified correctly. Moreover, it is useful to look at the distribution of data types across the two latent classes.

# ratio of correct classified datapoints
(sum(mod@cluster == simulated_data_norm$class))/dim(simulated_data_norm)[1]
## [1] 0.58
# cross table of observation types and assigned classes
proof <- merge(simulated_data_norm,
                 give_class(simulated_data_norm,mod,"observation"), by = "observation")
table(proof$type,proof$mod_comp)
##        
##           1   2
##   DCE_A  41 159
##   TTO   247 153

The model classifies more than 75% of the dichotomous data into class 2 and less than 25% into class 1. By contrast, more than 60% of the continuous data points are assigned to class 1. Overall, only 14% of the observations in class 1 are dichotomous, whereas in class 2 their proportion exceeds 50%.
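
These shares can be computed directly from the cross table:

# shares of data types within each assigned class (column-wise proportions)
prop.table(table(proof$type, proof$mod_comp), margin = 2)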

All in all, this shows that it is necessary to control for groups of observations during the estimation process. Otherwise, the model tends to separate dichotomous from continuous data. This is undesirable, as we would expect both classes to exhibit the same proportion of continuous and dichotomous data.

Classifying individuals, not observations

Including a grouping variable in the formula ensures that the class probabilities in the E-step of the EM algorithm (via flexmix) are estimated per group rather than per observation. This can be done by adding "| group_variable_name" to formula. Further details can be found in Grün and Leisch (2008). This specification has no impact on the likelihood function used in the M-step; the model estimates change solely because groups, rather than individual observations, are classified.

set.seed(227)
formula <- y ~  -1 + x1 + x2 + x3 | id
k <- 2
stv <- setNames(c(0.2,0,1,1,1),c("x1","x2","x3","sigma","theta"))
control <- list(iter.max = 1000, verbose = 4)

mod_id <- hyreg2(formula = formula,
                     data =  simulated_data_norm,
                     type =  simulated_data_norm$type,
                     stv = stv,
                     k = k,
                     type_cont = "TTO",
                     type_dich = "DCE_A",
                     opt_method = "L-BFGS-B",
                     control = control,
                     latent = "both"
)
summary_hyreg2(mod_id)
## $Comp.1
## Maximum likelihood estimation
## 
## Call:
## bbmle::mle2(minuslogl = logLik2, start = stv_new, method = opt_method, 
##     optimizer = optimizer, lower = -Inf, upper = Inf)
## 
## Coefficients:
##       Estimate Std. Error z value     Pr(z)    
## x1    0.824467   0.084435  9.7645 < 2.2e-16 ***
## x2    0.924986   0.084452 10.9529 < 2.2e-16 ***
## x3    0.619043   0.078750  7.8608 3.816e-15 ***
## sigma 0.521685   0.040305 12.9435 < 2.2e-16 ***
## theta 1.261959   0.251067  5.0264 4.998e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## -2 log L: 1241.915 
## 
## $Comp.2
## Maximum likelihood estimation
## 
## Call:
## bbmle::mle2(minuslogl = logLik2, start = stv_new, method = opt_method, 
##     optimizer = optimizer, lower = -Inf, upper = Inf)
## 
## Coefficients:
##        Estimate Std. Error z value     Pr(z)    
## x1     1.550978   0.061598 25.1788 < 2.2e-16 ***
## x2     2.433191   0.062067 39.2026 < 2.2e-16 ***
## x3    -0.095873   0.059479 -1.6119     0.107    
## sigma -0.506043   0.074209 -6.8191 9.160e-12 ***
## theta -1.637745   0.390157 -4.1977 2.697e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## -2 log L: 325.574

The model estimates look worse than those of the model without controlling for groups. Nevertheless, we examine the classification achieved by the model in the following plot.

p <- plot_hyreg2(data = simulated_data_norm,
           x ="id",
           y = "y",
           id_col ="id",
           class_df_model = unique(give_class(data = simulated_data_norm,
                                        model = mod_id,
                                        id_col = "id")))
p <- p +  ggplot2::labs(title = "Figure 5")
p

It can be seen that most of the dichotomous and continuous data points with lower id numbers were assigned to class 1. For higher id numbers, an assignment to class 2 would be expected. Although this assignment occurred in many cases, several data points with higher id numbers were nevertheless incorrectly assigned to class 1. An evaluation of the classification indicates that 74% of the data points were correctly assigned, a clear improvement compared to the model without control for grouping. An examination of the distribution further reveals that in both classes, approximately 33% of the data are dichotomous and about 67% are continuous. This pattern corresponds to the data-generating process used in the simulation. However, the resulting classes differ in total size: overall, too many datapoints have been assigned to class 1.

# ratio of correct classified datapoints
(sum(mod_id@cluster == simulated_data_norm$class))/dim(simulated_data_norm)[1]
## [1] 0.74
# cross table of observation types and assigned classes

proof <- merge(simulated_data_norm,
                 unique(give_class(simulated_data_norm,mod_id,"id")), by = "id")
table(proof$type,proof$mod_comp)
##        
##           1   2
##   DCE_A 134  66
##   TTO   268 132

In the dichotomous data, only the values 0 and 1 can be distinguished, and the simulated dataset contains only one dichotomous datapoint for each id number. Hence, no patterns can be detected based on these datapoints, and the classes can primarily be distinguished based on the continuous data points. Lower id numbers in general exhibit less variability compared to higher id numbers.

This outcome is consistent with our expectations, as the data were simulated accordingly: the latent classes were determined only by the continuous data, where the beta coefficients were explicitly defined. The subsequent combination of these beta values with theta then produced the dichotomous data points.

Estimating latent classes only on continuous data

If it is assumed (as in this simulation) that the latent classes are determined by only one of the two data types, the input variable latent allows this to be taken into account during estimation. It is possible to estimate latent classes only on the continuous data (latent = “cont”) or only on the dichotomous data (latent = “dich”). If one of these options is chosen, it is not possible to estimate models without controlling for groups. Additionally, the input variable id_col must be specified; it gives the name of the grouping variable as a character string. Latent classes are then determined in a first step based on only one type of data. In a second step, these classes are used to estimate the model parameters on both types of data.

Here, we want to account for preference heterogeneity in the continuous data part, and hence we set latent = “cont”.

set.seed(227)
formula <- y ~  -1 + x1 + x2 + x3 | id
k <- 2
stv <- setNames(c(0.2,0,1,1,1),c("x1","x2","x3","sigma","theta"))
control <- list(iter.max = 1000, verbose = 4)

mod_cont <- hyreg2(formula = formula,
                     data =  simulated_data_norm,
                     type =  simulated_data_norm$type,
                     stv = stv,
                     k = k,
                     type_cont = "TTO",
                     type_dich = "DCE_A",
                     opt_method = "L-BFGS-B",
                     control = control,
                     latent = "cont",
                     id_col = "id"
)

The output object of hyreg2 in this case is a list of length k+1 (k models plus one data frame called “id_classes”, which contains the estimated group memberships from the first estimation step). The functions summary_hyreg2 and plot_hyreg2 can still be applied without restrictions, as they automatically recognize whether the input is a single model or a list of models.
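
For example, the elements of the returned list can be accessed directly:

length(mod_cont)                # k + 1 = 3: two fitted models plus "id_classes"
head(mod_cont[["id_classes"]])  # estimated class membership per id from step one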

summary_hyreg2(mod_cont)
## [[1]]
## [[1]]$Comp.1
## Maximum likelihood estimation
## 
## Call:
## bbmle::mle2(minuslogl = logLik2, start = stv_new, method = opt_method, 
##     optimizer = optimizer, lower = -Inf, upper = Inf)
## 
## Coefficients:
##        Estimate Std. Error z value     Pr(z)    
## x1     0.526295   0.070000  7.5185 5.542e-14 ***
## x2    -0.191621   0.066908 -2.8640  0.004184 ** 
## x3     0.815761   0.069974 11.6580 < 2.2e-16 ***
## sigma  0.090409   0.050060  1.8060  0.070918 .  
## theta  1.130938   0.215029  5.2595 1.445e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## -2 log L: 674.4513 
## 
## 
## [[2]]
## [[2]]$Comp.1
## Maximum likelihood estimation
## 
## Call:
## bbmle::mle2(minuslogl = logLik2, start = stv_new, method = opt_method, 
##     optimizer = optimizer, lower = -Inf, upper = Inf)
## 
## Coefficients:
##        Estimate Std. Error  z value     Pr(z)    
## x1     1.494318   0.033104  45.1401 < 2.2e-16 ***
## x2     2.451595   0.031989  76.6394 < 2.2e-16 ***
## x3    -0.120248   0.035521  -3.3852 0.0007112 ***
## sigma -0.723616   0.050015 -14.4680 < 2.2e-16 ***
## theta  0.384434   0.207533   1.8524 0.0639684 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## -2 log L: 346.0418

The estimates of this model nearly capture the true parameter values, especially for x1 in both classes. The estimates for x2 and x3 are less precise, although still acceptable. Again, we have a look at the overall classification.

p <- plot_hyreg2(data = simulated_data_norm,
           x ="id",
           y = "y",
           id_col ="id",
           class_df_model = give_class(data = simulated_data_norm,
                                model = mod_cont,
                                id_col = "id"))
p <- p +  ggplot2::labs(title = "Figure 6")
p

# ratio of correct classified datapoints
proof <- merge(unique(simulated_data_norm[,c("id","class")]), mod_cont[["id_classes"]], by = "id")
sum(proof$class == proof$mod_comp) / nrow(proof)
## [1] 0.95
# cross table of observation types and assigned classes

proof <- merge(simulated_data_norm,
                mod_cont[[k+1]], by = "id")
table(proof$type,proof$mod_comp)
##        
##           1   2
##   DCE_A 100 100
##   TTO   200 200

Assessing the ratio of correct classification reveals that 95% of the observations are now assigned to the correct class. In addition, the data are evenly distributed across classes and data types, as generated in the simulation.

Hence, using the input parameter latent = “cont” while controlling for the grouping variable id leads to very good results: the true underlying classes could be identified, and the estimated parameters come close to the true beta values from the data simulation.

Advanced functionalities

In this section, we present the additional functionalities of the package. We demonstrate how different types of input must be specified in order to estimate well-performing models. At this point, we refrain from providing detailed interpretations of the obtained results.

Estimating models using censored data

The function hyreg2 can be applied to censored data by specifying the parameters lower and/or upper. We simulate a new column y_cens in our dataset simulated_data_norm with censoring at y = 3. This censoring can be incorporated in the model estimation by setting the parameter upper = 3. Equivalently, the parameter lower can be set to a number where the data have a lower bound.

# upper bound for simulated_data_norm$y at 3
# new variable y_cens

simulated_data_norm$y_cens <- pmin(simulated_data_norm$y, 3)  # censor y at 3


set.seed(227)
formula <- y_cens ~  -1 + x1 + x2 + x3 | id
k <- 2
stv <- setNames(c(0.2,0,1,1,1),c("x1","x2","x3","sigma","theta"))
control <- list(iter.max = 1000, verbose = 4)

mod_cens <- hyreg2(formula = formula,
                     data =  simulated_data_norm,
                     type =  simulated_data_norm$type,
                     stv = stv,
                     k = k,
                     type_cont = "TTO",
                     type_dich = "DCE_A",
                     opt_method = "L-BFGS-B",
                     control = control,
                     upper = 3,
                     latent = "cont",
                     id_col = "id"
)

We can inspect the results using summary_hyreg2 and plot_hyreg2.

summary_hyreg2(mod_cens)


plot_hyreg2(data = simulated_data_norm,
           x ="id",
           y = "y_cens",
           id_col = "id",
           class_df_model = give_class(data = simulated_data_norm,
                                 model = mod_cens,
                                 id_col = "id"))

Providing start values as a list

In some situations, it may be useful to specify different starting values for the different latent classes. This can shorten or improve the estimation process, for instance when only local rather than global maxima are found during maximum likelihood estimation. For this purpose, the input parameter stv can simply be defined as a list of k named vectors.

set.seed(227)
formula <- y ~  -1 + x1 + x2 + x3 | id
k <- 2
stv1 <- setNames(c(0.2,0,1,1,1),c("x1","x2","x3","sigma","theta"))
stv2 <- setNames(c(1,2,0.5,1,1),c("x1","x2","x3","sigma","theta"))
control <- list(iter.max = 1000, verbose = 4)

mod_stvl<- hyreg2(formula = formula,
                     data =  simulated_data_norm,
                     type =  simulated_data_norm$type,
                     stv = list(stv1,stv2),
                     k = k,
                     type_cont = "TTO",
                     type_dich = "DCE_A",
                     opt_method = "L-BFGS-B",
                     control = control,
                     latent = "cont",
                     id_col = "id"
)

summary_hyreg2(mod_stvl)

Accounting for heteroscedasticity using start values from previous model estimates

Function get_stv

In some cases, it can be useful to use the estimated values from a simpler model as starting values for a subsequent model. For this purpose, the function get_stv can be used. Its input must be a previously estimated model. If this model was estimated with latent = "both", it is sufficient to provide the model name as input. Additionally, the parameter comp can be used to specify from which class the parameters should be extracted, for example: get_stv(mod = mod_id, comp = "Comp.1"). By default, the first class/component is used. If the model to be used was estimated with latent = "cont" or latent = "dich", the output of hyreg2 is a list. In this case, one of the k models from the output must be explicitly selected as input for get_stv, for example: get_stv(mod = mod_cont[[1]]). Additionally, in this case, the input parameter comp must always be "Comp.1", since the model component to be used is selected by specifying the list element.
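
For example, using the models estimated above:

# model estimated with latent = "both": select the component via comp
get_stv(mod = mod_id, comp = "Comp.1")

# model estimated with latent = "cont": select the list element instead
get_stv(mod = mod_cont[[1]])   # comp defaults to "Comp.1"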

Accounting for heteroscedasticity

The hyreg2 package includes an additional function and an M-step driver that allow controlling for heteroscedasticity by estimating a separate model for sigma. The corresponding function is called hyreg2_het. This function supports all functionalities of hyreg2 (except formula_type_classic = FALSE) and can therefore be used in an equivalent manner. To account for heteroscedasticity in the data, an additional formula called formula_sigma and an additional vector of starting values for this formula (stv_sigma) can be specified. The provided formula_sigma must be linear, and the vector stv_sigma must contain start values for all parameters used in formula_sigma. It is important to note that, when using hyreg2_het, neither stv nor stv_sigma is allowed to include a start value called "sigma", because sigma is estimated with its own formula (in contrast to hyreg2, where "sigma" must always be specified in stv).

set.seed(227)
formula <- y ~  -1 + x1 + x2 + x3 | id
formula_sigma <- y ~ x2 + x3
k <- 2
stv1 <- get_stv(mod_cont[[1]])[-4]  # -4 since "sigma" is on fourth place in return vector
stv2 <- get_stv(mod_cont[[2]])[-4]  # -4 since "sigma" is on fourth place in return vector
stv_sigma <- setNames(rep(0.1,3), c("x2","x3","(Intercept)"))
control <- list(iter.max = 1000, verbose = 4)


mod_het <- hyreg2_het(formula = formula,
                     formula_sigma = formula_sigma,
                     data =  simulated_data_norm,
                     type =  simulated_data_norm$type,
                     stv = list(stv1,stv2),
                     stv_sigma = stv_sigma,
                     k = k,
                     type_cont = "TTO",
                     type_dich = "DCE_A",
                     opt_method = "L-BFGS-B",
                     control = control,
                     latent = "cont",
                     id_col = "id"
)
summary_hyreg2(mod_het)
## [[1]]
## [[1]]$Comp.1
## Maximum likelihood estimation
## 
## Call:
## bbmle::mle2(minuslogl = logLik2, start = stv_new, method = opt_method, 
##     optimizer = optimizer, lower = lower, upper = upper)
## 
## Coefficients:
##                Estimate Std. Error z value     Pr(z)    
## x1             0.526189   0.070770  7.4352 1.044e-13 ***
## x2            -0.194231   0.066993 -2.8992  0.003741 ** 
## x3             0.817540   0.070246 11.6382 < 2.2e-16 ***
## theta          1.128744   0.215195  5.2452 1.561e-07 ***
## x2_h          -0.015117   0.057640 -0.2623  0.793117    
## x3_h           0.021948   0.045769  0.4795  0.631558    
## (Intercept)_h  0.091056   0.050145  1.8158  0.069394 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## -2 log L: 674.1671 
## 
## 
## [[2]]
## [[2]]$Comp.1
## Maximum likelihood estimation
## 
## Call:
## bbmle::mle2(minuslogl = logLik2, start = stv_new, method = opt_method, 
##     optimizer = optimizer, lower = lower, upper = upper)
## 
## Coefficients:
##                 Estimate Std. Error  z value     Pr(z)    
## x1             1.4937706  0.0337419  44.2704 < 2.2e-16 ***
## x2             2.4493005  0.0334537  73.2146 < 2.2e-16 ***
## x3            -0.1201780  0.0356134  -3.3745 0.0007395 ***
## theta          0.3853762  0.2075727   1.8566 0.0633704 .  
## x2_h          -0.0108074  0.0481854  -0.2243 0.8225327    
## x3_h           0.0014345  0.0539905   0.0266 0.9788031    
## (Intercept)_h -0.7236113  0.0500182 -14.4670 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## -2 log L: 345.9915

The estimates for the sigma model can be identified in the output by the suffix "_h".
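
Assuming sigma uses the same log scale as in hyreg2 (an assumption suggested by the summary_hyreg2 note above, not a documented guarantee), the baseline sigma at x2 = x3 = 0 could be recovered from the intercept estimates:

exp(0.091056)    # class 1 baseline sigma (assumed log scale): approx. 1.10
exp(-0.7236113)  # class 2 baseline sigma (assumed log scale): approx. 0.485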

Estimating models using partial coefficients

It is possible to specify partial coefficients, which are used only on continuous or only on dichotomous data.

Example:

Suppose different models should be specified for continuous and dichotomous data. For example:

  • Continuous data: y ~ x1 + x3

  • Dichotomous data: y ~ x1 + x2

The formula input to hyreg2 must then include all parameters that occur in either (i.e., in at least one) model:

y ~ x1 + x2 + x3

The assignment of parameters to data types is then achieved via the input arguments:

variables_both = "x1", variables_cont = "x3" and variables_dich = "x2".

Every variable included in the provided formula (except the grouping variable) must appear in exactly one of these vectors. One of the variables_ vectors can also be NULL if no variables should be used exclusively on that type of data.

set.seed(227)
formula <- y ~  -1 + x1 + x2 + x3 | id
k <- 2
stv <- setNames(c(0.2,0,1,1,1),c(colnames(simulated_data_norm)[3:5],c("sigma","theta")))
control <- list(iter.max = 1000, verbose = 4)

mod_partialc <- hyreg2(formula = formula,
                     data =  simulated_data_norm,
                     type =  simulated_data_norm$type,
                     stv = stv,
                     k = k,
                     type_cont = "TTO",
                     type_dich = "DCE_A",
                     opt_method = "L-BFGS-B",
                     control = control,
                     latent = "cont",
                     id_col = "id",
                     variables_both = "x1",
                     variables_cont = "x3",
                     variables_dich = "x2"
)

summary_hyreg2(mod_partialc)

Estimating models using non-classic formulas

All inputs and functionalities considered so far use classic R formulas and the default setting (TRUE) for the input formula_type_classic. However, with this setting it is not possible to estimate non-linear formulas or the 8-parameter model for EQ-5D data (Devlin, Finch, and Parkin 2022). Therefore, in hyreg2, it is possible to specify model formulas with variables and parameters (non-classic formulas). For example, if a model should be estimated for the mathematical equation \(\frac{1}{e^{x_1 \beta_1 + x_2 \beta_2}} = y\), this must be specified as a formula with additional parameters that are not column names in the dataset: formula = y ~ 1/exp(x1 * beta1 + x2 * beta2), where the betas represent the parameters to be estimated and the xs are column names from the dataset. Additionally, formula_type_classic must be set to FALSE. When estimating an intercept, the formula must explicitly include a parameter named "INTERCEPT" (without a corresponding variable from the dataset), e.g. formula = y ~ 1/exp(INTERCEPT + x1 * beta1 + x2 * beta2).
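
A minimal sketch of such a specification (the start values below are arbitrary placeholders, chosen only for illustration):

# non-classic formula with an explicit intercept parameter
formula_int <- y ~ 1/exp(INTERCEPT + x1 * beta1 + x2 * beta2)
stv_int <- setNames(c(0, 0.2, 0.2, 1, 1),
                    c("INTERCEPT", "beta1", "beta2", "sigma", "theta"))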

We simulate a new column called y_non for the dataset simulated_data_norm and show how to estimate a model using non-classic formulas.

# simulate vector y_non and add this to simulated_data_norm

#library(hyreg2)

### CLASS 1 ###

# Set parameters (same as in simulation above)
set.seed(42)
n_samples_tto1 <- 200
n_samples_dce1 <- 100
sigma1 <- 1.0
theta1 <- 5
beta_non1 <- c(0.5, -0.3, 0.8)  # true values for beta

# simulate matrix
x_tto1 <- matrix(rnorm(n_samples_tto1 * length(beta_non1)), ncol = length(beta_non1))
x_dce1 <- matrix(rnorm(n_samples_dce1 * length(beta_non1)), ncol = length(beta_non1))


### NON-classic ###
# x1, x2 and x3 are column names of the dataset; beta1, beta2 and beta3 are the parameters to be estimated
formula1 <- y ~ (x1 * beta1 + x2 * beta2 ) * (x1 *beta1 + x3 * beta3)
x_tto_non1 <- data.frame(x1 = x_tto1[,1], x2 = x_tto1[,2], x3 = x_tto1[,3])
x_dce_non1 <- data.frame(x1 = x_dce1[,1], x2 = x_dce1[,2], x3 = x_dce1[,3])
beta_non1 <- setNames(beta_non1[1:3], c("beta1","beta2","beta3"))

Xb_tto_non1 <- hyreg2:::eval_formula_non(formula1, x_tto_non1, beta_non1)
Xb_dce_non1 <- (hyreg2:::eval_formula_non(formula1, x_dce_non1, beta_non1)) * theta1


# generate y_non
y_tto_non1 <- rnorm(n_samples_tto1, mean = Xb_tto_non1, sd = sigma1)  # continuous outcomes
logistic_tmp_non1 <- 0.5 + 0.5 * tanh(Xb_dce_non1 / 2)
y_dce_non1 <- rbinom(n_samples_dce1, size = 1, prob = logistic_tmp_non1)  # binary outcomes



### CLASS 2 ###

# Set parameters (same as in simulation above)
set.seed(37)
n_samples_tto2 <- 200
n_samples_dce2 <- 100
sigma2 <- 0.5
theta2 <- 2
beta_non2 <- c(1.4, 2.3, -0.2)   # true values for beta


# simulate matrix
x_tto2 <- matrix(rnorm(n_samples_tto2 * length(beta_non2)), ncol = length(beta_non2))
x_dce2 <- matrix(rnorm(n_samples_dce2 * length(beta_non2)), ncol = length(beta_non2))



### NON-classic ###
# x1, x2 and x3 are column names of the dataset; beta1, beta2 and beta3 are the parameters to be estimated
formula2 <- y  ~ (x1 * beta1 + x2 * beta2 ) * (x1 * beta1 + x3 * beta3)
x_tto_non2 <- data.frame(x1 = x_tto2[,1], x2 = x_tto2[,2], x3 = x_tto2[,3])
x_dce_non2 <- data.frame(x1 = x_dce2[,1], x2 = x_dce2[,2], x3 = x_dce2[,3])
beta_non2 <- setNames(beta_non2[1:3], c("beta1","beta2","beta3"))

Xb_tto_non2 <- hyreg2:::eval_formula_non(formula2, x_tto_non2, beta_non2)
Xb_dce_non2 <- (hyreg2:::eval_formula_non(formula2, x_dce_non2, beta_non2)) * theta2


# generate y_non
y_tto_non2 <- rnorm(n_samples_tto2, mean = Xb_tto_non2, sd = sigma2)  # continuous outcomes
logistic_tmp_non2 <- 0.5 + 0.5 * tanh(Xb_dce_non2 / 2)
y_dce_non2 <- rbinom(n_samples_dce2, size = 1, prob = logistic_tmp_non2)  # binary outcomes


simulated_data_norm$y_non <- c(y_tto_non1,y_dce_non1,y_tto_non2,y_dce_non2)

# clean environment
rm(n_samples_tto1,
   n_samples_dce1,
   sigma1,
   theta1,
   beta_non1,
   n_samples_tto2,
   n_samples_dce2,
   sigma2,
   theta2,
   beta_non2,
   logistic_tmp1,
   logistic_tmp2,
   logistic_tmp_non1,
   logistic_tmp_non2,
   x_tto_non1,
   x_tto_non2,
   x_dce_non1,
   x_dce_non2,
   Xb_tto_non1,
   Xb_tto_non2,
   Xb_dce_non1,
   Xb_dce_non2,
   x_dce1,
   x_dce2,
   x_tto1,
   x_tto2,
   y_dce_non1,
   y_dce_non2,
   y_tto_non1,
   y_tto_non2,
   formula1,
   formula2
)

If formula_type_classic = FALSE, the provided input for stv must contain the parameter names (e.g. beta1) and not the variable names (e.g. x1) from the dataset. All functionalities described above (except accounting for heteroscedasticity) can be used with variables-and-parameters formulas as well. If partial coefficients should be used, the provided variables_ vectors must contain the names of the parameters and not the variable names from the dataset.

set.seed(259)
formula <- y_non ~ (x1 * beta1 + x2 * beta2) * (x1 * beta1 + x3 * beta3) | id
k <- 2
stv <- setNames(c(0.2,0,1,1,1),c("beta1","beta2","beta3","sigma","theta"))
control <- list(iter.max = 1000, verbose = 4)

mod_non <- hyreg2(formula = formula,
                     data =  simulated_data_norm,
                     type =  simulated_data_norm$type,
                     stv = stv,
                     k = k,
                     type_cont = "TTO",
                     type_dich = "DCE_A",
                     opt_method = "L-BFGS-B",
                     control = control,
                     latent = "cont",
                     id_col = "id",
                     formula_type_classic = FALSE
)

The output can be displayed using summary_hyreg2 and the classification can be inspected using plot_hyreg2.

summary_hyreg2(mod_non)

p <- plot_hyreg2(data = simulated_data_norm, 
           x ="id",
           y = "y_non",
           id_col = "id",
           class_df_model = give_class(data = simulated_data_norm,
                                 model = mod_non,
                                 id_col = "id"))
p <- p +  ggplot2::labs(title = "Figure 7")
p

Cameron, Trudy Ann, and Michelle D. James. 1987. “Efficient Estimation Methods for "Closed-Ended" Contingent Valuation Surveys.” The Review of Economics and Statistics 69 (2): 269. https://doi.org/10.2307/1927234.
Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. “Maximum Likelihood from Incomplete Data Via the EM Algorithm.” Journal of the Royal Statistical Society Series B: Statistical Methodology 39 (1): 1–22. https://doi.org/10.1111/j.2517-6161.1977.tb01600.x.
Devlin, Nancy, Aureliano Paolo Finch, and David Parkin. 2022. “Guidance to Users of EQ-5D-5L Value Sets.” In, 213–33. Springer International Publishing. https://doi.org/10.1007/978-3-030-89289-0_5.
Grün, Bettina, and Friedrich Leisch. 2008. “FlexMix Version 2: Finite Mixtures with Concomitant Variables and Varying and Constant Parameters.” Journal of Statistical Software 28 (4). https://doi.org/10.18637/jss.v028.i04.
Leisch, Friedrich. 2004a. “FlexMix: A General Framework for Finite Mixture Models and Latent Class Regression in R.” Journal of Statistical Software 11 (8). https://doi.org/10.18637/jss.v011.i08.
———. 2004b. “FlexMix: A General Framework for Finite Mixture Models and Latent Class Regression in R.” Journal of Statistical Software 11 (8). https://doi.org/10.18637/jss.v011.i08.
Ramos-Goñi, Juan M., Jose L. Pinto-Prades, Mark Oppe, Juan M. Cabasés, Pedro Serrano-Aguilar, and Oliver Rivero-Arias. 2017. “Valuation and Modeling of EQ-5D-5L Health States Using a Hybrid Approach.” Medical Care 55 (7): e51–58. https://doi.org/10.1097/mlr.0000000000000283.