Our R package hyreg2 allows users to estimate models on
a combination of continuous and dichotomous data using a combined
likelihood. Latent classes can be detected using the EM-algorithm(Dempster, Laird, and Rubin 1977). The package
was originally written to address preference heterogeneity or use a
data-driven subgrouping of participants while estimating EQ-5D value
sets(Devlin, Finch, and Parkin 2022).
However, it can be applied to any situation in which a combined
likelihood of continuous and dichotomous data should be maximized,
e.g. as usual in environmental economics (Cameron
and James 1987).
Functions
The code for estimating models is based on the R package
flexmix(Leisch 2004a), which
offers the option to write custom likelihood functions (M-step-drivers)
for use during the M-step[Leisch
(2004b)](Grün and Leisch 2008). For
the hyreg2 package, we wrote M-step-drivers that implement
the hybrid model likelihood as described by Ramos-Goni (Ramos-Goñi et al. 2017). Additional wrapper
functions hyreg2 and hyreg2_het are included
in the package, such that user do not have to input the drivers to
flexmix themselves.
The package supports both linear and non-linear functions while using
uncensored or censored data. Furthermore, heteroscedasticity can be
taken into account for linear models using the hyreg2_het
function. It is also possible to control for different covariates on the
continuous and dichotomous parts of the data. Start values to be used
can differ between the expected latent classes. Additionally,
hyreg2 offers the option to estimate the latent classes
either on both types of data (continuous and dichotomous) or only on one
type. Once a model has been estimated, the functions
summary_hyreg2, give_class, and
plot_hyreg2 can be used to evaluate the results.
When using hyreg2 to estimate latent class models, it is
essential to account for groups of observations. Controlling for such
dependencies is essential to ensure valid parameter estimation and
inference.
To demonstrate the implications of neglecting to control for groups of observations, we first simulate a dataset with two latent classes. Afterwards we estimate a model without controlling for groups and explain where the model fails. At the end of this chapter, we will demonstrate how a good model fit can be achieved.
We demonstrate the basic functionalities using a part of the the
simulated dataset simulated_data_norm, which is included in
the hyreg2 package. The dataset consists of 7 variables,
with 200 continuous and 100 dichotomous observations in each of the two
latent classes, resulting in a total number of 600 observations. The
variables x1, x2 and x3 contain
random draws from a standard normal distribution. Based on these,
outcome values y are generated according to the linear
predictor
\(x_1 * \beta_{x1} +x_2 * \beta_{x2}+ x_3 * \beta_{x3} = Xb\).
Additionally, the variable y_non is computed according
to the non-linear model \(\frac{1}{\exp(x_1 *
\beta_{x1} + x_2 * \beta_{x2})} = Xb_{non}\).
For the continuous data, y and y_non values
were generated directly using the function rnorm with Xb or
Xb_non as the mean. For the dichotomous data, the values of
Xb or Xb_non are multiplied by a scaling
parameter theta and subsequently simulated using the
function rbinom. The code for the simulation can be seen
below.
# simulate dataframe called simulated_data_norm
library(hyreg2)
library(flexmix)
### CLASS 1 ###
# Set parameters
set.seed(42)
n_samples_tto1 <- 200
n_samples_dce1 <- 100
sigma1 <- 1.0
theta1 <- 5
beta1 <- c(0.5, -0.3, 0.8) # true values for beta
# simulate matrix
x_tto1 <- matrix(rnorm(n_samples_tto1 * length(beta1)), ncol = length(beta1))
x_dce1 <- matrix(rnorm(n_samples_dce1 * length(beta1)), ncol = length(beta1))
#Xb (linear predictor following formula xb = x1*beta1 + x2*beta2 + x3*beta3)
Xb_tto1 <- x_tto1 %*% beta1
Xb_dce1 <- (x_dce1 %*% beta1) * theta1
# generate y
y_tto1 <- rnorm(n_samples_tto1, mean = Xb_tto1, sd = sigma1) # continuous outcomes
logistic_tmp1 <- 0.5 + 0.5 * tanh(Xb_dce1 / 2)
y_dce1 <- rbinom(n_samples_dce1, size = 1, prob = logistic_tmp1) # binary outcomes
### dataset ###
data_tto1 <- data.frame(
type = rep("TTO", n_samples_tto1),
y = y_tto1,
x1 = x_tto1[, 1],
x2 = x_tto1[, 2],
x3 = x_tto1[, 3]
)
data_dce1 <- data.frame(
type = rep("DCE_A", n_samples_dce1),
y = y_dce1,
x1 = x_dce1[, 1],
x2 = x_dce1[, 2],
x3 = x_dce1[, 3]
)
simulated_data1 <- rbind(data_tto1, data_dce1)
### CLASS 2 ###
# Set parameters
set.seed(37)
n_samples_tto2 <- 200
n_samples_dce2 <- 100
sigma2 <- 0.5
theta2 <- 2
beta2 <- c(1.4, 2.3, -0.2) # true values for beta
# simulate matrix
x_tto2 <- matrix(rnorm(n_samples_tto2 * length(beta2)), ncol = length(beta2))
x_dce2 <- matrix(rnorm(n_samples_dce2 * length(beta2)), ncol = length(beta2))
# Xb (linear predictor following formula xb = x1*beta1 + x2*beta2 + x3*beta3)
Xb_tto2 <- x_tto2 %*% beta2
Xb_dce2 <- (x_dce2 %*% beta2) * theta2
# generate y
y_tto2 <- rnorm(n_samples_tto2, mean = Xb_tto2, sd = sigma2) # continuous outcomes
logistic_tmp2 <- 0.5 + 0.5 * tanh(Xb_dce2 / 2)
y_dce2 <- rbinom(n_samples_dce2, size = 1, prob = logistic_tmp2) # binary outcomes
### dataset ###
data_tto2 <- data.frame(
type = rep("TTO", n_samples_tto2),
y = y_tto2,
x1 = x_tto2[, 1],
x2 = x_tto2[, 2],
x3 = x_tto2[, 3]
)
data_dce2 <- data.frame(
type = rep("DCE_A", n_samples_dce2),
y = y_dce2,
x1 = x_dce2[, 1],
x2 = x_dce2[, 2],
x3 = x_dce2[, 3]
)
simulated_data2 <- rbind(data_tto2, data_dce2)
### DATASET with two classes ###
simulated_data1$class <- 1
simulated_data2$class <- 2
# set id numerbs
simulated_data1$id <- c(1:100, 1:100, 1:100)
simulated_data2$id <- c(101:200, 101:200, 101:200)
simulated_data_norm<- rbind(simulated_data1,simulated_data2)
# clean environment
rm(list = setdiff(ls(), "simulated_data_norm"))
The column type indicates whether a data point was
generated according to a normal distribution (continuous data) or a
binomial distribution (dichotomous data). The column id
specifies which continuous observations are linked with which
dichotomous observations, and hence is a grouping variable. For this
dataset, each id is associated with two continuous
observations and one dichotomous observation. The column
class contains the true underlying class memberships of the
data. This is visualized in figure 1 using the function
plot_hyreg2.
# visualize true classes from simulated_data_norm
p <- plot_hyreg2(data = simulated_data_norm,
x = "id",
y = "y",
id_col = "id",
class_df_model = simulated_data_norm[,c("id","class")],
colors = c("#F8766D","#00BFC4"))
p <- p + ggplot2::labs(title = "Figure 1")
p
The function plot_hyreg2 expects a dataset
(data) and two column names of this dataset as characters
(x and y). Additionally, the inputs
id_col and class_df_model must be specified.
id_col must be the name of the grouping variable (as a
character string) while class_df_model must be a dataframe
containing two columns: the grouping variable (which must be named
according to id_col) and the associated classes. In this
case, we already know which id is in which class, since we have
simulated the dataset ourselves. In real-world applications, users
should instead use the give_class function to match each id
to the class estimated by the model. This is explained in further detail
below. plot_hyreg2 returns an ggplot-object,
hence additional ggplot specific parameters can be added,
as demonstrated above.
First, we estimate a model without controlling for groups, i.e., we assign latent classes to each observation independently, without accounting for the fact that, in the simulated dataset, two continuous and one dichotomous observation belong together.
Since our goal is to identify two latent classes, we specify
k = 2. As we would like to estimate the mathematical model
\(x_1 * \beta_{x1} +x_2 * \beta_{x2}+ x_3 *
\beta_{x3} = y\), we specify formula in R as
y ~ -1 + x1 + x2 + x3. Starting values must be provided for
all variables in formula, hence for x1, x2 and
x3. In addition, starting values for the scaling parameter
theta and the parameter sigma (from the
continuous data) must be supplied. These values must necessarily be
named “theta” and “sigma”, otherwise the
function will not recognize them.
The above formula does not include an intercept. If the model to be estimated does contain an intercept or interaction terms, start values must be provided for these as well. They must be named according to the column names of model.matrix(formula,data), e.g. “(Intercept)”.
Now, we specify a named vector stv, which contains the
assumed start values for x1, x2, x3 and for
thetaand sigma.Furthermore, in
hyreg2 the input argument type must be
specified. In our dataset, the column indicating which datapoint was
simulated from a normal distribution and which from a binomial
distribution is simply called type with the entries “TTO”
and “DCE_A”. In this first example, we set
latent = “both”, which implies that the latent classes are
assumed to manifest in both types of data.
set.seed(277)
formula <- y ~ -1 + x1 + x2 + x3
k <- 2
stv <- setNames(c(0.2,0,1,1,1),c("x1","x2","x3","sigma","theta"))
control <- list(iter.max = 1000, verbose = 4)
mod <- hyreg2(formula = formula,
data = simulated_data_norm,
type = simulated_data_norm$type,
stv = stv,
k = k,
type_cont = "TTO",
type_dich = "DCE_A",
opt_method = "L-BFGS-B",
control = control,
latent = "both"
)
The function summary_hyreg2 can be used to display the
estimated parameters for both model classes. Please note, that the
outputs for sigma and theta are on a log-scale
and have to be transformed using exp()to get the correct
estimated values.
summary_hyreg2(mod)
## $Comp.1
## Maximum likelihood estimation
##
## Call:
## bbmle::mle2(minuslogl = logLik2, start = stv_new, method = opt_method,
## optimizer = optimizer, lower = -Inf, upper = Inf)
##
## Coefficients:
## Estimate Std. Error z value Pr(z)
## x1 0.492032 0.051579 9.5395 < 2.2e-16 ***
## x2 -0.341833 0.040001 -8.5457 < 2.2e-16 ***
## x3 0.777218 0.068013 11.4275 < 2.2e-16 ***
## sigma 0.107463 0.050454 2.1299 0.03318 *
## theta 3.090415 0.412147 7.4983 6.463e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## -2 log L: 614.2638
##
## $Comp.2
## Maximum likelihood estimation
##
## Call:
## bbmle::mle2(minuslogl = logLik2, start = stv_new, method = opt_method,
## optimizer = optimizer, lower = -Inf, upper = Inf)
##
## Coefficients:
## Estimate Std. Error z value Pr(z)
## x1 1.495192 0.035832 41.7281 < 2.2e-16 ***
## x2 2.437797 0.034263 71.1496 < 2.2e-16 ***
## x3 -0.121824 0.037550 -3.2444 0.001177 **
## sigma -0.643939 0.049627 -12.9756 < 2.2e-16 ***
## theta 0.359478 0.205821 1.7466 0.080715 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## -2 log L: 383.2935
We visualize the estimated values using the function
plot_hyreg2. The argument id_col refers to the
grouping column. If the model was estimated without grouping (as in this
case), the row names/numbers should simply be provided. The input
class_df_model requires a data frame with two columns,
indicating which group (or observation) from the data has been assigned
to which class by the model. Such a data frame can easily be generated
using the function give_class. Again, if the model was
estimated without controlling for groups, the row names of the dataset
should be used as the input for id_col in the function
give_class.
# if model was estimated without grouping variable
simulated_data_norm$observation <- rownames(simulated_data_norm)
p <- plot_hyreg2(data = simulated_data_norm,
x = "id",
y = "y",
id_col = "observation",
class_df_model = give_class(data = simulated_data_norm,
model = mod,
id_col = "observation"))
p <- p + ggplot2::labs(title = "Figure 2")
p
This plot shows both types of observations (continuous and dichotomous). There seem to be two lines of mostly blue (class 2) datapoints. These are the datapoints corresponding to the dichotomous data. It seems that the model (erroneously!) classified the data by datatypes, rather than by latent classes.
Using plot_hyreg2, it is also possible to display only
the continuous data or only the dichotomous data. For this purpose, the
argument type_to_plot is used. This argument expects a
list with two elements: the first element contains the name
of the column used to differentiate between the data types (as a
character string), and the second element specifies the value associated
with the desired type of data to be included in the plot. This works
analogously to type and type_cont /
type_dich in the hyreg2 input.
# only continuous datapoints
p <- plot_hyreg2(data = simulated_data_norm,
x = "id",
y = "y",
id_col = "observation",
class_df_model = give_class(data = simulated_data_norm,
model = mod,
id_col = "observation"),
type_to_plot = list("type","TTO"))
p <- p + ggplot2::labs(title = "Figure 3")
p
# only dichotomous datapoints
p <- plot_hyreg2(data = simulated_data_norm,
x = "id",
y = "y",
id_col = "observation",
class_df_model = give_class(data = simulated_data_norm,
model = mod,
id_col = "observation"),
type_to_plot = list("type","DCE_A"))
p <- p + ggplot2::labs(title = "Figure 4")
p
Figure 3 shows only the continuous datapoints while figure 4 shows only the dichotomous ones. It can be seen that the dichotomous observations of the lower id numbers were often classified to class 2 while the continuous observations of the lower id numbers often were classified to class 1. Additionally, the visualizations suggest that a considerable proportion of the dichotomous data is concentrated in class 2, whereas the majority of continuous data are assigned to class 1. Even though the estimates of the model parameters for class 1 come quite close to the true parameters, the classification overall is far away from the true underlying classes, since only 58 % of the datapoints were classified correctly. Moreover, it can be useful to have a look at the distribution of data types across the two latent classes.
# ratio of correct classified datapoints
(sum(mod@cluster == simulated_data_norm$class))/dim(simulated_data_norm)[1]
## [1] 0.58
# cross table of observation types and assigned classes
proof <- merge(simulated_data_norm,
give_class(simulated_data_norm,mod,"observation"), by = "observation")
table(proof$type,proof$mod_comp)
##
## 1 2
## DCE_A 41 159
## TTO 247 153
The model classifies more than 75% of the dichotomous data into class 2 and less than 25% into class 1. By contrast, more than 60% of the continuous data points are assigned to class 1. Overall, only 14% of the observations in class 1 are dichotomous, whereas in class 2 their proportion exceeds 50%.
All in all, this shows that it is necessary to control for groups of observations during the estimation process. Otherwise, the model tends to differentiate between dichotomous and continuous data. However, this represents an undesired behavior, as we would expect both classes to exhibit the same proportion of continuous and dichotomous data.
Including a grouping variable in the formula ensures that the class
probabilities in the E-step of the EM-algorithm (via
flexmix) are estimated per group rather than per
observation. This can be done by adding
"| group_variable_name" to formula. Further
details can be found in Gruen and Leisch. This specification has no
impact on the likelihood function used in the M-step. The model
estimates change solely because groups, rather than each observation,
are classified.
set.seed(227)
formula <- y ~ -1 + x1 + x2 + x3 | id
k <- 2
stv <- setNames(c(0.2,0,1,1,1),c("x1","x2","x3","sigma","theta"))
control <- list(iter.max = 1000, verbose = 4)
mod_id <- hyreg2(formula = formula,
data = simulated_data_norm,
type = simulated_data_norm$type,
stv = stv,
k = k,
type_cont = "TTO",
type_dich = "DCE_A",
opt_method = "L-BFGS-B",
control = control,
latent = "both"
)
summary_hyreg2(mod_id)
## $Comp.1
## Maximum likelihood estimation
##
## Call:
## bbmle::mle2(minuslogl = logLik2, start = stv_new, method = opt_method,
## optimizer = optimizer, lower = -Inf, upper = Inf)
##
## Coefficients:
## Estimate Std. Error z value Pr(z)
## x1 0.824467 0.084435 9.7645 < 2.2e-16 ***
## x2 0.924986 0.084452 10.9529 < 2.2e-16 ***
## x3 0.619043 0.078750 7.8608 3.816e-15 ***
## sigma 0.521685 0.040305 12.9435 < 2.2e-16 ***
## theta 1.261959 0.251067 5.0264 4.998e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## -2 log L: 1241.915
##
## $Comp.2
## Maximum likelihood estimation
##
## Call:
## bbmle::mle2(minuslogl = logLik2, start = stv_new, method = opt_method,
## optimizer = optimizer, lower = -Inf, upper = Inf)
##
## Coefficients:
## Estimate Std. Error z value Pr(z)
## x1 1.550978 0.061598 25.1788 < 2.2e-16 ***
## x2 2.433191 0.062067 39.2026 < 2.2e-16 ***
## x3 -0.095873 0.059479 -1.6119 0.107
## sigma -0.506043 0.074209 -6.8191 9.160e-12 ***
## theta -1.637745 0.390157 -4.1977 2.697e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## -2 log L: 325.574
The model estimates look worse than in the model without controlling for groups. Nevertheless, we will examine the classification achieved by the model in the following plot.
p <- plot_hyreg2(data = simulated_data_norm,
x ="id",
y = "y",
id_col ="id",
class_df_model = unique(give_class(data = simulated_data_norm,
model = mod_id,
id_col = "id")))
p <- p + ggplot2::labs(title = "Figure 5")
p
It can be seen that most of the dichotomous and continuous data points with lower id numbers were assigned to class 1. For higher id numbers, an assignment to class 2 would be expected. Although this assignment occurred in many cases, several data points with higher id numbers were nevertheless incorrectly classified to class 1. An evaluation of the classification indicates that 74% of the data points were correctly assigned, representing a clear improvement compared to the model without control for grouping. An examination of the distribution further reveals that in both classes, approximately 33% of the data are dichotomous and about 67% are continuous. This pattern corresponds to the data-generating process used in the simulation. However, the resulting classes differ in total size. Overall, too many datapoints have been assigned to class 1.
# ratio of correct classified datapoints
(sum(mod_id@cluster == simulated_data_norm$class))/dim(simulated_data_norm)[1]
## [1] 0.74
# cross table of observation types and assigned classes
proof <- merge(simulated_data_norm,
unique(give_class(simulated_data_norm,mod_id,"id")), by = "id")
table(proof$type,proof$mod_comp)
##
## 1 2
## DCE_A 134 66
## TTO 268 132
In the dichotomous data, only the values 0 and 1 can be distinguished and the simulated dataset contains only one dichotomous datapoint for each id number. Hence, no patterns can be detected based on these dataponints. Therefore, it can be concluded, that the data can primarily be distinguished based on the continuous data points. Lower id numbers in general exhibit less variability compared to higher id numbers.
This outcome is consistent with our expectations, as the data were simulated accordingly: The latent classes were determine only by the continuous data, where the beta coefficients were explicitly defined. The subsequent combination of these beta values with theta then produced the dichotomous data points.
If it is assumed (as in this simulation) that the latent classes are
determined only by one of the two data types, the input variable
latent allows this to be taken into account during
estimation. It is possible to estimate latent classes only on continuous
data (latent = “cont”) or only on dichotomous data
(latent = “dich”). If one of these options is chosen, it is
not possible to estimate models without controlling for groups.
Additionally, the input variable id_col must be specified;
it gives the name of the grouping variable as character. Latent classes
are then determined in a first step based only on one type of data. In a
second step, these classes are used to estimate the model parameters on
both types of data.
Here, we want to account for preference heterogeneity in the
continuous data part, and hence we set latent = “cont”.
set.seed(227)
formula <- y ~ -1 + x1 + x2 + x3 | id
k <- 2
stv <- setNames(c(0.2,0,1,1,1),c("x1","x2","x3","sigma","theta"))
control <- list(iter.max = 1000, verbose = 4)
mod_cont <- hyreg2(formula = formula,
data = simulated_data_norm,
type = simulated_data_norm$type,
stv = stv,
k = k,
type_cont = "TTO",
type_dich = "DCE_A",
opt_method = "L-BFGS-B",
control = control,
latent = "cont",
id_col = "id"
)
The output object of hyreg2 in this case is a list of
length k+1 (k models plus one data frame called “id_classes”, which
contains the estimated group memberships from the first estimation
step). The functions summary_hyreg2 and
plot_hyreg2 can still be applied without restrictions, as
they automatically recognize whether the input is a single model or a
list of models.
summary_hyreg2(mod_cont)
## [[1]]
## [[1]]$Comp.1
## Maximum likelihood estimation
##
## Call:
## bbmle::mle2(minuslogl = logLik2, start = stv_new, method = opt_method,
## optimizer = optimizer, lower = -Inf, upper = Inf)
##
## Coefficients:
## Estimate Std. Error z value Pr(z)
## x1 0.526295 0.070000 7.5185 5.542e-14 ***
## x2 -0.191621 0.066908 -2.8640 0.004184 **
## x3 0.815761 0.069974 11.6580 < 2.2e-16 ***
## sigma 0.090409 0.050060 1.8060 0.070918 .
## theta 1.130938 0.215029 5.2595 1.445e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## -2 log L: 674.4513
##
##
## [[2]]
## [[2]]$Comp.1
## Maximum likelihood estimation
##
## Call:
## bbmle::mle2(minuslogl = logLik2, start = stv_new, method = opt_method,
## optimizer = optimizer, lower = -Inf, upper = Inf)
##
## Coefficients:
## Estimate Std. Error z value Pr(z)
## x1 1.494318 0.033104 45.1401 < 2.2e-16 ***
## x2 2.451595 0.031989 76.6394 < 2.2e-16 ***
## x3 -0.120248 0.035521 -3.3852 0.0007112 ***
## sigma -0.723616 0.050015 -14.4680 < 2.2e-16 ***
## theta 0.384434 0.207533 1.8524 0.0639684 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## -2 log L: 346.0418
The estimates of this model nearly capture the true parameter values,
especially for x1 in both classes. x2 and
x3 are less precise, although the results are still
acceptable. Again, we have a look at the overall classification.
p <- plot_hyreg2(data = simulated_data_norm,
x ="id",
y = "y",
id_col ="id",
class_df_model = give_class(data = simulated_data_norm,
model = mod_cont,
id_col = "id"))
p <- p + ggplot2::labs(title = "Figure 6")
p
# ratio of correct classified datapoints
proof <- merge(unique(simulated_data_norm[,c("id","class")]),mod_cont[["id_classes"]], by = "id")
sum((proof$class == proof$mod_comp)/dim(proof)[1])
## [1] 0.95
# cross table of observation types and assigned classes
proof <- merge(simulated_data_norm,
mod_cont[[k+1]], by = "id")
table(proof$type,proof$mod_comp)
##
## 1 2
## DCE_A 100 100
## TTO 200 200
Assessing the ratio of correct classification reveals that 95 % of the observations are now classified to the correct class. In addition, the data are evenly distributed across classes and data types, as it was generated during the simulation.
Hence, using the input parameter latent = “cont” under
control for the grouping variable id leads to very good results, since
the true underlying classes could be identified and the estimated
parameters come close to the true beta values from the data
simulation.
In this section, we present the additional functionalities of the package. We demonstrate how different types of input must be specified in order to estimate well-performing models. At this point, we refrain from providing detailed interpretations of the obtained results.
The function hyreg2 can be applied to censored data by
specifying the parameters lower and/or upper.
We simulate a new column y_cens in our dataset
simulated_data_norm with censoring at y = 3. This censoring
can be incorporated in the model estimation by setting the parameter
upper = 3. Equivalent parameter lower can be
set to a number, where data have lower bounds.
# upper bound for simulated_data_norm$y at 3
# new variable y_cens
for(i in 1:length(simulated_data_norm$y)){
if(simulated_data_norm$y[i] <= 3){
simulated_data_norm$y_cens[i] <- simulated_data_norm$y[i]
}else{
simulated_data_norm$y_cens[i] <- 3
}
}
set.seed(227)
formula <- y_cens ~ -1 + x1 + x2 + x3 | id
k <- 2
stv <- setNames(c(0.2,0,1,1,1),c("x1","x2","x3","sigma","theta"))
control <- list(iter.max = 1000, verbose = 4)
mod_cens <- hyreg2(formula = formula,
data = simulated_data_norm,
type = simulated_data_norm$type,
stv = stv,
k = k,
type_cont = "TTO",
type_dich = "DCE_A",
opt_method = "L-BFGS-B",
control = control,
upper = 3,
latent = "cont",
id_col = "id"
)
We can inspect the results using summary_hyreg2 and
plot_hyreg2.
summary_hyreg2(mod_cens)
plot_hyreg2(data = simulated_data_norm,
x ="id",
y = "y_cens",
id_col = "id",
class_df_model = give_class(data = simulated_data_norm,
model = mod_cens,
id_col = "id"))
In some situations, it may be useful to specify different starting
values for the different latent classes. This can potentially shorten or
improve the estimation process, for instance, when local rather than
global maxima are found during maximum likelihood estimation. For this
purpose, the input parameter stv can simply be defined as a
list consisting of k named vectors.
set.seed(227)
formula <- y ~ -1 + x1 + x2 + x3 | id
k <- 2
stv1 <- setNames(c(0.2,0,1,1,1),c("x1","x2","x3","sigma","theta"))
stv2 <- setNames(c(1,2,0.5,1,1),c("x1","x2","x3","sigma","theta"))
control <- list(iter.max = 1000, verbose = 4)
mod_stvl<- hyreg2(formula = formula,
data = simulated_data_norm,
type = simulated_data_norm$type,
stv = list(stv1,stv2),
k = k,
type_cont = "TTO",
type_dich = "DCE_A",
opt_method = "L-BFGS-B",
control = control,
latent = "cont",
id_col = "id"
)
summary_hyreg2(mod_stvl)
Function get_stv
In some cases, it can be useful to use the estimated values from a
simpler model as starting values for a subsequent model. For this
purpose, the function get_stv can be used. Its input must
be a previously estimated model. If this model was estimated with
latent = "both", it is sufficient to provide the model name
as input. Additionally, the parameter comp can be used to
specify from which class the parameters should be extracted, for
example: get_stv(mod = mod_id, comp = "Comp.1"). By
default, the first class/component is used. If the model to be used was
estimated with latent = "cont" or
latent = "dich", the output of hyreg2is a
list. In this case, one of the k models from the output must be
explicitly selected as input for get_stv, for example:
get_stv(mod = mod_cont[[1]]). Additionally, in this case,
the input parameter comp must always be
"Comp.1", since the model component to be used is selected
by specifying the list element.
Accounting for heteroscedasticity
The hyreg2 package includes an additional function and
an M-step driver that allow controlling for heteroscedasticity by
estimating a separate model for sigma. The corresponding function is
called hyreg2_het. This function supports all
functionalities of hyreg2 (except
formula_type_classic = FALSE) and can therefore be used in
an equivalent manner. To account for heteroscedasticity in the data, an
additional formula called formula_sigma and an additional
vector of starting values for this formula (stv_sigma) can
be specified. The provided formula_sigma must be linear and
the vector stv_sigma must contain start values for all
parameters used in formula_sigma. It is important to note
that, when using hyreg2_het, neither stv nor
stv_sigma are allowed to include a start value called
"sigma", because sigma is estimated with its own formula
(in contrast to hyreg2, where "sigma" must
always be specified in stv).
set.seed(227)
formula <- y ~ -1 + x1 + x2 + x3 | id
formula_sigma <- y ~ x2 + x3
k <- 2
stv1 <- get_stv(mod_cont[[1]])[-4] # -4 since "sigma" is on fourth place in return vector
stv2 <- get_stv(mod_cont[[2]])[-4] # -4 since "sigma" is on fourth place in return vector
stv_sigma <- setNames(rep(0.1,3), c("x2","x3","(Intercept)"))
control <- list(iter.max = 1000, verbose = 4)
mod_het <- hyreg2_het(formula = formula,
formula_sigma = formula_sigma,
data = simulated_data_norm,
type = simulated_data_norm$type,
stv = list(stv1,stv2),
stv_sigma = stv_sigma,
k = k,
type_cont = "TTO",
type_dich = "DCE_A",
opt_method = "L-BFGS-B",
control = control,
latent = "cont",
id_col = "id"
)
summary_hyreg2(mod_het)
## [[1]]
## [[1]]$Comp.1
## Maximum likelihood estimation
##
## Call:
## bbmle::mle2(minuslogl = logLik2, start = stv_new, method = opt_method,
## optimizer = optimizer, lower = lower, upper = upper)
##
## Coefficients:
## Estimate Std. Error z value Pr(z)
## x1 0.526189 0.070770 7.4352 1.044e-13 ***
## x2 -0.194231 0.066993 -2.8992 0.003741 **
## x3 0.817540 0.070246 11.6382 < 2.2e-16 ***
## theta 1.128744 0.215195 5.2452 1.561e-07 ***
## x2_h -0.015117 0.057640 -0.2623 0.793117
## x3_h 0.021948 0.045769 0.4795 0.631558
## (Intercept)_h 0.091056 0.050145 1.8158 0.069394 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## -2 log L: 674.1671
##
##
## [[2]]
## [[2]]$Comp.1
## Maximum likelihood estimation
##
## Call:
## bbmle::mle2(minuslogl = logLik2, start = stv_new, method = opt_method,
## optimizer = optimizer, lower = lower, upper = upper)
##
## Coefficients:
## Estimate Std. Error z value Pr(z)
## x1 1.4937706 0.0337419 44.2704 < 2.2e-16 ***
## x2 2.4493005 0.0334537 73.2146 < 2.2e-16 ***
## x3 -0.1201780 0.0356134 -3.3745 0.0007395 ***
## theta 0.3853762 0.2075727 1.8566 0.0633704 .
## x2_h -0.0108074 0.0481854 -0.2243 0.8225327
## x3_h 0.0014345 0.0539905 0.0266 0.9788031
## (Intercept)_h -0.7236113 0.0500182 -14.4670 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## -2 log L: 345.9915
The estimates for sigma can be identified in the model output by the
ending "_h".
It is possible to specify partial coefficients, which are used only on continuous or only on dichotomous data.
Example:
Suppose different models should be specified for continuous and dichotomous data. For example:
Continuous data: y ~ x1 + x3
Dichotomous data: y ~ x1 + x2
The formula input to hyreg2 must then
include all parameters that occur in either (i.e., in at least one)
model:
y ~ x1 + x2 + x3
The assignment of parameters to data types is then achieved via the input arguments:
variables_both = “x1”,
variables_cont = “x3” and
variables_dich = “x2”.
Every variable included in the provided formula (except
the grouping variable ) must appear in exactly one of these vectors. One
of the variables_ vectors can also be NULL, if no variables
should be used exclusively on this type of the data.
set.seed(227)
formula <- y ~ -1 + x1 + x2 + x3 | id
k <- 2
stv <- setNames(c(0.2,0,1,1,1),c(colnames(simulated_data_norm)[3:5],c("sigma","theta")))
control <- list(iter.max = 1000, verbose = 4)
mod_partialc <- hyreg2(formula = formula,
data = simulated_data_norm,
type = simulated_data_norm$type,
stv = stv,
k = k,
type_cont = "TTO",
type_dich = "DCE_A",
opt_method = "L-BFGS-B",
control = control,
latent = "cont",
id_col = "id",
variables_both = "x1",
variables_cont = "x3",
variables_dich = "x2"
)
summary_hyreg2(mod_partialc)
All inputs and functionalities that have been considered so far are
used with classic R formulas and the default setting (TRUE)
for the input formula_type_classic. However, with this
setting, it is not possible to estimate non-linear formulas or the
8-parameter model for EQ-5D data(Devlin, Finch,
and Parkin 2022) . Therefore, in hyreg2, it is
possible to specify model formulas with variables and parameters
(non-classic formulas). For example, if a model should be estimated for
the mathematical equation \(\frac{1}{e^{x_1 *
\beta_{1} + x_2 * \beta_{2}}} = y\), this must be specified as a
formula with additional parameters provided, which are not column names
in the dataset:
formula = y ~ 1/exp(x1 * beta1 + x2 * beta2), where the
betas represent the parameters to be estimated and
thexs are column names from the dataset. Additionally,
formula_type_classic must be set to FALSE.
When estimating an intercept, the formula must explicitly include a
parameter named "INTERCEPT"(without a corresponding
variable from the dataset),
e.g. formula = y ~ 1/exp(INTERCEPT + x1 * beta1 + x2 * beta2)
We simulate a new column called y_non for the dataset
simulated_data_norm and show how to estimate a model using non-classic
formulas.
.
# simulate vector y_non and add this to simulated_data_norm
#library(hyreg2)
### CLASS 1 ###
# Set parameters (same as in simulation above)
set.seed(42)
n_samples_tto1 <- 200
n_samples_dce1 <- 100
sigma1 <- 1.0
theta1 <- 5
beta_non1 <- c(0.5, -0.3, 0.8) # true values for beta
# simulate matrix
x_tto1 <- matrix(rnorm(n_samples_tto1 * length(beta_non1)), ncol = length(beta_non1))
x_dce1 <- matrix(rnorm(n_samples_dce1 * length(beta_non1)), ncol = length(beta_non1))
### NON-classic ###
# x1,x2 and x3 are columnnames of the dataset, beta1, beta2 and beta3 are the parameters to be estimated
formula1 <- y ~ (x1 * beta1 + x2 * beta2 ) * (x1 *beta1 + x3 * beta3)
x_tto_non1 <- data.frame(x1 = x_tto1[,1], x2 = x_tto1[,2], x3 = x_tto1[,3])
x_dce_non1 <- data.frame(x1 = x_dce1[,1], x2 = x_dce1[,2], x3 = x_dce1[,3])
beta_non1 <- setNames(beta_non1[1:3], c("beta1","beta2","beta3"))
Xb_tto_non1 <- hyreg2:::eval_formula_non(formula1, x_tto_non1, beta_non1)
Xb_dce_non1 <- (hyreg2:::eval_formula_non(formula1, x_dce_non1, beta_non1)) * theta1
# generate y_non
y_tto_non1 <- rnorm(n_samples_tto1, mean = Xb_tto_non1, sd = sigma1) # continuous outcomes
logistic_tmp_non1 <- 0.5 + 0.5 * tanh(Xb_dce_non1 / 2)
y_dce_non1 <- rbinom(n_samples_dce1, size = 1, prob = logistic_tmp_non1) # binary outcomes
### CLASS 2 ###
# Set parameters (same as in simulation above)
set.seed(37)
n_samples_tto2 <- 200
n_samples_dce2 <- 100
sigma2 <- 0.5
theta2 <- 2
beta_non2 <- c(1.4, 2.3, -0.2) # true values for beta
# simulate matrix
x_tto2 <- matrix(rnorm(n_samples_tto2 * length(beta_non2)), ncol = length(beta_non2))
x_dce2 <- matrix(rnorm(n_samples_dce2 * length(beta_non2)), ncol = length(beta_non2))
### NON-classic ###
# x1,x2 and x3 are columnnames of the dataset, beta1, beta2 and beta3 are the parameters to be estimated
formula2 <- y ~ (x1 * beta1 + x2 * beta2 ) * (x1 * beta1 + x3 * beta3)
x_tto_non2 <- data.frame(x1 = x_tto2[,1], x2 = x_tto2[,2], x3 = x_tto2[,3])
x_dce_non2 <- data.frame(x1 = x_dce2[,1], x2 = x_dce2[,2], x3 = x_dce2[,3])
beta_non2 <- setNames(beta_non2[1:3], c("beta1","beta2","beta3"))
Xb_tto_non2 <- hyreg2:::eval_formula_non(formula2, x_tto_non2, beta_non2)
Xb_dce_non2 <- (hyreg2:::eval_formula_non(formula2, x_dce_non2, beta_non2)) * theta2
# generate y_non
y_tto_non2 <- rnorm(n_samples_tto2, mean = Xb_tto_non2, sd = sigma2) # continuous outcomes
logistic_tmp_non2 <- 0.5 + 0.5 * tanh(Xb_dce_non2 / 2)
y_dce_non2 <- rbinom(n_samples_dce2, size = 1, prob = logistic_tmp_non2) # binary outcomes
simulated_data_norm$y_non <- c(y_tto_non1,y_dce_non1,y_tto_non2,y_dce_non2)
# clean environment
rm(n_samples_tto1,
n_samples_dce1,
sigma1,
theta1,
beta_non1,
n_samples_tto2,
n_samples_dce2,
sigma2,
theta2,
beta_non2,
logistic_tmp1,
logistic_tmp2,
logistic_tmp_non1,
logistic_tmp_non2,
x_tto_non1,
x_tto_non2,
x_dce_non1,
x_dce_non2,
Xb_tto_non1,
Xb_tto_non2,
Xb_dce_non1,
Xb_dce_non2,
x_dce1,
x_dce2,
x_tto1,
x_tto2,
y_dce_non1,
y_dce_non2,
y_tto_non1,
y_tto_non2,
formula1,
formula2
)
If formula_type_classic = FALSE, the provided input for
stv must contain the parameter names (e.g. beta1) and not
the variable names (e.g. x1) from the dataset. All functionalities
described above (expect accounting for heteroscedasticity) can be used
with variables-and-parameters-formulas as well. If partial coefficients
should be used, the provided variables_vectors must contain the names of
the parameters and not the variable names from the dataset.
set.seed(259)
formula <- y_non ~ (x1 * beta1 + x2 * beta2 ) * (x1 * beta1 + x3 * beta3)| id
k <- 2
stv <- setNames(c(0.2,0,1,1,1),c("beta1","beta2","beta3","sigma","theta"))
control <- list(iter.max = 1000, verbose = 4)
mod_non <- hyreg2(formula = formula,
data = simulated_data_norm,
type = simulated_data_norm$type,
stv = stv,
k = k,
type_cont = "TTO",
type_dich = "DCE_A",
opt_method = "L-BFGS-B",
control = control,
latent = "cont",
id_col = "id",
formula_type_classic = FALSE
)
The output can be displayed using summary_hyreg2 and the
classification can be inspected using plot_hyreg2.
summary_hyreg2(mod_non)
p <- plot_hyreg2(data = simulated_data_norm,
x ="id",
y = "y_non",
id_col = "id",
class_df_model = give_class(data = simulated_data_norm,
model = mod_non,
id_col = "id"))
p <- p + ggplot2::labs(title = "Figure 7")
p