| Version: | 0.1-1 | 
| Author: | Giampiero Marra [aut, cre] | 
| Maintainer: | Giampiero Marra <giampiero.marra@ucl.ac.uk> | 
| Title: | Data Sets for Copula Additive Distributional Regression Using R | 
| Description: | Data sets used in the book Marra and Radice (2025, ISBN:9781032973111) "Copula Additive Distributional Regression Using R" to illustrate the fitting of various joint (and univariate) regression models, with several types of covariate effects, in the presence of association between the errors of the equations. | 
| Depends: | R (≥ 3.6.0) | 
| Suggests: | GJRM | 
| License: | GPL-2 | GPL-3 [expanded from: GPL (≥ 2)] | 
| NeedsCompilation: | no | 
| Packaged: | 2025-06-25 16:25:43 UTC; Giampiero | 
| Repository: | CRAN | 
| Date/Publication: | 2025-06-29 10:30:02 UTC | 
AREDS: Age-related Eye Disease Study
Description
Real dataset of bivariate interval- and right-censored data with 628 subjects 
and three covariates. The dataset is a reshaped version of the AREDS data from the CopulaCenR package. The dataset 
was selected from the Age-related Eye Disease Study (AREDS Group, 1999). The two events are the 
progression times (in years) to late-AMD in the left and right eyes.
Usage
data(areds)
Format
areds is a 628 row data frame with the following columns:
- t11, t12
- left and right bounds of the intervals for the left eye. If t12 = NA then the observation is right-censored.
- t21, t22
- left and right bounds of the intervals for the right eye. If t22 = NA then the observation is right-censored.
- SevScore1, SevScore2
- baseline AMD severity scores for left and right eyes, respectively. Possible values are: 4, 5, 6, 7, 8. 
- age
- age at baseline. 
- rs2284665
- a genetic variant covariate highly associated with late-AMD progression. Possible values are: 0, 1, 2. 
- cens1, cens2
- type of censoring for left and right eyes. 
- cens
- joint censoring indicator for left and right eyes. 
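As a minimal sketch of how the interval bounds encode censoring, the toy data frame below flags right-censored eyes through a missing right bound. The values are made up and are not taken from areds, and the helper columns rc1, rc2 and width1 are hypothetical names that do not reproduce the coding of cens1, cens2 or cens.

```r
# Toy rows using the documented column names; all values are made up.
toy <- data.frame(
  t11 = c(2.0, 4.5, 6.0),
  t12 = c(3.5, NA, 8.0),   # NA right bound: left eye is right-censored
  t21 = c(1.0, 5.0, 7.5),
  t22 = c(NA, 6.5, NA)     # NA right bound: right eye is right-censored
)

# Hypothetical helper columns (not part of areds)
toy$rc1 <- is.na(toy$t12)
toy$rc2 <- is.na(toy$t22)

# Interval width for the left eye; NA when the observation is right-censored
toy$width1 <- toy$t12 - toy$t11

toy
```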
Source
Data are from:
AREDS Group (1999), The Age-Related Eye Disease Study (AREDS): design implications. AREDS report no. 1. Controlled Clinical Trials, 20, 573-600.
Blood pressure data in children
Description
Blood pressure data in 11-year-old children. The dataset is a subsample from Solomon-Moore et al. (2020).
Usage
data(bpc)
Format
bpc is a 1052 row data frame with the following columns: 
- sbp
- Systolic Blood Pressure (mmHg). 
- dbp
- Diastolic Blood Pressure (mmHg). 
- gender
- 1 = Male, 2 = Female. 
- bmi
- Body Mass Index. 
- mvpa
- Average minutes of moderate to vigorous physical activity per day. 
- sed
- Average sedentary minutes per day. 
Source
Data are from Solomon-Moore E, Salway R, Emm-Collison L, Thompson JL, Sebire SJ, Lawlor DA, Jago R (PI), 2020.
ACDIS data
Description
Fictitious data designed to closely replicate the characteristics and patterns observed in the Africa Centre Demographic Information System (ACDIS).
Usage
data(cd4)
Format
cd4 is a 2645 row data frame with the following columns: 
- cd4.count
- CD4 count measurements. 
- hiv
- Binary variable indicating whether an individual is HIV positive (hiv = 1) or not (hiv = 0). 
- age
- Age in years. 
- location
- Three levels: PER, RUR, URB. 
- marital
- Six levels: Married, Polygamous, Divorced/Separated/Widowed, Engaged, Never Married, Under Legal Age. 
- water
- If present or not. 
- education
- Four levels: None, Primary, Junior Secondary, Upper Secondary. 
- distance1
- Km to nearest primary school. 
- distance2
- Km to nearest secondary school. 
Source
The data have been produced as described in:
Tanser F. et al. (2007), Cohort Profile: Africa Centre Demographic Information System (ACDIS) and population-based HIV survey. International Journal of Epidemiology, 37(5), 956-962.
Simulated data with two endogenous variables
Description
Simulated data with two endogenous variables and binary outcome.
Usage
data(dataDE)
Format
dataDE is a 2000 row data frame with the following columns: 
- y1
- First endogenous variable. 
- y2
- Second endogenous variable. 
- y3
- Binary outcome. 
- x1, x2
- Covariates. 
- x3
- Covariate influencing only y1.
- x4
- Covariate influencing only y2.
Examples
# Data have been simulated as shown below
n  <- 2000
x1 <- round(runif(n))
x2 <- runif(n)
x3 <- runif(n)
x4 <- rnorm(n)
u  <- rnorm(n)
y1 <- ifelse(-1.55 +               x1 - x2 + x3 + u + rnorm(n) > 0, 1, 0)
y2 <- ifelse(-0.25 -           0.5*x1 + x2 + x4 + u + rnorm(n) > 0, 1, 0)
y3 <- ifelse(-0.75 + 0.5*y1 - y2 + x1 + x2      + u + rnorm(n) > 0, 1, 0)
dataDE <- data.frame(y1, y2, y3, x1, x2, x3, x4)
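The simulation code sets no seed, so rerunning it will not reproduce the packaged dataDE exactly. The sketch below repeats the same steps with an arbitrary seed (an assumption for illustration, not the seed used for the packaged data) and inspects the association between y1 and y2 that the shared term u induces. The object names sim and r12 are hypothetical.

```r
set.seed(1)  # arbitrary seed, assumed for illustration only
n  <- 2000
x1 <- round(runif(n))
x2 <- runif(n)
x3 <- runif(n)
x4 <- rnorm(n)
u  <- rnorm(n)  # error component shared by all three equations

y1 <- ifelse(-1.55 + x1 - x2 + x3 + u + rnorm(n) > 0, 1, 0)
y2 <- ifelse(-0.25 - 0.5*x1 + x2 + x4 + u + rnorm(n) > 0, 1, 0)
y3 <- ifelse(-0.75 + 0.5*y1 - y2 + x1 + x2 + u + rnorm(n) > 0, 1, 0)
sim <- data.frame(y1, y2, y3, x1, x2, x3, x4)

# Cross-tabulation and sample correlation of the two endogenous variables
table(y1 = sim$y1, y2 = sim$y2)
r12 <- cor(sim$y1, sim$y2)
r12
```

Because u enters every equation, the two endogenous variables and the outcome are dependent even after conditioning on the covariates, which is the feature joint models are meant to handle.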
Simulated data with double sample selection
Description
Simulated data with double sample selection and binary outcome.
Usage
data(dataDSS)
Format
dataDSS is a 10000 row data frame with the following columns: 
- y1
- First selection. 
- y2
- Second selection. 
- y3
- Binary outcome. 
- x1, x2
- Covariates. 
- x3
- Covariate influencing only y1.
- x4
- Covariate influencing only y2.
- y3.o
- Original outcome, without missingness. 
Examples
# Data have been simulated as shown below
n  <- 10000
x1 <- round(runif(n))
x2 <- runif(n)
x3 <- runif(n)
x4 <- rnorm(n)
u  <- rnorm(n)
y1 <- ifelse(-1.55 +     x1 - x2 + x3 + u + rnorm(n) > 0, 1, 0)
y2 <- ifelse(-0.25 - 0.5*x1 + x2 + x4 + u + rnorm(n) > 0, 1, 0)
y3 <- y3.o <- ifelse( -0.75 + x1 + x2 + u + rnorm(n) > 0, 1, 0)
y2 <- y2*y1
y3 <- y3*y2
y3 <- ifelse(y2 == 0, NA, y3)
dataDSS <- data.frame(y1, y2, y3, x1, x2, x3, x4, y3.o)
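The double selection mechanism can be checked directly. The sketch below repeats the simulation steps with an arbitrary seed (an assumption, not the seed behind the packaged dataDSS), verifies the missingness pattern, and compares the mean of the observed outcome with the mean of the complete outcome y3.o. The names sim, obs_mean and full_mean are hypothetical.

```r
set.seed(1)  # arbitrary seed, assumed for illustration only
n  <- 10000
x1 <- round(runif(n))
x2 <- runif(n)
x3 <- runif(n)
x4 <- rnorm(n)
u  <- rnorm(n)  # error component shared by all three equations

y1 <- ifelse(-1.55 +     x1 - x2 + x3 + u + rnorm(n) > 0, 1, 0)
y2 <- ifelse(-0.25 - 0.5*x1 + x2 + x4 + u + rnorm(n) > 0, 1, 0)
y3 <- y3.o <- ifelse(-0.75 + x1 + x2 + u + rnorm(n) > 0, 1, 0)
y2 <- y2*y1                      # second selection requires the first
y3 <- y3*y2
y3 <- ifelse(y2 == 0, NA, y3)    # outcome unobserved unless both selections pass
sim <- data.frame(y1, y2, y3, x1, x2, x3, x4, y3.o)

# y3 is missing exactly when the second selection fails
all(is.na(sim$y3) == (sim$y2 == 0))

# y2 can be 1 only when y1 is 1
all(sim$y2[sim$y1 == 0] == 0)

# Mean of the observed outcome versus the complete outcome
obs_mean  <- mean(sim$y3, na.rm = TRUE)
full_mean <- mean(sim$y3.o)
c(observed = obs_mean, complete = full_mean)
```

The two means differ because selection depends on u, which also drives the outcome; an analysis of the observed y3 alone would therefore be biased.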
World Happiness Report Data
Description
Data from the 2019 World Happiness Report, an annual publication of the United Nations Sustainable Development Solutions Network.
Usage
data(happy)
Format
happy is a 155 row data frame with the following columns:
- country
- Country. 
- gdp
- Gross domestic product per capita. 
- support
- Indicator of social support (or having someone to count on in times of trouble) calculated at national level. 
- hle
- Indicator of healthy life expectancies at birth. 
- freedom
- Freedom to make life choices is the national average of responses to the question: Are you satisfied or dissatisfied with your freedom to choose what you do with your life? 
- generosity
- Generosity is the residual of regressing national average of response to the question: Have you donated money to a charity in the past month? on GDP per capita. 
- corruption
- Corruption Perception: The measure is the national average of the survey responses to two questions: Is corruption widespread throughout the government or not? and Is corruption widespread within businesses or not? The overall perception is just the average of the two 0-or-1 responses. 
- score
- Subjective well-being. 1 low, 2 medium low, 3 medium, 4 high. 
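The definitions of generosity and corruption can be illustrated on toy data. The sketch below is not how the packaged columns were computed; it only mirrors the verbal definitions, with made-up values and hypothetical input columns donated, corr_gov and corr_bus.

```r
# Toy country-level data; all values are made up.
toy <- data.frame(
  gdp      = c(7.5, 8.2, 9.1, 9.8, 10.4, 11.0),
  donated  = c(0.20, 0.35, 0.25, 0.40, 0.55, 0.45),  # national average donation response
  corr_gov = c(0.9, 0.8, 0.7, 0.6, 0.3, 0.2),        # national average, government question
  corr_bus = c(0.8, 0.8, 0.6, 0.5, 0.4, 0.2)         # national average, business question
)

# Generosity: residual from regressing the donation average on GDP per capita
toy$generosity <- resid(lm(donated ~ gdp, data = toy))

# Corruption perception: average of the two national averages
toy$corruption <- (toy$corr_gov + toy$corr_bus) / 2

toy[, c("gdp", "generosity", "corruption")]
```

By construction the residuals are uncorrelated with gdp, so generosity measures giving relative to what GDP per capita alone would predict.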
Hiring Incentive Experiment - HIE
Description
Full description available at the web link below.
Usage
data(hie)
Format
hie is a 7734 row data frame with the following columns:
- agree
- Equal to 1 if the individual is in the HIE group and agreed to participate, and 0 if the individual is assigned to the control group or refuses to participate. 
- bonus
- Random allocation variable equal to 1 if the individual/employer was assigned to the hiring incentive experiment group and 0 to the control group. This is the IV. 
- benefit
- Weekly benefit amount + dependents' allowance. 
- unemp.dur
- Weeks of benefits. 
- status
- Equal to 1 if unemp.dur < 26 and 0 otherwise. 
- age
- Age of claimant. 
- gender
- 1 = male and 0 = female. 
- ethnicity
- 1 = black and 0 otherwise. 
- prearn
- Claimant's pre-claim earnings. 
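The relationship between bonus, agree and status can be sketched on toy data. The values below are made up, and the column accepted is a hypothetical helper recording whether an individual would agree to participate if assigned to the HIE group; it is not a column of hie.

```r
# Toy rows; all values are made up.
toy <- data.frame(
  bonus     = c(1, 1, 0, 0, 1),        # random assignment (the IV)
  accepted  = c(1, 0, 1, 0, 1),        # hypothetical willingness to participate
  unemp.dur = c(10, 26, 30, 4, 25.5)   # weeks of benefits
)

# agree is 1 only for individuals assigned to the HIE group who agreed
toy$agree <- toy$bonus * toy$accepted

# status is 1 if unemp.dur < 26 and 0 otherwise
toy$status <- as.integer(toy$unemp.dur < 26)

# No individual in the control group can have agree = 1
table(bonus = toy$bonus, agree = toy$agree)
```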
Source
https://www.upjohn.org/data-tools/employment-research-data-center/illinois-unemployment-incentive-experiments
HIV Zambian data
Description
HIV Zambian data by region, together with polygons describing the regions' shapes.
Usage
data(hiv)
data(hiv.polys)
Format
hiv is a 6416 row data frame with the following columns: 
- consent
- binary variable indicating consent to test for HIV. 
- status
- binary variable indicating whether an individual is HIV positive (status = 1) or not (status = 0). 
- age
- age in years. 
- education
- years of education. 
- wealth
- wealth index. 
- region
- code identifying region, and matching names(hiv.polys). It can take nine possible values: 1 central, 2 copperbelt, 3 eastern, 4 luapula, 5 lusaka, 6 northwestern, 7 northern, 8 southern, 9 western.
- marital
- never married, currently married, formerly married. 
- std
- had a sexually transmitted disease. 
- highhiv
- had high risk sex. 
- partner
- number of partners. 
- condom
- used condom during last intercourse. 
- aidscare
- equal to 1 if would care for an HIV-infected relative. 
- knowsdiedofaids
- equal to 1 if know someone who died of HIV. 
- evertestedHIV
- equal to 1 if previously tested for HIV. 
- smoke
- smoker or not. 
- ethnicity
- bemba, lunda (luapula), lala, ushi, lamba, tonga, luvale, lunda (northwestern), mbunda, kaonde, lozi, chewa, nsenga, ngoni, mambwe, namwanga, tumbuka, other. 
- language
- English, Bemba, Lozi, Nyanja, Tonga, other. 
- interviewerID
- interviewer identifier. 
- agehadsex
- age the individual had sex. 
- religion
- four categories. 
- sw
- survey weights. 
hiv.polys contains the polygons defining the areas in the format described below.
Details
The data frame hiv relates to the regions whose boundaries are coded in hiv.polys.
hiv.polys[[i]] is a 2 column matrix, containing the vertices of the polygons defining the boundary of the ith 
region. names(hiv.polys) matches hiv$region (order unimportant).
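The polygon format described above can be sketched with a toy stand-in. The objects toy.polys and toy.data below are hypothetical and use made-up coordinates and values; the sketch assumes, as an illustration, that region codes are stored as the list names in character form.

```r
# Toy stand-in for hiv.polys: a named list of 2-column vertex matrices.
toy.polys <- list(
  "1" = cbind(c(0, 1, 1, 0), c(0, 0, 1, 1)),
  "2" = cbind(c(1, 2, 2, 1), c(0, 0, 1, 1))
)

# Toy stand-in for hiv with a region code and a binary status
toy.data <- data.frame(region = c(2, 1, 1, 2, 2),
                       status = c(0, 1, 0, 0, 1))

# Each element should be a matrix with two columns of vertex coordinates
all(vapply(toy.polys, function(p) is.matrix(p) && ncol(p) == 2, logical(1)))

# Every region code in the data should have a matching polygon name
all(as.character(toy.data$region) %in% names(toy.polys))

# Region-level proportion with status = 1, reordered to follow the polygon list
prev <- tapply(toy.data$status, toy.data$region, mean)
prev <- prev[names(toy.polys)]
prev
```

Matching is by name rather than by position, which is why the order of the list elements does not matter.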
Source
The data have been produced as described in:
McGovern M.E., Barnighausen T., Marra G. and Radice R. (2015), On the Assumption of Joint Normality in Selection Models: A Copula Approach Applied to Estimating HIV Prevalence. Epidemiology, 26(2), 229-237.
References
Marra G., Radice R., Barnighausen T., Wood S.N. and McGovern M.E. (2017), A Simultaneous Equation Approach to Estimating HIV Prevalence with Non-Ignorable Missing Responses. Journal of the American Statistical Association, 112(518), 484-496.
U.S. hospital data from the state of Virginia
Description
Data on 978 randomly selected patients admitted between January and September 2014 to an over-500-bed medical center (Lewis Gale Medical Center) in the state of Virginia.
Usage
data(hospital)
Format
hospital is a 978 row data frame with the following columns:
- los
- Patient length of hospital stay (in days). 
- died
- In-hospital mortality. 1 dead, 0 alive. 
- age
- Age of the patient. 
- gender
- Either male or female. 
- bmi
- Body mass index. 
- severity
- Subjective assessment of severity level of patient. Value between 1 and 4, with 1 representing the lowest severity level. 
- risk
- Subjective assessment of risk of dying. Value between 1 and 4, with 1 representing the lowest level. 
- sp02
- Oxygen saturation level. 
- sbp
- Systolic blood pressure. 
- dbp
- Diastolic blood pressure. 
- pulse
- Pulse rate. 
- respiratory
- Respiratory rate. 
- avpu
- AVPU score (A: alert, V: responding to voice, P: responding to painful stimuli, U: unresponsive). 
- temp
- Temperature. 
Source
Azadeh-Fard N, Ghaffarzadegan N, Camelio JA (2016), Can a Patient's In-Hospital Length of Stay and Mortality Be Explained by Early-Risk Assessments?, PLoS ONE 11(9): e0162976.
Infant statistic data from North Carolina
Description
Individual-level infant mortality data on 20000 randomly selected births of female babies in the U.S. state of North Carolina, in 2008, together with polygons describing the county shapes.
Usage
data(infants)
data(NC.polys)
Format
infants is a 20000 row data frame with the following columns: 
- county
- Number code identifying North Carolina county in which birth occurred, and matching names(NC.polys). It can take 100 possible values.
- age
- Age of mother. 
- wksgest
- Completed weeks of gestation. 
- marital
- Equal to 1 if married, and 0 otherwise. 
- grams
- Infant's birth weight. 
- lbw
- Equal to 1 if infant's birth weight < 2500 grams, and 0 otherwise. 
- ethnicity
- Four categories of ethnicity: White, Hispanic, Black, Other. 
- educ
- Education of mother: Primary, Secondary, Tertiary. 
- smoke
- Equal to 1 if smoker, and 0 otherwise. 
- firstbirth
- Equal to 1 if it was the mother's first birth, and 0 otherwise. 
- ptb
- Equal to 1 if completed weeks of gestation < 37. 
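The indicators lbw and ptb are thresholded versions of grams and wksgest. A minimal sketch on made-up values (not rows of infants) shows the strict inequalities at the boundary cases of 2500 grams and 37 weeks.

```r
# Toy births; all values are made up.
toy <- data.frame(
  grams   = c(3400, 2499, 2500, 1800),
  wksgest = c(39, 36, 37, 31)
)

# Low birth weight: strictly below 2500 grams
toy$lbw <- as.integer(toy$grams < 2500)

# Preterm birth: strictly fewer than 37 completed weeks of gestation
toy$ptb <- as.integer(toy$wksgest < 37)

table(lbw = toy$lbw, ptb = toy$ptb)
```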
NC.polys contains the polygons defining the areas in the format described below.
Details
The data frame infants relates to the counties whose boundaries are coded in NC.polys.
NC.polys[[i]] is a 2 column matrix, containing the vertices of the polygons defining the boundary of the ith 
county. names(NC.polys) matches infants$county (order unimportant).
Source
The data were compiled by the North Carolina State Center for Health Statistics (https://schs.dph.ncdhhs.gov/).
MEPS: Medical Expenditure Panel Survey (year 2012)
Description
Subsample of the 2012 MEPS data, collected and published by the U.S. Agency for Healthcare Research and Quality.
Usage
data(meps)
Format
meps is a 10638 row data frame with the following columns:
- general
- General health: 1 excellent, 2 very good, 3 good, 4 fair, 5 poor. 
- mental
- Mental health (as above). 
- bmi
- Body mass index. 
- income
- Income. 
- age
- Age. 
- gender
- Male 1, Female 0. 
- ethnicity
- 1 white, 2 black, 3 native american, 4 others. 
- education
- Education in years. 
- region
- 1 Northeast, 2 Midwest, 3 South, 4 West. 
- hypertension
- Equal to 1 if hypertension present and 0 otherwise. 
- hyperlipidemia
- Equal to 1 if hyperlipidemia present and 0 otherwise. 
- dvisit
- Number of doctor (physicians) visits. 
- ndvisit
- Number of non doctor visits (non-physician providers). 
- dvexpend
- Expenditure on doctor visits. 
- ndvexpend
- Expenditure on non doctor visits. 
Source
https://meps.ahrq.gov
Civil war data
Description
Civil war data from Fearon and Laitin (2003).
Usage
data(war)
Format
war is a 6326 row data frame with the following columns:
- onset
- equal to 1 for all country-years in which a civil war started. 
- instab
- equal to 1 if unstable government. 
- oil
- equal to 1 for oil exporter country. 
- cwar
- equal to 1 if the country had a distinct civil war ongoing in the previous year. 
- gdp
- GDP per capita (measured as thousands of 1985 U.S. dollars) lagged one year. 
- ncontig
- equal to 1 for non-contiguous state. 
- nwstate
- equal to 1 for new state. 
- lpop
- log(population size). 
- lmnt
- log(mountainous). 
- ethfrac
- measure of ethnic fractionalization (calculated as the probability that two randomly drawn individuals from a country are not from the same ethnicity). 
- relfrac
- measure of religious fractionalization. 
- poldem
- measure of political democracy (ranges from -10 to 10) lagged one year. 
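The probability interpretation of ethfrac can be made concrete. A common way to compute such an index, assumed here for illustration and not confirmed as the exact procedure behind the packaged column, is one minus the sum of squared group shares, which is the probability that two individuals drawn independently belong to different groups. The function name frac is hypothetical.

```r
# Fractionalization index from a vector of group population shares.
# Assumes the shares are non-negative and sum to 1.
frac <- function(shares) {
  stopifnot(all(shares >= 0), isTRUE(all.equal(sum(shares), 1)))
  1 - sum(shares^2)
}

frac(1)              # a single group: no fractionalization
frac(c(0.5, 0.5))    # two equal groups
frac(rep(0.25, 4))   # four equal groups
frac(c(0.9, 0.1))    # one dominant group
```

The index is 0 for a homogeneous country and approaches 1 as the population splits into many small groups.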
Source
Data are from:
Fearon J.D., Laitin D.D. (2003), Ethnicity, Insurgency, and Civil War. The American Political Science Review, 97, 75-90.