| Type: | Package | 
| Title: | A Curated Collection of 'Causal Inference' Datasets and Tools | 
| Version: | 0.1.0 | 
| Maintainer: | Tomás Valderrama <tomasvm2004@gmail.com> | 
| Description: | Provides a comprehensive set of datasets and tools for 'causal inference' research. The package includes data from clinical trials, cancer studies, epidemiological surveys, environmental exposures, and health-related observational studies. Designed to facilitate causal analysis, risk assessment, and advanced statistical modeling, it leverages datasets from packages such as 'causalOT', 'survival', 'causalPAF', 'evident', 'melt', and 'sanon'. The package is inspired by the foundational work of Pearl (2009) <doi:10.1017/CBO9780511803161> on causal inference frameworks. | 
| License: | GPL-3 | 
| URL: | https://github.com/Toby-codigos/ForCausality, https://toby-codigos.github.io/ForCausality/ | 
| BugReports: | https://github.com/Toby-codigos/ForCausality/issues | 
| Encoding: | UTF-8 | 
| LazyData: | true | 
| Suggests: | ggplot2, dplyr, testthat (≥ 3.0.0), knitr, rmarkdown | 
| RoxygenNote: | 7.3.3 | 
| Config/testthat/edition: | 3 | 
| VignetteBuilder: | knitr | 
| NeedsCompilation: | no | 
| Packaged: | 2025-10-22 02:18:33 UTC; tomis | 
| Author: | Tomás Valderrama [aut, cre] | 
| Depends: | R (≥ 3.5.0) | 
| Repository: | CRAN | 
| Date/Publication: | 2025-10-25 12:40:22 UTC | 
ForCausality: A Curated Collection of Causal Inference Datasets and Tools
Description
Provides a comprehensive set of datasets and tools for causal inference research. The package includes data from clinical trials, cancer studies, epidemiological surveys, environmental exposures, and health-related observational studies.
Author(s)
Maintainer: Tomás Valderrama <tomasvm2004@gmail.com>
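See Also
Useful links:
- https://github.com/Toby-codigos/ForCausality
- https://toby-codigos.github.io/ForCausality/
- Report bugs at https://github.com/Toby-codigos/ForCausality/issues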
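Examples
A minimal sketch of how the bundled datasets might be listed once the package is installed. The helper name list_datasets is hypothetical and not part of ForCausality; it relies only on base R's data(package = ) and is skipped when the package is absent.

```r
# Hypothetical helper (not part of ForCausality): names of the datasets
# shipped by an installed package.
list_datasets <- function(pkg) {
  # data(package = ) returns an object whose `results` matrix has an
  # "Item" column holding the dataset names.
  unname(data(package = pkg)$results[, "Item"])
}

# Only runs when ForCausality is installed.
if (requireNamespace("ForCausality", quietly = TRUE)) {
  print(list_datasets("ForCausality"))
}
```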
Benzene Exposure and Chromosome Damage Data
Description
This dataset, Benzene_df, is a data frame containing indicators of chromosome damage related to benzene exposure, alcohol consumption, and smoking habits. The dataset consists of 78 observations and 5 variables, including age, exposure, and lifestyle factors. Some observations may contain missing values.
Usage
data(Benzene_df)
Format
A data frame with 78 observations and 5 variables:
- age: Age of the subject (integer)
- exposure: Benzene exposure indicator (integer)
- alcohol: Alcohol consumption indicator (integer)
- smoking: Smoking indicator (numeric)
- totalplus: Chromosome damage measure (numeric)
Details
The dataset name has been kept as 'Benzene_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Source
Data taken from the evident package version 1.0.4
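Examples
One way the documented columns could be used is a simple regression adjustment of the damage measure on exposure. This is an illustrative sketch only, not a method provided by the package; the helper name fit_benzene is hypothetical, and a linear model is assumed purely for demonstration.

```r
# Hypothetical helper (not part of ForCausality): linear regression of the
# chromosome damage measure on exposure, adjusting for the other documented
# columns. Rows with missing values are dropped by lm()'s default na.action.
fit_benzene <- function(df) {
  lm(totalplus ~ exposure + age + alcohol + smoking, data = df)
}

# Only runs when ForCausality is installed.
if (requireNamespace("ForCausality", quietly = TRUE)) {
  data("Benzene_df", package = "ForCausality")
  print(coef(fit_benzene(Benzene_df)))
}
```

The exposure coefficient is an adjusted association; whether it supports a causal reading depends on assumptions that the data alone cannot confirm.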
Clothianidin Concentration in Maize Plants
Description
This dataset, Cloth_df, is a data frame containing measurements of clothianidin concentration in maize plants under different treatments. The dataset consists of 102 observations and 3 variables, including block identifiers, treatment types, and measured concentrations. Some observations may contain missing values.
Usage
data(Cloth_df)
Format
A data frame with 102 observations and 3 variables:
- blk: Block identifier (factor)
- trt: Treatment type (factor)
- clo: Clothianidin concentration (numeric)
Details
The dataset name has been kept as 'Cloth_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Source
Data taken from the melt package version 1.11.4
Chemotherapy Data for Stage B/C Colon Cancer
Description
This dataset, Colon_df, contains data from a clinical trial of chemotherapy for patients with Stage B/C colon cancer. The dataset includes 1,858 observations and 16 variables, providing information on patient demographics, treatment assignment, disease characteristics, and outcomes. Some observations contain missing values.
Usage
data(Colon_df)
Format
A data frame with 1,858 observations and 16 variables:
- id: Patient identifier (numeric)
- study: Study number (numeric)
- rx: Treatment group (factor)
- sex: Sex of the patient (numeric)
- age: Age of the patient in years (numeric)
- obstruct: Obstruction present (numeric indicator)
- perfor: Perforation present (numeric indicator)
- adhere: Adherence to adjacent structures (numeric indicator)
- nodes: Number of lymph nodes with cancer (numeric)
- status: Patient status (numeric indicator)
- differ: Tumor differentiation (numeric)
- extent: Extent of local spread (numeric)
- surg: Surgical procedure performed (numeric indicator)
- node4: At least 4 nodes positive (numeric indicator)
- time: Follow-up time in days (numeric)
- etype: Type of event (numeric indicator)
Details
The dataset name has been kept as 'Colon_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Source
Data taken from the survival package version 3.8-3
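Examples
The etype column suggests a long layout with more than one record per patient. The sketch below splits the records by event type and counts distinct patients. The helper name split_colon_events is hypothetical, and the coding etype 1 = recurrence, 2 = death is an assumption taken from the survival package's documentation of the original colon data, not from this page.

```r
# Hypothetical helper (not part of ForCausality): split the records by
# event type. Assumed coding, per the survival package's documentation of
# the original `colon` data: etype 1 = recurrence, etype 2 = death.
split_colon_events <- function(df) {
  list(
    recurrence = df[which(df$etype == 1), , drop = FALSE],
    death      = df[which(df$etype == 2), , drop = FALSE]
  )
}

# Only runs when ForCausality is installed.
if (requireNamespace("ForCausality", quietly = TRUE)) {
  data("Colon_df", package = "ForCausality")
  events <- split_colon_events(Colon_df)
  print(sapply(events, nrow))          # rows per event type
  print(length(unique(Colon_df$id)))   # number of distinct patients
}
```

Analysing one event type at a time avoids treating the two records of a patient as independent observations.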
Breast Cancer Prognostic Data (German Breast Cancer Study Group)
Description
This dataset, Gbsg_df, provides prognostic factors for breast cancer patients from the German Breast Cancer Study Group (GBSG). The dataset includes 686 observations and 11 variables, containing information on patient demographics, tumor characteristics, hormone receptor status, and outcomes. Some observations contain missing values.
Usage
data(Gbsg_df)
Format
A data frame with 686 observations and 11 variables:
- pid: Patient identifier (integer)
- age: Age at diagnosis (integer)
- meno: Menopausal status (integer indicator)
- size: Tumor size in millimeters (integer)
- grade: Tumor grade (integer)
- nodes: Number of positive lymph nodes (integer)
- pgr: Progesterone receptor level (integer)
- er: Estrogen receptor level (integer)
- hormon: Hormonal therapy received (integer indicator)
- rfstime: Relapse-free survival time in days (integer)
- status: Patient status (integer indicator)
Details
The dataset name has been kept as 'Gbsg_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Source
Data taken from the survival package version 3.8-3
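Examples
A possible starting point for these columns is a Kaplan-Meier estimate of relapse-free survival by hormonal therapy. This sketch assumes the survival package is installed and that status = 1 marks an event; the helper name km_by_hormon is hypothetical and not part of ForCausality.

```r
# Hypothetical helper (not part of ForCausality): Kaplan-Meier curves of
# relapse-free survival, stratified by hormonal therapy.
# Assumption: status == 1 marks an event and status == 0 a censored record.
km_by_hormon <- function(df) {
  survival::survfit(survival::Surv(rfstime, status) ~ hormon, data = df)
}

# Only runs when both packages are installed.
if (requireNamespace("ForCausality", quietly = TRUE) &&
    requireNamespace("survival", quietly = TRUE)) {
  data("Gbsg_df", package = "ForCausality")
  print(km_by_hormon(Gbsg_df))
}
```

The unadjusted curves describe the two groups; they do not by themselves account for differences in prognostic factors between treated and untreated patients.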
Lead Exposure Data
Description
This dataset, Lead_df, is a data frame comparing control and exposed groups under different hygiene and exposure levels. The dataset consists of 33 observations and 6 variables, including measures of exposure, hygiene, and calculated differences between groups. Some observations may contain missing values.
Usage
data(Lead_df)
Format
A data frame with 33 observations and 6 variables:
- control: Control group count (integer)
- exposed: Exposed group count (integer)
- level: Exposure level (factor with 3 levels: "high", "low", "medium")
- hyg: Hygiene level (factor with 3 levels: "good", "mod", "poor")
- both: Combined exposure and hygiene category (factor with 4 levels, e.g. "high.ok", "high.poor", ...)
- dif: Difference between control and exposed (integer)
Details
The dataset name has been kept as 'Lead_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Source
Data taken from the evident package version 1.0.4
Mouse Cancer Trial Data
Description
This dataset, Mouse_df, provides data from mouse cancer trials used in studies by Royston and Altman. The dataset includes 181 observations and 4 variables, covering information on treatment assignment, survival time, outcome, and mouse identifiers. Some observations contain missing values.
Usage
data(Mouse_df)
Format
A data frame with 181 observations and 4 variables:
- trt: Treatment group (factor)
- days: Survival time in days (numeric)
- outcome: Trial outcome (factor)
- id: Mouse identifier (integer)
Details
The dataset name has been kept as 'Mouse_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Source
Data taken from the survival package version 3.8-3
Chronic Pain Clinical Trial Data
Description
This dataset, Pain_df, is a data frame containing clinical trial data for chronic pain treatments. The trial compared active treatment versus placebo across different clinical centers and diagnoses. The dataset consists of 193 observations and 4 variables. Some observations may contain missing values.
Usage
data(Pain_df)
Format
A data frame with 193 observations and 4 variables:
- treat: Treatment group (factor: active vs placebo)
- response: Response outcome (factor)
- center: Clinical trial center (factor)
- diagnosis: Diagnosis category (factor)
Details
The dataset name has been kept as 'Pain_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Source
Data taken from the sanon package version 1.6
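Examples
Because treat and response are both factors, a first look at the trial can be a cross-tabulation with row proportions. The helper name response_by_treatment is hypothetical; the sketch uses only base R and ignores center and diagnosis, which a fuller analysis would likely stratify on.

```r
# Hypothetical helper (not part of ForCausality): distribution of the
# response categories within each treatment group (each row sums to 1).
response_by_treatment <- function(df) {
  counts <- table(treat = df$treat, response = df$response)
  prop.table(counts, margin = 1)
}

# Only runs when ForCausality is installed.
if (requireNamespace("ForCausality", quietly = TRUE)) {
  data("Pain_df", package = "ForCausality")
  print(round(response_by_treatment(Pain_df), 3))
}
```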
Periodontal Disease Data
Description
This dataset, Periodontal_df, is a data frame containing information on smoking habits, demographics, and periodontal health indicators. The dataset consists of 882 observations and 12 variables, including smoking frequency, socioeconomic indicators, and periodontal measures. Some observations may contain missing values.
Usage
data(Periodontal_df)
Format
A data frame with 882 observations and 12 variables:
- SEQN: Sequence identifier (numeric)
- female: Sex indicator (numeric)
- age: Age in years (numeric)
- black: Race indicator for Black participants (numeric)
- educf: Education level (ordered factor with 5 levels)
- income: Income measure (numeric)
- cigsperday: Cigarettes smoked per day (numeric)
- either: Count of sites with periodontal disease (integer)
- neither: Count of sites without periodontal disease (integer)
- pcteither: Percentage of sites with periodontal disease (numeric)
- z: Standardized measure (numeric)
- mset: Additional periodontal health indicator (numeric)
Details
The dataset name has been kept as 'Periodontal_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Source
Data taken from the evident package version 1.0.4
External Control Trial Data for Post-partum Hemorrhage
Description
This dataset, Pph_df, provides data from an external control trial of treatments for post-partum hemorrhage. The dataset includes 802 observations and 17 variables, containing information on blood loss, treatment assignment, demographic characteristics, and educational background. Some observations contain missing values.
Usage
data(Pph_df)
Format
A data frame with 802 observations and 17 variables:
- cum_blood_20m: Cumulative blood loss at 20 minutes (numeric)
- tx: Treatment indicator (numeric)
- age: Age of the participant (numeric)
- no_educ: Indicator for no formal education (numeric)
- ...: Additional variables related to treatment and outcomes (numeric)
Details
The dataset name has been kept as 'Pph_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Source
Data taken from the causalOT package version 1.0.2
Respiratory Disorder Clinical Trial Data
Description
This dataset, Resp_df, is a data frame containing repeated measurements from a clinical trial on respiratory disorders under two treatment conditions. The dataset records demographic information (center, sex, age), baseline measures, and follow-up measurements across four visits. It consists of 111 observations and 9 variables. Some observations may contain missing values.
Usage
data(Resp_df)
Format
A data frame with 111 observations and 9 variables:
- center: Clinical trial center (factor)
- treatment: Treatment group (character)
- sex: Sex of the participant (character)
- age: Age of the participant (integer)
- baseline: Baseline measurement (integer)
- visit1: Measurement at visit 1 (integer)
- visit2: Measurement at visit 2 (integer)
- visit3: Measurement at visit 3 (integer)
- visit4: Measurement at visit 4 (integer)
Details
The dataset name has been kept as 'Resp_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Source
Data taken from the sanon package version 1.6
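Examples
The four visit columns are stored in wide form, one row per participant. For repeated-measures models it is often convenient to stack them into one row per participant and visit. The sketch below does this with base R's reshape(); the helper name resp_to_long and the added columns subject, visit, and score are hypothetical names chosen for illustration.

```r
# Hypothetical helper (not part of ForCausality): stack visit1..visit4 into
# long form. Adds a row-number `subject` id, a `visit` index (1 to 4), and
# a `score` column holding the measurement.
resp_to_long <- function(df) {
  df$subject <- seq_len(nrow(df))
  reshape(
    df,
    direction = "long",
    varying = c("visit1", "visit2", "visit3", "visit4"),
    v.names = "score",
    timevar = "visit",
    times = 1:4,
    idvar = "subject"
  )
}

# Only runs when ForCausality is installed.
if (requireNamespace("ForCausality", quietly = TRUE)) {
  data("Resp_df", package = "ForCausality")
  long <- resp_to_long(Resp_df)
  print(dim(long))
  print(head(long))
}
```

The long data frame has four rows per participant and keeps center, treatment, sex, age, and baseline as repeated covariates.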
Breast Cancer Prognostic Data (Rotterdam Study)
Description
This dataset, Rotterdam_df, provides prognostic factors for breast cancer patients used in the studies of Royston and Altman. The dataset includes 2,982 observations and 15 variables, covering patient demographics, tumor characteristics, treatments, and outcomes. Some observations contain missing values.
Usage
data(Rotterdam_df)
Format
A data frame with 2,982 observations and 15 variables:
- pid: Patient identifier (integer)
- year: Year of surgery (integer)
- age: Age at diagnosis (integer)
- meno: Menopausal status (integer indicator)
- size: Tumor size category (factor)
- grade: Tumor grade (integer)
- nodes: Number of positive lymph nodes (integer)
- pgr: Progesterone receptor level (integer)
- er: Estrogen receptor level (integer)
- hormon: Hormonal therapy received (integer indicator)
- chemo: Chemotherapy received (integer indicator)
- rtime: Relapse-free survival time in days (numeric)
- recur: Recurrence indicator (integer)
- dtime: Time to death in days (numeric)
- death: Death indicator (integer)
Details
The dataset name has been kept as 'Rotterdam_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Source
Data taken from the survival package version 3.8-3
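Examples
The recurrence and death columns can be combined into a single recurrence-free survival endpoint. The sketch below shows one common construction, assuming recur and death are coded 1 for an event and 0 otherwise; the helper name add_rfs and the new columns rfs and rfstime are hypothetical, and other definitions of the composite endpoint are possible.

```r
# Hypothetical helper (not part of ForCausality): add a composite
# recurrence-free survival endpoint.
# Assumption: recur and death are coded 1 = event, 0 = no event.
add_rfs <- function(df) {
  # Event indicator: 1 if the patient had a recurrence or died
  df$rfs <- pmax(df$recur, df$death)
  # Time: rtime for patients with a recurrence, otherwise dtime
  df$rfstime <- ifelse(df$recur == 1, df$rtime, df$dtime)
  df
}

# Only runs when ForCausality is installed.
if (requireNamespace("ForCausality", quietly = TRUE)) {
  data("Rotterdam_df", package = "ForCausality")
  rott <- add_rfs(Rotterdam_df)
  print(table(rott$rfs))
  print(summary(rott$rfstime))
}
```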
Seborrheic Dermatitis Clinical Trial Data
Description
This dataset, Sebor_df, is a data frame containing clinical trial data on seborrheic dermatitis, comparing test and placebo treatments. It records participant center, treatment assignment, dermatitis scores across three assessments, and severity indicators at the same points. The dataset consists of 167 observations and 8 variables. Some observations may contain missing values.
Usage
data(Sebor_df)
Format
A data frame with 167 observations and 8 variables:
- center: Clinical trial center (factor)
- treat: Treatment group: test or placebo (character)
- score1: Dermatitis score at assessment 1 (integer)
- score2: Dermatitis score at assessment 2 (integer)
- score3: Dermatitis score at assessment 3 (integer)
- severity1: Severity indicator at assessment 1 (integer)
- severity2: Severity indicator at assessment 2 (integer)
- severity3: Severity indicator at assessment 3 (integer)
Details
The dataset name has been kept as 'Sebor_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Source
Data taken from the sanon package version 1.6
Skin Condition Clinical Trial Data
Description
This dataset, Skin_df, is a data frame containing clinical trial data on skin conditions, comparing responses under placebo and test treatments. It includes participant center, treatment assignment, disease stage, and responses across three assessments. The dataset consists of 172 observations and 6 variables. Some observations may contain missing values.
Usage
data(Skin_df)
Format
A data frame with 172 observations and 6 variables:
- center: Clinical trial center (factor)
- treat: Treatment group: placebo or test (factor)
- stage: Disease stage (integer)
- res1: Response at assessment 1 (integer)
- res2: Response at assessment 2 (integer)
- res3: Response at assessment 3 (integer)
Details
The dataset name has been kept as 'Skin_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Source
Data taken from the sanon package version 1.6
Smoking and Homocysteine Data
Description
This dataset, SmokeH_df, is a data frame containing information on smoking, homocysteine levels, demographics, and socioeconomic indicators. The dataset consists of 2,475 observations and 15 variables, including biomarkers, smoking-related measures, age, education, and poverty ratio. Some observations contain missing values.
Usage
data(SmokeH_df)
Format
A data frame with 2,475 observations and 15 variables:
- SEQN: Participant identifier (integer)
- homocysteine: Homocysteine level (numeric)
- z: Z score indicator (integer)
- female: Sex indicator (integer, 1 = female, 0 = male)
- age: Age in years (integer)
- education: Education level (integer code)
- povertyr: Poverty ratio (numeric)
- bmi: Body mass index (numeric)
- cotinine: Cotinine level (numeric)
- st: Smoking type indicator (integer)
- stf: Smoking type (character string)
- age3: Age category (integer code)
- ed3: Education category (integer code)
- bmi3: BMI category (integer code)
- pov2: Poverty category (logical)
Details
The dataset name has been kept as 'SmokeH_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Source
Data taken from the evident package version 1.0.4
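Examples
A simple descriptive comparison is the median homocysteine level within each smoking type. The helper name homocysteine_by_group is hypothetical; the sketch uses base R's aggregate(), whose formula interface drops rows with missing values by default.

```r
# Hypothetical helper (not part of ForCausality): median homocysteine level
# per smoking type. Rows with missing values in either column are dropped
# by aggregate()'s default na.action.
homocysteine_by_group <- function(df) {
  aggregate(homocysteine ~ stf, data = df, FUN = median)
}

# Only runs when ForCausality is installed.
if (requireNamespace("ForCausality", quietly = TRUE)) {
  data("SmokeH_df", package = "ForCausality")
  print(homocysteine_by_group(SmokeH_df))
}
```

These are unadjusted group summaries; covariates such as age, bmi, and povertyr would need to be accounted for before drawing causal conclusions.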
Ischemic Stroke Case-Control Data
Description
This dataset, Stroke_df, contains fictional case-control data for ischemic stroke, including exposures, risk factors, and confounders. The dataset includes 16,623 observations and 21 variables, covering demographic details, lifestyle factors, biomarkers, and comorbidities. Some observations contain missing values.
Usage
data(Stroke_df)
Format
A data frame with 16,623 observations and 21 variables:
- regionnn7: Geographic region (factor)
- case: Case indicator for ischemic stroke (numeric)
- esex: Sex of the participant (integer)
- eage: Age of the participant (integer)
- htnadmbp: Hypertension or blood pressure measure (numeric)
- nevfcur: Smoking status (factor)
- global_stress2: Perceived stress indicator (factor)
- whrs2tert: Waist-to-hip ratio tertiles (factor)
- phys: Physical activity indicator (factor)
- alcohfreqwk: Weekly alcohol consumption frequency (factor)
- dmhba1c2: Diabetes / HbA1c category (factor)
- cardiacrfcat: Cardiac risk factor category (factor)
- ahei3tert: Alternative Healthy Eating Index tertiles (factor)
- apob_apoatert: ApoB/ApoA ratio tertiles (factor)
- subeduc: Sub-education level (factor)
- moteduc: Mother’s education level (factor)
- fatduc: Father’s education level (factor)
- subhtn: Sub-hypertension indicator (factor)
- whr: Waist-to-hip ratio (numeric)
- apob_apoa: ApoB/ApoA continuous ratio (numeric)
- weights: Sample weights (numeric)
Details
The dataset name has been kept as 'Stroke_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Source
Data taken from the causalPAF package version 1.2.5
Thiamethoxam Application and Crop Yield Data
Description
This dataset, Thiam_df, is a data frame containing information on thiamethoxam applications and crop yield measurements in squash plants. The dataset consists of 165 observations and 11 variables, including treatment types, plant variety, replication, fruit counts, yield measures, and defoliation indicators. Some observations may contain missing values.
Usage
data(Thiam_df)
Format
A data frame with 165 observations and 11 variables:
- trt: Treatment type (factor)
- var: Plant variety (factor)
- rep: Replication block (factor)
- fruit: Number of fruits (numeric)
- avg_mass: Average fruit mass (numeric)
- mass: Total fruit mass (numeric)
- yield: Crop yield (numeric)
- visit: Pollinator visit count (numeric)
- foliage: Foliage measure (numeric)
- scb: Squash vine borer damage (numeric)
- defoliation: Defoliation percentage (numeric)
Details
The dataset name has been kept as 'Thiam_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Source
Data taken from the melt package version 1.11.4
Ursodeoxycholic Acid Trial Data
Description
This dataset, Udca_df, contains data from a clinical trial of ursodeoxycholic acid (UDCA). The dataset includes 1,360 observations and 8 variables, covering treatment assignment, disease stage, bilirubin levels, risk scores, follow-up time, and outcomes. Some observations contain missing values.
Usage
data(Udca_df)
Format
A data frame with 1,360 observations and 8 variables:
- id: Patient identifier (integer)
- trt: Treatment group (integer)
- stage: Disease stage (integer)
- bili: Bilirubin level (numeric)
- riskscore: Calculated risk score (numeric)
- futime: Follow-up time in days (numeric)
- status: Patient status indicator (numeric)
- endpoint: Endpoint description (character)
Details
The dataset name has been kept as 'Udca_df' to avoid confusion with other datasets in the R ecosystem. This naming convention helps distinguish this dataset as part of the ForCausality package and assists users in identifying its specific characteristics. The suffix 'df' indicates that the dataset is a data frame. The original content has not been modified in any way.
Source
Data taken from the survival package version 3.8-3
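Examples
Since each row carries an endpoint description, a useful first check is how many records fall under each endpoint and status value. The helper name events_by_endpoint is hypothetical; the sketch makes no assumption about the endpoint labels or the status coding beyond what the Format section lists.

```r
# Hypothetical helper (not part of ForCausality): cross-tabulate the
# endpoint description against the status indicator.
events_by_endpoint <- function(df) {
  table(endpoint = df$endpoint, status = df$status)
}

# Only runs when ForCausality is installed.
if (requireNamespace("ForCausality", quietly = TRUE)) {
  data("Udca_df", package = "ForCausality")
  print(events_by_endpoint(Udca_df))
}
```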