---
title: "EpidigiR: Digital Epidemiological Analysis and Visualization Tools"
author: "Esther Atsabina Wanjala"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 2
    number_sections: true
vignette: >
  %\VignetteIndexEntry{EpidigiR: Digital Epidemiological Analysis and Visualization Tools}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## Introduction to EpidigiR: Epidemiological Analysis and Visualization

EpidigiR is an R package for epidemiological analysis, modeling, and visualization, designed with minimal dependencies and comprehensive functionality. It provides three main functions to cover 12 epidemiological topics, including a digital epidemiology aspect that leverages real-time data integration and advanced computational techniques to enhance disease tracking and prediction.

- **epi_analyze**: Performs summary statistics, SIR modeling, DALY calculations, age standardization, diagnostic test evaluation, and NLP keyword extraction.
- **epi_model**: Handles clinical trial power calculation, survival analysis, SNP association, logistic regression, k-means clustering, Random Forest, and SVM.
- **epi_visualize**: Creates visualizations for prevalence mapping, epidemic curves, scatter plots, and boxplots.

The package includes ten datasets to support these analyses: epi_prevalence, sir_data, geno_data, ml_data, nlp_data, clinical_data, daly_data, survey_data, diagnostic_data, and survival_data.

This vignette demonstrates how to use these functions and datasets for various epidemiological tasks.
## Setup

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)

# Required packages
required_packages <- c(
  "deSolve", "sp", "tm", "glmnet", "caret",
  "kernlab", "survival", "randomForest", "EpidigiR"
)

missing_pkgs <- required_packages[
  !vapply(required_packages, requireNamespace, logical(1L), quietly = TRUE)
]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  message("Install them manually before running this vignette: install.packages(missing_pkgs)")
}

for (pkg in required_packages) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    suppressPackageStartupMessages(library(pkg, character.only = TRUE))
  }
}

# Prepare datasets
if (exists("ml_data")) {
  ml_data$outcome <- as.factor(ml_data$outcome)
}
if (exists("clinical_data")) {
  clinical_data$outcome <- as.factor(clinical_data$outcome)
}
```

## Datasets

The package includes the following datasets:

- **epi_prevalence**: Disease prevalence by region and age group, with spatial coordinates (12 rows).
- **sir_data**: Simulated SIR model output (50 rows).
- **geno_data**: Genotype and case-control data for SNP analysis (100 rows).
- **ml_data**: Patient data for machine learning (logistic regression, clustering, Random Forest, SVM; 100 rows).
- **nlp_data**: Epidemiological text data for NLP (100 rows).
- **clinical_data**: Clinical trial data for power calculations and outcome analysis (200 rows).
- **daly_data**: Data for DALY calculations (20 rows).
- **survey_data**: Data for age standardization (20 rows).
- **diagnostic_data**: Data for diagnostic test evaluation (10 rows).
- **survival_data**: Data for survival analysis (100 rows).
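
The row counts listed above can be checked from the R console. The sketch below defines `dataset_overview()`, a hypothetical helper that is not part of EpidigiR, on top of base R's `utils::data()`. It assumes each dataset name loads a single object of the same name, and it only queries EpidigiR when the package is installed.

```{r}
# Hypothetical helper (not part of EpidigiR): tabulate the dimensions of
# datasets shipped with an installed package.
dataset_overview <- function(names, package) {
  rows <- lapply(names, function(nm) {
    env <- new.env()
    # Load the dataset into a scratch environment, leaving the workspace untouched
    utils::data(list = nm, package = package, envir = env)
    obj <- get(nm, envir = env)  # assumes the loaded object is also called `nm`
    data.frame(dataset = nm, rows = NROW(obj), cols = NCOL(obj),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Dataset names are taken from the list above
if (requireNamespace("EpidigiR", quietly = TRUE)) {
  dataset_overview(
    c("epi_prevalence", "sir_data", "geno_data", "ml_data", "nlp_data",
      "clinical_data", "daly_data", "survey_data", "diagnostic_data",
      "survival_data"),
    package = "EpidigiR"
  )
}
```

The same helper works for any installed package, for example `dataset_overview(c("iris", "mtcars"), package = "datasets")`.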
## Examples

## Summary Statistics

```{r}
data(epi_prevalence)
result <- epi_analyze(
  epi_prevalence,
  outcome = "cases",
  population = "population",
  group = "region",
  type = "summary"
)
print(result)
```

## SIR Epidemic Model

```{r}
sir_result <- epi_analyze(
  data = NULL,
  outcome = NULL,
  type = "sir",
  N = 1000,
  beta = 0.3,
  gamma = 0.1,
  days = 50
)
epi_visualize(sir_result, x = "time", y = "Infected", type = "curve", main = "Epidemic Curve")
```

## Spatial Map

```{r}
data(epi_prevalence)
coordinates(epi_prevalence) <- ~lon + lat
epi_visualize(epi_prevalence, x = "prevalence", type = "map", main = "Prevalence Map")
```

## Logistic Model

```{r}
data(clinical_data)
clinical_data$outcome <- as.factor(clinical_data$outcome)
model <- epi_model(clinical_data, formula = outcome ~ age + health_score + dose, type = "logistic")
head(model$predictions)
```

## Random Forest with Clinical Data

```{r}
rf_model <- epi_model(clinical_data, formula = outcome ~ age + health_score + dose, type = "rf")
head(rf_model$predictions)
```

## Global Health Burden (DALY)

```{r}
data(daly_data)
epi_analyze(daly_data, outcome = NULL, type = "daly")
```

## SNP Association

```{r}
data(geno_data)
epi_model(geno_data, formula = outcome ~ snp1 + snp2, type = "snp")
```

## Age Standardization

```{r}
data(survey_data)
epi_analyze(survey_data, outcome = NULL, type = "age_standardize")
```

## Machine Learning: Logistic Regression

```{r}
data(ml_data)
epi_model(ml_data, formula = outcome ~ age + exposure + genetic_risk, type = "logistic")
```

## Survival Analysis

Perform survival analysis using survival_data.
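
As background for this step, the sketch below runs a typical survival workflow with the survival package directly. It uses that package's bundled `lung` data instead of survival_data, because the column names of survival_data are not shown in this vignette. Whether `epi_model(type = "survival")` wraps the same calls internally is an assumption, not something documented here.

```{r}
# Background sketch using the survival package directly, not EpidigiR.
# `lung` ships with survival; status is coded 1 = censored, 2 = dead.
library(survival)

# Kaplan-Meier estimate of the survival function, stratified by sex
km_fit <- survfit(Surv(time, status) ~ sex, data = lung)
print(km_fit)

# Log-rank test for a difference between the two survival curves
logrank <- survdiff(Surv(time, status) ~ sex, data = lung)
print(logrank)

# Cox proportional hazards model for sex, adjusted for age
cox_fit <- coxph(Surv(time, status) ~ sex + age, data = lung)
summary(cox_fit)$coefficients
```

The EpidigiR call for the same topic follows.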
```{r}
data(survival_data)
epi_model(survival_data, type = "survival")
```

## NLP Keyword Extraction

```{r}
data(nlp_data)
nlp_result <- epi_analyze(nlp_data, outcome = NULL, population = NULL, type = "nlp", n = 5)
head(nlp_result)
```

## K-means Clustering

```{r}
data(ml_data)
epi_model(ml_data[, c("age", "exposure", "genetic_risk")], type = "kmeans", k = 3)
```

## SVM Modeling

```{r}
data(ml_data)
ml_data$outcome <- as.factor(ml_data$outcome)
svm_model <- epi_model(ml_data, formula = outcome ~ age + exposure + genetic_risk, type = "svmRadial")
svm_model$performance
```

## Diagnostic Tests

```{r}
data(diagnostic_data)
epi_analyze(diagnostic_data, outcome = NULL, type = "diagnostic")
```

## Boxplot Visualization

```{r}
data(clinical_data)
epi_visualize(clinical_data, x = "arm", y = "outcome", type = "boxplot", main = "Outcome by Treatment Arm")
```

## Scatter Plot Visualization

```{r}
data(ml_data)
epi_visualize(ml_data, x = "age", y = "outcome", type = "scatter", main = "Age vs. Disease Outcome")
```

## Conclusion

EpidigiR offers a compact toolkit for epidemiological analysis built around three functions (epi_analyze, epi_model, and epi_visualize) and ten bundled datasets. Together they support the analyses shown in this vignette, from SIR modeling to machine learning methods such as Random Forest and SVM. The package also includes a digital epidemiology component that uses real-time data and advanced computational approaches to improve disease monitoring and forecasting, making it a useful resource for researchers and analysts.

## License

EpidigiR is released under the MIT License © 2025 Esther Atsabina Wanjala.