---
title: "Diagnostic Functions Guide"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Diagnostic Functions Guide}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(tidyaudit)
library(dplyr)
```

tidyaudit includes tidyverse ports of the diagnostic functions from
[dtaudit](https://github.com/fpcordeiro/dtaudit). These functions help you
understand joins, validate keys, compare tables, diagnose missing values and
string quality, and filter with full visibility.

## Join diagnostics

`validate_join()` analyzes a potential join **without performing it**,
reporting match rates, relationship type, duplicate keys, and unmatched rows.

```{r validate-join}
orders <- data.frame(
  id     = c(1L, 2L, 3L, 3L, 4L, 5L),
  amount = c(100, 200, 150, 175, 300, 50)
)
customers <- data.frame(
  id   = c(2L, 3L, 6L),
  name = c("Alice", "Bob", "Carol")
)

validate_join(orders, customers, by = "id")
```

### Different key names

When the key columns have different names, use a named vector:

```{r validate-join-named}
products <- data.frame(prod_id = 1:3, price = c(10, 20, 30))
sales    <- data.frame(item_id = c(1L, 1L, 2L), qty = c(5, 3, 7))

validate_join(products, sales, by = c("prod_id" = "item_id"))
```

### Stat tracking

Track the impact on a numeric column with `stat` (same column name in both
tables) or `stat_x`/`stat_y` (different column names):

```{r validate-join-stat}
x <- data.frame(id = 1:4, revenue = c(100, 200, 300, 400))
y <- data.frame(id = c(2L, 3L, 5L), cost = c(10, 20, 30))

validate_join(x, y, by = "id", stat_x = "revenue", stat_y = "cost")
```

## Key validation

### Primary keys

`validate_primary_keys()` tests whether a set of columns uniquely identifies
every row:

```{r validate-pk}
df <- data.frame(
  id    = c(1L, 2L, 3L, 3L, 4L),
  group = c("A", "A", "B", "C", "A"),
  value = c(10, 20, 30, 40, 50)
)

# Single column: not unique
validate_primary_keys(df,
                      "id")

# Composite key: unique
validate_primary_keys(df, c("id", "group"))
```

### Variable relationships

`validate_var_relationship()` determines the relationship between two columns:

```{r validate-var-rel}
df2 <- data.frame(
  dept    = c("Sales", "Sales", "Engineering", "Engineering"),
  manager = c("Ann", "Ann", "Bob", "Bob")
)

validate_var_relationship(df2, "dept", "manager")
```

## Table comparison

`compare_tables()` compares two data.frames by examining columns, row counts,
key overlap, and numeric discrepancies:

```{r compare-tables}
before <- data.frame(id = 1:5, value = c(10.0, 20.0, 30.0, 40.0, 50.0))
after  <- data.frame(id = 1:5, value = c(10.0, 22.5, 30.0, 40.0, 55.0))

compare_tables(before, after)
```

## Filter diagnostics

`filter_keep()` and `filter_drop()` filter data while printing diagnostics
about what was removed.

### filter_keep

Keeps rows where the condition is `TRUE` (same as `dplyr::filter()`):

```{r filter-keep}
sales <- data.frame(
  id     = 1:10,
  amount = c(500, 25, 1200, 80, 3000, 15, 750, 40, 2000, 60),
  status = rep(c("valid", "suspect"), 5)
)

result <- filter_keep(sales, amount > 100, .stat = amount)
```

### filter_drop

Drops rows where the condition is `TRUE` (the inverse):

```{r filter-drop}
result2 <- filter_drop(sales, status == "suspect", .stat = amount)
```

### Warning thresholds

Set `.warn_threshold` to get a warning when too many rows are dropped:

```{r filter-warn, warning=TRUE}
filter_keep(sales, amount > 1000, .stat = amount, .warn_threshold = 0.5)
```

## Data quality

### Missing value diagnosis

`diagnose_nas()` reports NA counts and percentages for every column:

```{r diagnose-nas}
messy <- data.frame(
  id    = 1:6,
  name  = c("A", NA, "C", "D", NA, "F"),
  score = c(10, 20, NA, NA, 50, NA),
  grade = c("A", "B", "C", NA, "A", "B")
)

diagnose_nas(messy)
```

### Column summaries

`summarize_column()` gives type-appropriate statistics for a single vector:

```{r summarize-column}
summarize_column(c(1, 2, 3, NA, 5, 10, 100))
summarize_column(c("apple", "banana", "apple", "cherry", NA))
```

`get_summary_table()` applies this to all columns (or selected ones):

```{r get-summary-table}
get_summary_table(messy)
```

## String cleaning

These functions require the **stringi** package (listed in Suggests).

### diagnose_strings

`diagnose_strings()` audits a character vector for common quality issues:

```{r diagnose-strings}
firms <- c("Apple", "APPLE", "apple", " Microsoft ", "Google", NA, "")

diagnose_strings(firms)
```

### audit_transform

`audit_transform()` shows exactly what a transformation function changes:

```{r audit-transform}
audit_transform(firms, trimws)
audit_transform(firms, tolower)
```
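
As a rough illustration of what "showing what a transformation changes" involves, the sketch below compares a vector before and after applying a function, using base R only. `diff_transform()` is a hypothetical helper written for this guide. It is not a tidyaudit function, and it makes no claim about how `audit_transform()` is implemented or what it prints.

```r
# Hypothetical helper (not part of tidyaudit): list the elements of `x`
# that a transformation function `f` changes.
diff_transform <- function(x, f) {
  after <- f(x)
  stopifnot(length(after) == length(x))

  # An element is unchanged if both values are NA,
  # or both are non-NA and equal.
  same <- (is.na(x) & is.na(after)) |
    (!is.na(x) & !is.na(after) & x == after)

  idx <- which(!same)
  data.frame(index = idx, before = x[idx], after = after[idx])
}

firms <- c("Apple", "APPLE", "apple", " Microsoft ", "Google", NA, "")

# Only the value with surrounding whitespace is affected by trimws().
trim_diff <- diff_transform(firms, trimws)

# Every value containing an uppercase letter is affected by tolower().
lower_diff <- diff_transform(firms, tolower)
```

The `NA` and empty-string entries are left untouched by both functions, so they do not appear in either result.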