Diagnostic Functions Guide

library(tidyaudit)
library(dplyr)

Before you build a pipeline, you need to understand your data. Before you trust a join, you need to verify the keys. Before you filter, you want to know what you’ll lose. tidyaudit’s diagnostic functions help with the detective work that comes before, during, and after every analysis.

These functions are designed for interactive use — the questions you ask in the console while exploring a dataset or debugging a transformation. They complement the audit trail system, which instruments production pipelines.

Before you join

You’re about to left-join orders to customers. Will every order match? Will the join create duplicates? How many rows will end up with NAs?

validate_join() analyzes a potential join without performing it, reporting match rates, relationship type, duplicate keys, and unmatched rows:

orders <- data.frame(
  id     = c(1L, 2L, 3L, 3L, 4L, 5L),
  amount = c(100, 200, 150, 175, 300, 50)
)
customers <- data.frame(
  id   = c(2L, 3L, 6L),
  name = c("Alice", "Bob", "Carol")
)

validate_join(orders, customers, by = "id")
#> 
#> ── Join Validation: orders ↔ customers ─────────────────────────────────────────
#> Keys in orders: id
#> Keys in customers: id
#> 
#>   Item                                               Value
#>   ───────────────────────────────────────────  ───────────
#>   Relationship                                 many-to-one
#>   Key(s) in orders   [id]                          (1 col)
#>   Key(s) in customers   [id]                       (1 col)
#>   Rows in orders                                         6
#>   Distinct key combos in orders                          5
#>   Rows in customers                                      3
#>   Distinct key combos in customers                       3
#>   Overlapping distinct key combos                        2
#>   Matched row pairs (cartesian)                          3
#>   Match rate from orders                            50.00%
#>   Match rate from customers                         66.67%
#>   Rows only in orders (no match in customers)            3
#>   Rows only in customers (no match in orders)            1
#> 
#> Duplicates: orders=yes customers=no
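The headline counts are easy to cross-check by hand with dplyr's filtering joins. This is a sketch of the underlying arithmetic, not part of tidyaudit:

```r
library(dplyr)

orders <- data.frame(
  id     = c(1L, 2L, 3L, 3L, 4L, 5L),
  amount = c(100, 200, 150, 175, 300, 50)
)
customers <- data.frame(
  id   = c(2L, 3L, 6L),
  name = c("Alice", "Bob", "Carol")
)

# Orders rows with at least one matching customer (the 50.00% match rate)
nrow(semi_join(orders, customers, by = "id")) / nrow(orders)  # 0.5

# Unmatched rows on each side
nrow(anti_join(orders, customers, by = "id"))   # 3 rows only in orders
nrow(anti_join(customers, orders, by = "id"))   # 1 row only in customers
```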

Different key names

When the key columns have different names, use a named vector:

products <- data.frame(prod_id = 1:3, price = c(10, 20, 30))
sales    <- data.frame(item_id = c(1L, 1L, 2L), qty = c(5, 3, 7))

validate_join(products, sales, by = c("prod_id" = "item_id"))
#> 
#> ── Join Validation: products ↔ sales ───────────────────────────────────────────
#> Keys in products: prod_id
#> Keys in sales: item_id
#> 
#>   Item                                             Value
#>   ─────────────────────────────────────────  ───────────
#>   Relationship                               one-to-many
#>   Key(s) in products   [prod_id]                 (1 col)
#>   Key(s) in sales   [item_id]                    (1 col)
#>   Rows in products                                     3
#>   Distinct key combos in products                      3
#>   Rows in sales                                        3
#>   Distinct key combos in sales                         2
#>   Overlapping distinct key combos                      2
#>   Matched row pairs (cartesian)                        3
#>   Match rate from products                        66.67%
#>   Match rate from sales                          100.00%
#>   Rows only in products (no match in sales)            1
#>   Rows only in sales (no match in products)            0
#> 
#> Duplicates: products=no sales=yes

Tracking the impact on a metric

Want to know the revenue at risk from unmatched keys? Use stat (same column name in both tables) or stat_x/stat_y (different column names):

x <- data.frame(id = 1:4, revenue = c(100, 200, 300, 400))
y <- data.frame(id = c(2L, 3L, 5L), cost = c(10, 20, 30))

validate_join(x, y, by = "id", stat_x = "revenue", stat_y = "cost")
#> 
#> ── Join Validation: x ↔ y ──────────────────────────────────────────────────────
#> Keys in x: id
#> Keys in y: id
#> 
#>   Item                                  Value
#>   ───────────────────────────────  ──────────
#>   Relationship                     one-to-one
#>   Key(s) in x   [id]                  (1 col)
#>   Key(s) in y   [id]                  (1 col)
#>   Rows in x                                 4
#>   Distinct key combos in x                  4
#>   Rows in y                                 3
#>   Distinct key combos in y                  3
#>   Overlapping distinct key combos           2
#>   Matched row pairs (cartesian)             2
#>   Match rate from x                    50.00%
#>   Match rate from y                    66.67%
#>   Rows only in x (no match in y)            2
#>   Rows only in y (no match in x)            1
#> 
#> ── Stat diagnostics ────────────────────────────────────────────────────────────
#> 
#> revenue in x:
#> • Total: 1,000
#> • Matched: 500 (50.00%)
#> • Unmatched: 500 (50.00%)
#> 
#> cost in y:
#> • Total: 60
#> • Matched: 30 (50.00%)
#> • Unmatched: 30 (50.00%)
#> 
#> Duplicates: x=no y=no
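The stat diagnostics reduce to sums over matched and unmatched keys. A base-R cross-check, for intuition only:

```r
x <- data.frame(id = 1:4, revenue = c(100, 200, 300, 400))
y <- data.frame(id = c(2L, 3L, 5L), cost = c(10, 20, 30))

matched <- x$id %in% y$id
sum(x$revenue[matched])                    # 500 matched
sum(x$revenue[!matched])                   # 500 at risk
sum(x$revenue[!matched]) / sum(x$revenue)  # 0.5, the 50.00% unmatched share
```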

Are your keys unique?

You assume id uniquely identifies every row. Does it?

validate_primary_keys() tests whether a set of columns forms a valid primary key:

df <- data.frame(
  id    = c(1L, 2L, 3L, 3L, 4L),
  group = c("A", "A", "B", "C", "A"),
  value = c(10, 20, 30, 40, 50)
)

# Single column -- not unique
validate_primary_keys(df, "id")
#> 
#> ── Primary Key Validation ──────────────────────────────────────────────────────
#> Table: df
#> Key column: id
#> 
#>   Metric                   Value
#>   ───────────────────────  ─────
#>   Total rows                   5
#>   Unique key combinations      4
#>   Duplicate key combos         1
#> 
#> ✖ NO - Keys do NOT uniquely identify all rows.
#> 
#> Duplicate keys (showing up to 10):
#>   id n
#> 1  3 2

# Composite key -- unique
validate_primary_keys(df, c("id", "group"))
#> 
#> ── Primary Key Validation ──────────────────────────────────────────────────────
#> Table: df
#> Key columns: id and group
#> 
#>   Metric                   Value
#>   ───────────────────────  ─────
#>   Total rows                   5
#>   Unique key combinations      5
#>   Duplicate key combos         0
#> 
#> ✔ YES - Keys uniquely identify all rows.
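For a quick interactive check without the formatted report, base R gets you most of the way. A rough equivalent:

```r
df <- data.frame(
  id    = c(1L, 2L, 3L, 3L, 4L),
  group = c("A", "A", "B", "C", "A"),
  value = c(10, 20, 30, 40, 50)
)

# A key is valid when no key combination repeats
anyDuplicated(df["id"]) == 0              # FALSE: id alone is not unique
anyDuplicated(df[c("id", "group")]) == 0  # TRUE: the composite key works

# Which key values repeat, and how often
counts <- table(df$id)
counts[counts > 1]                        # id 3 appears twice
```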

Beyond uniqueness: what’s the relationship?

validate_var_relationship() determines the functional relationship between two columns — one-to-one, one-to-many, many-to-one, or many-to-many:

df2 <- data.frame(
  dept    = c("Sales", "Sales", "Engineering", "Engineering"),
  manager = c("Ann", "Ann", "Bob", "Bob")
)
validate_var_relationship(df2, "dept", "manager")
#> 
#> ── Variable Relationship Validation ────────────────────────────────────────────
#> Table: df2
#> Variables: dept ↔ manager
#> 
#>   Metric                        Value
#>   ────────────────────────────  ─────
#>   Unique values in dept             2
#>   Unique values in manager          2
#>   Unique (dept, manager) pairs      2
#> 
#> dept → manager: one-to-one
#> manager → dept: one-to-one
#> 
#> Relationship: ONE-TO-ONE
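The classification rests on two numbers: the maximum count of distinct partners in each direction. A minimal base-R sketch of that logic (the helper name is ours, not tidyaudit's):

```r
df2 <- data.frame(
  dept    = c("Sales", "Sales", "Engineering", "Engineering"),
  manager = c("Ann", "Ann", "Bob", "Bob")
)

# Maximum number of distinct b-values observed for any single a-value
n_distinct_partners <- function(a, b) {
  max(tapply(b, a, function(v) length(unique(v))))
}

n_distinct_partners(df2$dept, df2$manager)  # 1 manager per dept
n_distinct_partners(df2$manager, df2$dept)  # 1 dept per manager
# Both 1 means one-to-one; a value > 1 on either side would make that
# direction "-to-many"
```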

What changed between versions?

You have yesterday’s extract and today’s extract. Or the data before your transformation and after. What’s different?

compare_tables() compares two data frames by examining columns, row counts, key overlap, and numeric discrepancies:

before <- data.frame(id = 1:5, value = c(10.0, 20.0, 30.0, 40.0, 50.0))
after  <- data.frame(id = 1:5, value = c(10.0, 22.5, 30.0, 40.0, 55.0))

compare_tables(before, after)
#> 
#> ── Table Comparison: before vs after ───────────────────────────────────────────
#> 1. Row counts
#> before: 5 rows
#> after: 5 rows
#> Difference: =
#> 
#> 2. Column names
#> Matching columns: 2
#> Only in before: 0
#> Only in after: 0
#> Type mismatches: 0
#> 
#> 3. Key columns
#> Key columns: id (auto-detected)
#> Distinct combos in before: 5
#> Distinct combos in after: 5
#> 
#> 4. Row matching
#> Only in before: 0
#> Only in after: 0
#> Matched, no discrepancies: 3 (60%)
#> Matched, with discrepancies: 2 (40%)
#> Total cell discrepancies: 2 (2 numeric, 0 categorical)
#> 
#> 5. Numeric discrepancies (absolute differences)
#> Compared after merging on keys.
#> Rows matched: 5
#> 
#>     Column  N  Min  Q25  Median  Q75  Max  >tol
#>     ──────  ─  ───  ───  ──────  ───  ───  ────
#>     value   5    0    0       0  2.5    5     2
#> 
#> Top discrepancies:
#>     id  column  value_x  value_y  abs_diff  pct_diff
#>     ──  ──────  ───────  ───────  ────────  ────────
#>     5   value        50     55.0       5.0      9.1%
#>     2   value        20     22.5       2.5     11.1%
#> 
#> 6. Categorical discrepancies
#> No categorical discrepancies found.
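When you only care about one numeric column, the same diff can be pulled out by hand with a merge. A sketch, not the full report:

```r
before <- data.frame(id = 1:5, value = c(10.0, 20.0, 30.0, 40.0, 50.0))
after  <- data.frame(id = 1:5, value = c(10.0, 22.5, 30.0, 40.0, 55.0))

# Merge on the key, then inspect absolute differences directly
m <- merge(before, after, by = "id", suffixes = c("_before", "_after"))
m$abs_diff <- abs(m$value_after - m$value_before)
m[m$abs_diff > 0, ]   # rows with id 2 and 5 differ
```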

Filter with visibility

dplyr::filter() silently removes rows. You apply the condition, the rows vanish, and you move on. filter_keep() and filter_drop() do the same filtering but tell you exactly what was removed — and optionally warn you when too much is gone.

filter_keep

Keeps rows where the condition is TRUE (same as dplyr::filter()), with diagnostic output:

sales <- data.frame(
  id     = 1:10,
  amount = c(500, 25, 1200, 80, 3000, 15, 750, 40, 2000, 60),
  status = rep(c("valid", "suspect"), 5)
)

result <- filter_keep(sales, amount > 100, .stat = amount)
#> filter_keep(sales, amount > 100)
#> Dropped 5 of 10 rows (50.00%).
#> Dropped 220 of 7,670 for amount (2.87%).
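The two dropped shares are straightforward to reproduce by hand, which makes them easy to trust. A base-R sketch:

```r
sales <- data.frame(
  id     = 1:10,
  amount = c(500, 25, 1200, 80, 3000, 15, 750, 40, 2000, 60),
  status = rep(c("valid", "suspect"), 5)
)

keep <- sales$amount > 100
1 - mean(keep)                                 # 0.5: share of rows dropped
sum(sales$amount[!keep]) / sum(sales$amount)   # ~0.0287: share of amount dropped
```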

filter_drop

Drops rows where the condition is TRUE (the inverse):

result2 <- filter_drop(sales, status == "suspect", .stat = amount)
#> filter_drop(sales, status == "suspect")
#> Dropped 5 of 10 rows (50.00%).
#> Dropped 220 of 7,670 for amount (2.87%).

Warning thresholds

Set .warn_threshold to get a warning when too many rows are dropped — a safety net for production pipelines:

filter_keep(sales, amount > 1000, .stat = amount, .warn_threshold = 0.5)
#> filter_keep(sales, amount > 1000)
#> Dropped 7 of 10 rows (70.00%).
#> Dropped 1,470 of 7,670 for amount (19.17%).
#> Warning: Dropped 70.0% of rows exceeds threshold (50.0%).
#>   id amount status
#> 1  3   1200  valid
#> 2  5   3000  valid
#> 3  9   2000  valid

Diagnose data quality

You just received a dataset. Where are the gaps?

Missing values

diagnose_nas() reports NA counts and percentages for every column:

messy <- data.frame(
  id    = 1:6,
  name  = c("A", NA, "C", "D", NA, "F"),
  score = c(10, 20, NA, NA, 50, NA),
  grade = c("A", "B", "C", NA, "A", "B")
)

diagnose_nas(messy)
#> 
#> ── Missing Value Diagnosis ─────────────────────────────────────────────────────
#> 3 of 4 columns have missing values
#> 
#>   Variable  N NA  Pct NA
#>   ────────  ────  ──────
#>   score        3   50.0%
#>   name         2   33.3%
#>   grade        1   16.7%
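If you just want the raw numbers behind the report, base R has a one-liner for each:

```r
messy <- data.frame(
  id    = 1:6,
  name  = c("A", NA, "C", "D", NA, "F"),
  score = c(10, 20, NA, NA, 50, NA),
  grade = c("A", "B", "C", NA, "A", "B")
)

colSums(is.na(messy))           # id: 0, name: 2, score: 3, grade: 1
colMeans(is.na(messy)) * 100    # the same counts as percentages
```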

Column summaries

summarize_column() gives type-appropriate statistics for a single vector — quantiles for numeric, value counts for character, balance for logical:

summarize_column(c(1, 2, 3, NA, 5, 10, 100))
#>                type            n_unique             missing       missing_share 
#>           "numeric"                 "6"                 "1" "0.142857142857143" 
#>       most_frequent                mean                  sd                 min 
#>                  NA  "20.1666666666667"  "39.2398606861271"                 "1" 
#>                 q25                 q50                 q75                 max 
#>              "2.25"                 "4"              "8.75"               "100" 
#>            example1            example2            example3 
#>                 "1"                 "2"                 "3"
summarize_column(c("apple", "banana", "apple", "cherry", NA))
#>          type      n_unique       missing missing_share most_frequent 
#>   "character"           "3"           "1"         "0.2"       "apple" 
#>          mean            sd           min           q25           q50 
#>            NA            NA       "apple"            NA            NA 
#>           q75           max      example1      example2      example3 
#>            NA      "cherry"       "apple"      "banana"       "apple"

get_summary_table() applies this across all columns (or a selection):

get_summary_table(messy)
#>   variable      type n_unique missing     missing_share most_frequent
#> 1       id   numeric        6       0                 0          <NA>
#> 2     name character        4       2 0.333333333333333             A
#> 3    score   numeric        3       3               0.5          <NA>
#> 4    grade character        3       1 0.166666666666667             A
#>               mean               sd min  q25  q50  q75 max example1 example2
#> 1              3.5 1.87082869338697   1 2.25  3.5 4.75   6        1        2
#> 2             <NA>             <NA>   A <NA> <NA> <NA>   F        A        C
#> 3 26.6666666666667 20.8166599946613  10   15   20   35  50       10       20
#> 4             <NA>             <NA>   A <NA> <NA> <NA>   C        A        B
#>   example3
#> 1        3
#> 2        D
#> 3       50
#> 4        C

Frequency tables

Base R’s table() gives you raw counts. tab() gives you percentages, cumulative totals, sorting, cutoffs, weighting, and two-way crosstabulations.

One-way tables

tab(mtcars, cyl)
#> 
#> ── Tabulation: cyl ─────────────────────────────────────────────────────────────
#> 32 observations
#> 
#>   Value  Freq  Percent    Cum.
#>   ─────  ────  ───────  ──────
#>   4        11    34.4%   34.4%
#>   6         7    21.9%   56.3%
#>   8        14    43.8%  100.0%
#>   ─────  ────  ───────  ──────
#>   Total    32   100.0%
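Under the hood this is counts, shares, and a running total. When you want to double-check a figure, the same columns can be rebuilt in base R (display rounding may differ slightly from tab()'s):

```r
counts <- table(mtcars$cyl)
pct    <- 100 * prop.table(counts)

data.frame(
  Freq    = as.vector(counts),
  Percent = round(as.vector(pct), 1),
  Cum     = round(cumsum(as.vector(pct)), 1),
  row.names = names(counts)
)
```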

Sorting and cutoffs

# Sort by frequency
tab(mtcars, carb, .sort = "freq_desc")
#> 
#> ── Tabulation: carb ────────────────────────────────────────────────────────────
#> 32 observations
#> 
#>   Value  Freq  Percent    Cum.
#>   ─────  ────  ───────  ──────
#>   2        10    31.2%   31.2%
#>   4        10    31.2%   62.5%
#>   1         7    21.9%   84.4%
#>   3         3     9.4%   93.8%
#>   6         1     3.1%   96.9%
#>   8         1     3.1%  100.0%
#>   ─────  ────  ───────  ──────
#>   Total    32   100.0%

# Keep only top-2 values, collapse rest into (Other)
tab(mtcars, carb, .cutoff = 2)
#> 
#> ── Tabulation: carb ────────────────────────────────────────────────────────────
#> 32 observations
#> 
#>   Value    Freq  Percent    Cum.
#>   ───────  ────  ───────  ──────
#>   2          10    31.2%   31.2%
#>   4          10    31.2%   62.5%
#>   (Other)    12    37.5%  100.0%
#>   ───────  ────  ───────  ──────
#>   Total      32   100.0%
#> 
#> (Other) collapses 4 values below cutoff

Two-way crosstabulations

tab(mtcars, cyl, gear)
#> 
#> ── Crosstabulation: cyl × gear ─────────────────────────────────────────────────
#> 32 observations | Cell contents: count
#> 
#>   cyl     3   4  5  Total
#>   ─────  ──  ──  ─  ─────
#>   4       1   8  2     11
#>   6       2   4  1      7
#>   8      12   0  2     14
#>   ─────  ──  ──  ─  ─────
#>   Total  15  12  5     32

# Show row percentages instead of counts
tab(mtcars, cyl, gear, .display = "row_pct")
#> 
#> ── Crosstabulation: cyl × gear ─────────────────────────────────────────────────
#> 32 observations | Cell contents: row %
#> 
#>   cyl        3      4      5  Total
#>   ─────  ─────  ─────  ─────  ─────
#>   4       9.1%  72.7%  18.2%     11
#>   6      28.6%  57.1%  14.3%      7
#>   8      85.7%   0.0%  14.3%     14
#>   ─────  ─────  ─────  ─────  ─────
#>   Total     15     12      5     32

Weighted tabulation

tab(mtcars, cyl, .wt = mpg)
#> 
#> ── Tabulation: cyl (weighted by mpg) ───────────────────────────────────────────
#> 32 observations
#> 
#>   Value   Freq  Percent    Cum.
#>   ─────  ─────  ───────  ──────
#>   4      293.3    45.6%   45.6%
#>   6      138.2    21.5%   67.1%
#>   8      211.4    32.9%  100.0%
#>   ─────  ─────  ───────  ──────
#>   Total  642.9   100.0%
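A weighted frequency is just the group-wise sum of the weight column, which makes the table easy to verify:

```r
# Sum of mpg within each cyl group: 293.3, 138.2, 211.4
tapply(mtcars$mpg, mtcars$cyl, sum)
```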

String quality

Duplicates hiding behind case differences, leading spaces, and encoding issues are among the most common causes of failed joins and inflated group counts.

diagnose_strings() audits a character vector for these problems:

firms <- c("Apple", "APPLE", "apple", "  Microsoft ", "Google", NA, "")
diagnose_strings(firms)
#> 
#> ── String Column Diagnosis: firms ──────────────────────────────────────────────
#> Total elements: 7
#> 
#> Missing & Empty:
#> • NA values: 1 (14.3%)
#> • Empty strings: 1 (14.3%)
#> • Whitespace-only: 0 (0.0%)
#> 
#> Whitespace Issues:
#> • Leading whitespace: 1
#> • Trailing whitespace: 1
#> 
#> Encoding:
#> • Non-ASCII chars: 0
#> 
#> Case Inconsistencies:
#> • Variant groups: 1
#> • Total variants: 3
#> 
#> Case variant examples (up to 5 groups):
#>  lower n_variants            examples
#>  apple          3 Apple, APPLE, apple
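Each of these checks corresponds to a simple vectorized predicate. A base-R sketch for spot-checking a single issue:

```r
firms <- c("Apple", "APPLE", "apple", "  Microsoft ", "Google", NA, "")

sum(is.na(firms))                          # 1 NA
sum(firms == "", na.rm = TRUE)             # 1 empty string
sum(firms != trimws(firms), na.rm = TRUE)  # 1 value with stray whitespace

# Distinct values after normalizing case and whitespace
unique(tolower(trimws(firms[!is.na(firms) & firms != ""])))
# "apple" "microsoft" "google" -- the three real firms
```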

Audit your transformations

You’re about to apply trimws() to a column, or round() to a price vector. How many values will actually change? What will they look like afterward?

audit_transform() shows exactly what a transformation does. It automatically detects the vector type and computes type-appropriate diagnostics:

Character vectors

audit_transform(firms, trimws)
#> 
#> ── Transformation Audit [character]: firms ─────────────────────────────────────
#> Function: trimws
#> 
#>   Metric                       Value
#>   ──────────────  ──────────────────
#>   Total elements                   7
#>   NA (before)                      1
#>   NA (after)                       1
#>   Changed         1 (14.3% of total)
#>   Unchanged                        6
#> 
#> Examples of changes (showing 1 of 1):
#>        before     after
#>    Microsoft  Microsoft
#> 
#> Access cleaned vector with: `result$cleaned`
audit_transform(firms, tolower)
#> 
#> ── Transformation Audit [character]: firms ─────────────────────────────────────
#> Function: tolower
#> 
#>   Metric                       Value
#>   ──────────────  ──────────────────
#>   Total elements                   7
#>   NA (before)                      1
#>   NA (after)                       1
#>   Changed         4 (57.1% of total)
#>   Unchanged                        3
#> 
#> Examples of changes (showing 4 of 4):
#>        before        after
#>         Apple        apple
#>         APPLE        apple
#>    Microsoft    microsoft 
#>        Google       google
#> 
#> Access cleaned vector with: `result$cleaned`
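The headline "Changed" count is one comparison: elements that differ between input and output, with NAs excluded. A base-R sketch reproducing the tolower() numbers above:

```r
firms   <- c("Apple", "APPLE", "apple", "  Microsoft ", "Google", NA, "")
cleaned <- tolower(firms)

changed <- sum(firms != cleaned, na.rm = TRUE)
changed                    # 4, matching the report
length(firms) - changed    # 3 unchanged (NAs count as unchanged)
```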

Numeric vectors

prices <- c(10.456, 20.789, 30.123, NA, 50.999)
audit_transform(prices, round)
#> 
#> ── Transformation Audit [numeric]: prices ──────────────────────────────────────
#> Function: round
#> 
#>   Metric                       Value
#>   ──────────────  ──────────────────
#>   Total elements                   5
#>   NA (before)                      1
#>   NA (after)                       1
#>   Changed         4 (80.0% of total)
#>   Unchanged                        1
#> 
#> Numeric summary:
#>   Metric  Before  After
#>   ──────  ──────  ─────
#>   Mean     28.09     28
#>   SD       17.26  17.38
#>   Min      10.46     10
#>   Median   25.46   25.5
#>   Max         51     51
#>   NaN          0      0
#>   Inf          0      0
#> 
#> Mean absolute delta: 0.1977
#> Changed beyond tolerance: 4 (100.0%)
#> 
#> Examples of changes (showing 4 of 4):
#>  before after
#>  10.456    10
#>  20.789    21
#>  30.123    30
#>  50.999    51
#> 
#> Access cleaned vector with: `result$cleaned`

Date vectors

dates <- as.Date(c("2024-01-15", "2024-06-30", "2024-12-01", NA))
audit_transform(dates, function(d) d + 30)
#> 
#> ── Transformation Audit [Date]: dates ──────────────────────────────────────────
#> Function: function(d) d + 30
#> 
#>   Metric                       Value
#>   ──────────────  ──────────────────
#>   Total elements                   4
#>   NA (before)                      1
#>   NA (after)                       1
#>   Changed         3 (75.0% of total)
#>   Unchanged                        1
#> 
#> Date range:
#>   Metric           Before       After
#>   ───────────  ──────────  ──────────
#>   Min          2024-01-15  2024-02-14
#>   Max          2024-12-01  2024-12-31
#>   Span (days)       321.0       321.0
#> 
#> Examples of changes (showing 3 of 3):
#>      before      after
#>  2024-01-15 2024-02-14
#>  2024-06-30 2024-07-30
#>  2024-12-01 2024-12-31
#> 
#> Access cleaned vector with: `result$cleaned`

Factor vectors

sizes <- factor(c("S", "M", "L", "XL", "XXL", "S", "M"))
audit_transform(sizes, function(f) {
  levels(f)[levels(f) %in% c("XL", "XXL")] <- "XL+"
  f
})
#> 
#> ── Transformation Audit [factor]: sizes ────────────────────────────────────────
#> Function: function(f) {, levels(f)[levels(f) %in% c("XL", "XXL")] <- "XL+", f,
#> and }
#> 
#>   Metric                       Value
#>   ──────────────  ──────────────────
#>   Total elements                   7
#>   NA (before)                      0
#>   NA (after)                       0
#>   Changed         2 (28.6% of total)
#>   Unchanged                        5
#> 
#> Factor levels:
#> • Levels added: XL+
#> • Levels removed: XL, XXL
#> 
#> Level counts (before | after):
#>   Level  Before  After
#>   ─────  ──────  ─────
#>   <NA>        0      0
#>   L           1      1
#>   M           2      2
#>   S           2      2
#>   XL          1      0
#>   XL+         0      2
#>   XXL         1      0
#> 
#> Examples of changes (showing 2 of 2):
#>  before after
#>      XL   XL+
#>     XXL   XL+
#> 
#> Access cleaned vector with: `result$cleaned`

Logical vectors

flags <- c(TRUE, FALSE, TRUE, NA, FALSE)
audit_transform(flags, function(x) !x)
#> 
#> ── Transformation Audit [logical]: flags ───────────────────────────────────────
#> Function: function(x) !x
#> 
#>   Metric                       Value
#>   ──────────────  ──────────────────
#>   Total elements                   5
#>   NA (before)                      1
#>   NA (after)                       1
#>   Changed         4 (80.0% of total)
#>   Unchanged                        1
#> 
#> Logical balance:
#>   Value  Before  After
#>   ─────  ──────  ─────
#>   TRUE        2      2
#>   FALSE       2      2
#>   NA          1      1
#> 
#> Examples of changes (showing 4 of 4):
#>  before after
#>    TRUE FALSE
#>   FALSE  TRUE
#>    TRUE FALSE
#>   FALSE  TRUE
#> 
#> Access cleaned vector with: `result$cleaned`