---
title: "Diagnostic Functions Guide"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Diagnostic Functions Guide}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(tidyaudit)
library(dplyr)
```

tidyaudit includes tidyverse ports of the diagnostic functions from
[dtaudit](https://github.com/fpcordeiro/dtaudit). These functions help you
understand joins, validate keys, compare tables, diagnose missing values and
string quality, and filter with full visibility.

## Join diagnostics

`validate_join()` analyzes a potential join **without performing it**, reporting
match rates, relationship type, duplicate keys, and unmatched rows.

```{r validate-join}
orders <- data.frame(
  id     = c(1L, 2L, 3L, 3L, 4L, 5L),
  amount = c(100, 200, 150, 175, 300, 50)
)
customers <- data.frame(
  id   = c(2L, 3L, 6L),
  name = c("Alice", "Bob", "Carol")
)

validate_join(orders, customers, by = "id")
```
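For intuition, similar figures can be approximated by hand with dplyr's filtering joins. This is a rough sketch, not how `validate_join()` is implemented; the object names (`ord`, `cust`, `ord_dup_keys`, and so on) are illustrative only.

```{r manual-join-check}
library(dplyr)

ord <- data.frame(
  id     = c(1L, 2L, 3L, 3L, 4L, 5L),
  amount = c(100, 200, 150, 175, 300, 50)
)
cust <- data.frame(
  id   = c(2L, 3L, 6L),
  name = c("Alice", "Bob", "Carol")
)

# Rows of each table whose key has a partner in the other table
ord_matched  <- semi_join(ord, cust, by = "id")
cust_matched <- semi_join(cust, ord, by = "id")

# Rows with no partner in the other table
ord_only  <- anti_join(ord, cust, by = "id")
cust_only <- anti_join(cust, ord, by = "id")

# Keys that appear more than once in the left table
ord_dup_keys <- ord %>% count(id) %>% filter(n > 1)

# Share of rows in each table that would find a match
c(
  ord_match_rate  = nrow(ord_matched) / nrow(ord),
  cust_match_rate = nrow(cust_matched) / nrow(cust)
)
```

Because neither filtering join adds columns or duplicates rows, these checks are safe to run even when the real join would multiply rows.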

### Different key names

When the key columns have different names, use a named vector:

```{r validate-join-named}
products <- data.frame(prod_id = 1:3, price = c(10, 20, 30))
sales    <- data.frame(item_id = c(1L, 1L, 2L), qty = c(5, 3, 7))

validate_join(products, sales, by = c("prod_id" = "item_id"))
```

### Stat tracking

Track the impact on a numeric column with `stat` (same column name in both
tables) or `stat_x`/`stat_y` (different column names):

```{r validate-join-stat}
x <- data.frame(id = 1:4, revenue = c(100, 200, 300, 400))
y <- data.frame(id = c(2L, 3L, 5L), cost = c(10, 20, 30))

validate_join(x, y, by = "id", stat_x = "revenue", stat_y = "cost")
```
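The idea behind stat tracking can be illustrated by hand: how much of a numeric total sits in rows that would find a match? The sketch below uses plain dplyr with hypothetical object names and is not tidyaudit's own computation.

```{r manual-join-stat}
library(dplyr)

rev_tbl  <- data.frame(id = 1:4, revenue = c(100, 200, 300, 400))
cost_tbl <- data.frame(id = c(2L, 3L, 5L), cost = c(10, 20, 30))

# Revenue rows whose id also appears in the cost table
rev_matched <- semi_join(rev_tbl, cost_tbl, by = "id")

rev_total <- sum(rev_tbl$revenue)
rev_kept  <- sum(rev_matched$revenue)

# Total revenue, revenue in matched rows, and the matched share
c(total = rev_total, matched = rev_kept, share_matched = rev_kept / rev_total)
```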

## Key validation

### Primary keys

`validate_primary_keys()` tests whether a set of columns uniquely identifies
every row:

```{r validate-pk}
df <- data.frame(
  id    = c(1L, 2L, 3L, 3L, 4L),
  group = c("A", "A", "B", "C", "A"),
  value = c(10, 20, 30, 40, 50)
)

# Single column — not unique
validate_primary_keys(df, "id")

# Composite key — unique
validate_primary_keys(df, c("id", "group"))
```
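A manual version of the same check counts rows per key combination and keeps the combinations that occur more than once. This is a minimal dplyr sketch with illustrative names (`pk_demo`, `dup_id`, `dup_id_group`), not the package's implementation.

```{r manual-pk-check}
library(dplyr)

pk_demo <- data.frame(
  id    = c(1L, 2L, 3L, 3L, 4L),
  group = c("A", "A", "B", "C", "A"),
  value = c(10, 20, 30, 40, 50)
)

# Key values that occur in more than one row
dup_id <- pk_demo %>%
  count(id, name = "n_rows") %>%
  filter(n_rows > 1)

# Same check for the composite key
dup_id_group <- pk_demo %>%
  count(id, group, name = "n_rows") %>%
  filter(n_rows > 1)

# A candidate key is unique when no combination repeats
c(id_is_unique = nrow(dup_id) == 0, id_group_is_unique = nrow(dup_id_group) == 0)
```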

### Variable relationships

`validate_var_relationship()` determines the relationship between two columns:

```{r validate-var-rel}
df2 <- data.frame(
  dept    = c("Sales", "Sales", "Engineering", "Engineering"),
  manager = c("Ann", "Ann", "Bob", "Bob")
)
validate_var_relationship(df2, "dept", "manager")
```
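One way to reason about such a relationship by hand is to reduce the data to distinct value pairs and count how many partners each value has. The sketch below assumes that approach and uses hypothetical names; it does not describe what `validate_var_relationship()` does internally.

```{r manual-var-rel}
library(dplyr)

rel_demo <- data.frame(
  dept    = c("Sales", "Sales", "Engineering", "Engineering"),
  manager = c("Ann", "Ann", "Bob", "Bob")
)

# Distinct (dept, manager) pairs
rel_pairs <- distinct(rel_demo, dept, manager)

# Largest number of managers seen for one dept, and depts for one manager
max_mgr_per_dept <- rel_pairs %>% count(dept) %>% pull(n) %>% max()
max_dept_per_mgr <- rel_pairs %>% count(manager) %>% pull(n) %>% max()

c(max_mgr_per_dept = max_mgr_per_dept, max_dept_per_mgr = max_dept_per_mgr)
```

When both maxima equal 1, each distinct value on one side maps to exactly one value on the other, which is a one-to-one mapping between distinct values.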

## Table comparison

`compare_tables()` compares two data.frames by examining columns, row counts,
key overlap, and numeric discrepancies:

```{r compare-tables}
before <- data.frame(id = 1:5, value = c(10.0, 20.0, 30.0, 40.0, 50.0))
after  <- data.frame(id = 1:5, value = c(10.0, 22.5, 30.0, 40.0, 55.0))

compare_tables(before, after)
```
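To see where two versions of a table disagree, a hand-rolled comparison can join them on a key and subtract. The sketch below assumes `id` is the key and uses illustrative object names; it is not how `compare_tables()` works internally.

```{r manual-compare}
library(dplyr)

tbl_before <- data.frame(id = 1:5, value = c(10.0, 20.0, 30.0, 40.0, 50.0))
tbl_after  <- data.frame(id = 1:5, value = c(10.0, 22.5, 30.0, 40.0, 55.0))

# Pair rows by id, then keep the rows whose value changed
value_diffs <- inner_join(
  tbl_before, tbl_after,
  by = "id", suffix = c("_before", "_after")
) %>%
  mutate(diff = value_after - value_before) %>%
  filter(diff != 0)

value_diffs
```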

## Filter diagnostics

`filter_keep()` and `filter_drop()` filter data while printing diagnostics about
what was removed.

### filter_keep

Keeps rows where the condition is `TRUE` (same as `dplyr::filter()`):

```{r filter-keep}
sales <- data.frame(
  id     = 1:10,
  amount = c(500, 25, 1200, 80, 3000, 15, 750, 40, 2000, 60),
  status = rep(c("valid", "suspect"), 5)
)

result <- filter_keep(sales, amount > 100, .stat = amount)
```
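The bookkeeping behind this kind of diagnostic can be reproduced with a plain `dplyr::filter()` call and a few subtractions. This is a sketch with hypothetical names (`sales_demo`, `kept_rows`), not the package's code.

```{r manual-filter}
library(dplyr)

sales_demo <- data.frame(
  id     = 1:10,
  amount = c(500, 25, 1200, 80, 3000, 15, 750, 40, 2000, 60),
  status = rep(c("valid", "suspect"), 5)
)

kept_rows <- filter(sales_demo, amount > 100)

# Rows and total amount lost to the filter
n_dropped      <- nrow(sales_demo) - nrow(kept_rows)
amount_dropped <- sum(sales_demo$amount) - sum(kept_rows$amount)

c(n_dropped = n_dropped, amount_dropped = amount_dropped)
```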

### filter_drop

Drops rows where the condition is `TRUE` (the inverse of `filter_keep()`):

```{r filter-drop}
result2 <- filter_drop(sales, status == "suspect", .stat = amount)
```

### Warning thresholds

Set `.warn_threshold` to get a warning when too many rows are dropped:

```{r filter-warn, warning=TRUE}
filter_keep(sales, amount > 1000, .stat = amount, .warn_threshold = 0.5)
```
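A threshold check of this kind can be sketched by hand. The code below assumes the threshold is compared against the share of rows dropped, which is an assumption made for illustration and is not spelled out here; all object names are hypothetical.

```{r manual-warn-threshold}
library(dplyr)

sales_thr <- data.frame(
  id     = 1:10,
  amount = c(500, 25, 1200, 80, 3000, 15, 750, 40, 2000, 60)
)

threshold <- 0.5  # assumed to mean "share of rows dropped"

kept_thr      <- filter(sales_thr, amount > 1000)
share_dropped <- 1 - nrow(kept_thr) / nrow(sales_thr)

# TRUE means the filter removed more rows than the threshold allows
share_dropped > threshold
```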

## Data quality

### Missing value diagnosis

`diagnose_nas()` reports NA counts and percentages for every column:

```{r diagnose-nas}
messy <- data.frame(
  id    = 1:6,
  name  = c("A", NA, "C", "D", NA, "F"),
  score = c(10, 20, NA, NA, 50, NA),
  grade = c("A", "B", "C", NA, "A", "B")
)

diagnose_nas(messy)
```
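The underlying counts are easy to verify in base R, which can be handy for spot checks. This sketch uses an illustrative copy of the data (`na_demo`) and is independent of `diagnose_nas()`.

```{r manual-na-count}
na_demo <- data.frame(
  id    = 1:6,
  name  = c("A", NA, "C", "D", NA, "F"),
  score = c(10, 20, NA, NA, 50, NA),
  grade = c("A", "B", "C", NA, "A", "B")
)

# Number and percentage of missing values per column
na_counts <- colSums(is.na(na_demo))
na_pct    <- 100 * colMeans(is.na(na_demo))

data.frame(column = names(na_counts), n_na = na_counts, pct_na = na_pct,
           row.names = NULL)
```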

### Column summaries

`summarize_column()` gives type-appropriate statistics for a single vector:

```{r summarize-column}
summarize_column(c(1, 2, 3, NA, 5, 10, 100))
summarize_column(c("apple", "banana", "apple", "cherry", NA))
```

`get_summary_table()` applies this to all columns (or selected ones):

```{r get-summary-table}
get_summary_table(messy)
```

## String cleaning

These functions require the **stringi** package (listed in Suggests).

### diagnose_strings

`diagnose_strings()` audits a character vector for common quality issues:

```{r diagnose-strings, eval = requireNamespace("stringi", quietly = TRUE)}
firms <- c("Apple", "APPLE", "apple", "  Microsoft ", "Google", NA, "")
diagnose_strings(firms)
```
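A few string checks can be written directly in base R. The checks below are guesses at the kind of issue such an audit might look for (missing values, empty strings, padding, case variants); they are not a description of what `diagnose_strings()` reports, and the object names are illustrative.

```{r manual-string-checks}
firm_names <- c("Apple", "APPLE", "apple", "  Microsoft ", "Google", NA, "")
present    <- firm_names[!is.na(firm_names)]

n_missing <- sum(is.na(firm_names))          # NA entries
n_empty   <- sum(present == "")              # zero-length strings
n_padded  <- sum(present != trimws(present)) # leading or trailing whitespace

# Distinct values before and after trimming and lower-casing
n_distinct_raw   <- length(unique(present))
n_distinct_clean <- length(unique(tolower(trimws(present))))

c(missing = n_missing, empty = n_empty, padded = n_padded,
  distinct_raw = n_distinct_raw, distinct_clean = n_distinct_clean)
```

A drop from `distinct_raw` to `distinct_clean` suggests that some values differ only by case or surrounding whitespace.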

### audit_transform

`audit_transform()` shows exactly what a transformation function changes:

```{r audit-transform, eval = requireNamespace("stringi", quietly = TRUE)}
audit_transform(firms, trimws)
audit_transform(firms, tolower)
```
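The same question, which values does a transformation actually change, can be answered by hand with an element-wise comparison. This base R sketch uses hypothetical names (`raw_names`, `changed`) and is not how `audit_transform()` is implemented.

```{r manual-transform-diff}
raw_names <- c("Apple", "APPLE", "apple", "  Microsoft ", "Google", NA, "")
trimmed   <- trimws(raw_names)

# TRUE where a non-missing value differs after the transformation
changed <- !is.na(raw_names) & raw_names != trimmed

data.frame(before = raw_names[changed], after = trimmed[changed])
```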