---
title: "Automated Join Path Discovery"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Automated Join Path Discovery}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(data.table)
library(DBmaps)
```

## Introduction

In any data analysis project involving multiple tables, a primary task is to identify how these tables can be joined together. This typically involves matching foreign keys in one table to primary keys in another. The `map_join_paths()` function automates this discovery process, making it faster and less error-prone.

This function operates in two powerful modes:

1.  **Metadata-Driven Discovery**: It can find join paths based solely on a metadata definition, matching tables where a defined `grouping_variable` identically matches another table's `identifier_columns` (primary key). This is extremely fast and useful for well-documented schemas.

2.  **Data-Driven Discovery**: Optionally, by providing the actual data, the function can scan column values to find "inferred" joins where key names *do not* match. This is invaluable for exploring new or messy datasets.

## Case 1: Standard Joins via Metadata

First, let's define metadata for a standard retail scenario with `customers`, `products`, and `transactions` tables.

```{r setup_metadata}
# Define metadata for each table
customers_meta <- table_info("customers", "c.csv", "customer_id", list(list(OutcomeName="x",ValueExpression=1,AggregationMethods=list(list(AggregatedName="y",AggregationFunction="z",GroupingVariables="region")))))

products_meta <- table_info("products", "p.csv", "product_id", list(list(OutcomeName="x",ValueExpression=1,AggregationMethods=list(list(AggregatedName="y",AggregationFunction="z",GroupingVariables="category")))))

transactions_meta <- table_info("transactions", "t.csv", "trans_id", list(
  list(OutcomeName="rev", ValueExpression=1, AggregationMethods=list(
    # This grouping variable will match the primary key of 'customers'
    list(AggregatedName="a", AggregationFunction="sum", GroupingVariables="customer_id"),
    # This one will match the primary key of 'products'
    list(AggregatedName="b", AggregationFunction="sum", GroupingVariables="product_id")
  ))
))

# Combine into a master metadata object
master_meta <- rbindlist(list(customers_meta, products_meta, transactions_meta))
```

Now, we can find the join paths using only this metadata.

```{r metadata_only_run}
# Find paths without looking at the data
metadata_paths <- map_join_paths(master_meta)
print(metadata_paths)
```

The function correctly identifies two `METADATA` paths: `transactions` can be joined to `customers` on `customer_id`, and to `products` on `product_id`.

## Case 2: Multi-Variable Key Joins

The function also handles composite keys. Let's imagine a table `daily_promos` whose primary key is the combination of `product_id` and `region`.

```{r multi_key_setup}
daily_promos_meta <- table_info("daily_promos", "d.csv", c("product_id", "region"), list(list(OutcomeName="x",ValueExpression=1,AggregationMethods=list(list(AggregatedName="y",AggregationFunction="z",GroupingVariables="region")))))

# Add a grouping variable to transactions that matches this composite key
transactions_multi_meta <- table_info("transactions", "t.csv", "trans_id", list(
  list(OutcomeName="rev", ValueExpression=1, AggregationMethods=list(
    list(AggregatedName="promo_rev", AggregationFunction="sum", GroupingVariables=c("product_id", "region"))
  ))
))

multi_key_meta <- rbindlist(list(daily_promos_meta, transactions_multi_meta))
```

```{r multi_key_run}
multi_key_paths <- map_join_paths(multi_key_meta)
print(multi_key_paths)
```
The function correctly identifies the single, multi-variable join path.

## Case 3: Inferring Joins from Data

What if our data is messy and key names don't align? Consider an `inventory` table where the product key is called `sku`, and an `orders` table that refers to it as `product_code`.

```{r inferred_setup}
# Define the data
inventory_data <- data.table(sku = c("s1", "s2", "s3"), stock = c(10, 20, 5))
orders_data <- data.table(order_id = 1:2, customer_ref = "c1", product_code = c("s1", "s2"))

data_list <- list(
  inventory = inventory_data,
  orders = orders_data
)

# Define the metadata. Note the mismatched names.
inventory_meta <- table_info("inventory", "i.csv", "sku", list(list(OutcomeName="x",ValueExpression=1,AggregationMethods=list(list(AggregatedName="y",AggregationFunction="z",GroupingVariables="stock")))))
orders_meta <- table_info("orders", "o.csv", "order_id", list(list(OutcomeName="x",ValueExpression=1,AggregationMethods=list(list(AggregatedName="y",AggregationFunction="z",GroupingVariables="product_code")))))

inferred_meta <- rbindlist(list(inventory_meta, orders_meta))
```

Running `map_join_paths` with only metadata would fail to find a path here. But by providing the `data_list`, we enable the data-driven search.

```{r inferred_run}
inferred_paths <- map_join_paths(inferred_meta, data_list = data_list)
print(inferred_paths)
```
Success! The function scanned the values and found that `orders$product_code` can be joined to `inventory$sku`. It correctly marks this path as `INFERRED`.

## Combining Both Methods

The true power of the function is when it combines both methods, automatically de-duplicating and prioritizing explicit metadata joins over inferred ones. This provides a complete and reliable map of all possible connections in your data ecosystem.