--- title: "Automated Join Path Discovery" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Automated Join Path Discovery} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(data.table) library(DBmaps) ``` ## Introduction In any data analysis project involving multiple tables, a primary task is to identify how these tables can be joined together. This typically involves matching foreign keys in one table to primary keys in another. The `map_join_paths()` function automates this discovery process, making it faster and less error-prone. This function operates in two powerful modes: 1. **Metadata-Driven Discovery**: It can find join paths based solely on a metadata definition, matching tables where a defined `grouping_variable` identically matches another table's `identifier_columns` (primary key). This is extremely fast and useful for well-documented schemas. 2. **Data-Driven Discovery**: Optionally, by providing the actual data, the function can scan column values to find "inferred" joins where key names *do not* match. This is invaluable for exploring new or messy datasets. ## Case 1: Standard Joins via Metadata First, let's define metadata for a standard retail scenario with `customers`, `products`, and `transactions` tables. ```{r setup_metadata} # Define metadata for each table customers_meta <- table_info("customers", "c.csv", "customer_id", list(list(OutcomeName="x",ValueExpression=1,AggregationMethods=list(list(AggregatedName="y",AggregationFunction="z",GroupingVariables="region"))))) products_meta <- table_info("products", "p.csv", "product_id", list(list(OutcomeName="x",ValueExpression=1,AggregationMethods=list(list(AggregatedName="y",AggregationFunction="z",GroupingVariables="category"))))) transactions_meta <- table_info("transactions", "t.csv", "trans_id", list( list(OutcomeName="rev", ValueExpression=1, AggregationMethods=list( # This grouping variable will match the primary key of 'customers' list(AggregatedName="a", AggregationFunction="sum", GroupingVariables="customer_id"), # This one will match the primary key of 'products' list(AggregatedName="b", AggregationFunction="sum", GroupingVariables="product_id") )) )) # Combine into a master metadata object master_meta <- rbindlist(list(customers_meta, products_meta, transactions_meta)) ``` Now, we can find the join paths using only this metadata. ```{r metadata_only_run} # Find paths without looking at the data metadata_paths <- map_join_paths(master_meta) print(metadata_paths) ``` The function correctly identifies two `METADATA` paths: `transactions` can be joined to `customers` on `customer_id`, and to `products` on `product_id`. ## Case 2: Multi-Variable Key Joins The function also handles composite keys. Let's imagine a table `daily_promos` whose primary key is the combination of `product_id` and `region`. ```{r multi_key_setup} daily_promos_meta <- table_info("daily_promos", "d.csv", c("product_id", "region"), list(list(OutcomeName="x",ValueExpression=1,AggregationMethods=list(list(AggregatedName="y",AggregationFunction="z",GroupingVariables="region"))))) # Add a grouping variable to transactions that matches this composite key transactions_multi_meta <- table_info("transactions", "t.csv", "trans_id", list( list(OutcomeName="rev", ValueExpression=1, AggregationMethods=list( list(AggregatedName="promo_rev", AggregationFunction="sum", GroupingVariables=c("product_id", "region")) )) )) multi_key_meta <- rbindlist(list(daily_promos_meta, transactions_multi_meta)) ``` ```{r multi_key_run} multi_key_paths <- map_join_paths(multi_key_meta) print(multi_key_paths) ``` The function correctly identifies the single, multi-variable join path. ## Case 3: Inferring Joins from Data What if our data is messy and key names don't align? Consider an `inventory` table where the product key is called `sku`, and an `orders` table that refers to it as `product_code`. ```{r inferred_setup} # Define the data inventory_data <- data.table(sku = c("s1", "s2", "s3"), stock = c(10, 20, 5)) orders_data <- data.table(order_id = 1:2, customer_ref = "c1", product_code = c("s1", "s2")) data_list <- list( inventory = inventory_data, orders = orders_data ) # Define the metadata. Note the mismatched names. inventory_meta <- table_info("inventory", "i.csv", "sku", list(list(OutcomeName="x",ValueExpression=1,AggregationMethods=list(list(AggregatedName="y",AggregationFunction="z",GroupingVariables="stock"))))) orders_meta <- table_info("orders", "o.csv", "order_id", list(list(OutcomeName="x",ValueExpression=1,AggregationMethods=list(list(AggregatedName="y",AggregationFunction="z",GroupingVariables="product_code"))))) inferred_meta <- rbindlist(list(inventory_meta, orders_meta)) ``` Running `map_join_paths` with only metadata would fail to find a path here. But by providing the `data_list`, we enable the data-driven search. ```{r inferred_run} inferred_paths <- map_join_paths(inferred_meta, data_list = data_list) print(inferred_paths) ``` Success! The function scanned the values and found that `orders$product_code` can be joined to `inventory$sku`. It correctly marks this path as `INFERRED`. ## Combining Both Methods The true power of the function is when it combines both methods, automatically de-duplicating and prioritizing explicit metadata joins over inferred ones. This provides a complete and reliable map of all possible connections in your data ecosystem.