---
title: "`{DBI}`/`{dbplyr}` backend"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{`{DBI}`/`{dbplyr}` backend}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

## Motivation

Connecting to Databricks SQL Warehouses *typically* involves one of the methods below:

| Method | Advantages | Disadvantages |
|--------|------------|---------------|
| ODBC | Works with various authentication mechanisms, robust, acceptable performance. | Heavy driver that can be difficult to install on corporate devices. Very slow with larger write operations. |
| JDBC | No clear advantages over ODBC. | Slower than ODBC; requires Java. Very slow with larger write operations. |
| [Statement Execution API](https://docs.databricks.com/api/workspace/statementexecution) | Straightforward API that can handle data of all sizes. | No direct support for write operations. |
| Databricks SQL Connector (`databricks-sql-connector`) | Convenient to work with and good performance. | It's in Python, requires `{reticulate}`. |

That leaves a void for something that is:

- Native to R
- Easy to install
- Acceptable performance (or better)
- Handles bulk uploads efficiently
- Supports OAuth U2M (no tokens!)

`{brickster}` now provides both `{DBI}` and `{dbplyr}` backends for working with Databricks SQL warehouses. It uses the [Statement Execution API](https://docs.databricks.com/api/workspace/statementexecution) in combination with the [Files API](https://docs.databricks.com/api/workspace/files) (the latter specifically for large data uploads).

## Connecting to a Warehouse

Connecting to a Databricks SQL warehouse requires creating a `{DBI}` connection:

```{r setup}
library(brickster)
library(DBI)
library(dplyr)
library(dbplyr)

# Create connection to SQL warehouse
con <- dbConnect(
  drv = DatabricksSQL(),
  warehouse_id = "",
  catalog = "samples"
)
```
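Authentication is handled outside the connection call: `{brickster}` reads workspace details from the `DATABRICKS_HOST` and `DATABRICKS_TOKEN` environment variables (typically set in `~/.Renviron`). A minimal sketch, assuming a personal access token; as noted above, OAuth U2M is also supported, in which case no token is needed at all:

```{r}
# Sketch only: placeholder values shown for illustration - prefer setting
# these in ~/.Renviron rather than in code
Sys.setenv(
  DATABRICKS_HOST = "https://<workspace>.cloud.databricks.com",
  DATABRICKS_TOKEN = "<personal-access-token>" # can be omitted with OAuth U2M
)

# subsequent dbConnect() calls pick these up automatically
con <- dbConnect(
  drv = DatabricksSQL(),
  warehouse_id = "",
  catalog = "samples"
)
```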
## `{DBI}` Backend

### Reading Data

Execute SQL directly and return results as data frames:

```{r}
trips <- dbGetQuery(con, "SELECT * FROM samples.nyctaxi.trips LIMIT 10")
```

Tables can be referenced via `Id()`, `I()`, or using the name as-is:

```{r}
# List available tables
tables <- dbListTables(con)

# Check if specific table exists
dbExistsTable(con, "samples.nyctaxi.trips")
dbExistsTable(con, I("samples.nyctaxi.trips"))
dbExistsTable(con, Id("samples", "nyctaxi", "trips"))

# Get column information
dbListFields(con, "samples.nyctaxi.trips")
dbListFields(con, I("samples.nyctaxi.trips"))
dbListFields(con, Id("samples", "nyctaxi", "trips"))
```

### Writing Data

When writing data (append, overwrite, etc.) there are two possible behaviours:

1. In-line SQL statement write
2. Stage `.parquet` files to a Volume directory and `COPY INTO` or `CTAS`

For data larger than 50k rows `{brickster}` will only permit method (2), which requires `staging_volume` to be specified when establishing the connection, or passed directly to `dbWriteTable()`. Ensure that `staging_volume` is a valid Volume path and that you have permission to write files to it.

```{r}
# small data (150 rows)
# generates an in-line CTAS
dbWriteTable(
  conn = con,
  name = Id(catalog = "", schema = "", table = ""),
  value = iris,
  overwrite = TRUE
)

# bigger data (4 million rows)
# writes parquet files to volume then CTAS
iris_big <- sample_n(iris, replace = TRUE, size = 4000000)
dbWriteTable(
  conn = con,
  name = Id(catalog = "", schema = "", table = ""),
  value = iris_big,
  overwrite = TRUE,
  staging_volume = "/Volumes////...", # or inherited from connection
  progress = TRUE
)
```
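Appending rows to an existing table follows the same two write paths. A brief sketch using the standard `{DBI}` `append` argument (table identifiers left as placeholders):

```{r}
# append rows to an existing table: small appends use an in-line SQL
# statement, larger appends stage parquet files as described above
dbWriteTable(
  conn = con,
  name = Id(catalog = "", schema = "", table = ""),
  value = iris,
  append = TRUE
)
```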
"), value = iris_big, overwrite = TRUE, staging_volume = "/Volumes////..." # or inherited from connection progress = TRUE ) ``` ## `{dbplyr}` Backend ### Reading Data As in the `{DBI}` backend tables can be referenced either via `Id()`, `I()`, or using the name as-is in `tbl()`: ```{r} # Connect to existing tables tbl(con, "samples.nyctaxi.trips") tbl(con, I("samples.nyctaxi.trips")) tbl(con, Id("samples", "nyctaxi", "trips")) tbl(con, in_catalog("samples", "nyctaxi", "trips")) ``` Chain dplyr operations - they execute remotely on Databricks: ```{r} # Filter and select (translated to SQL) long_trips <- tbl(con, "samples.nyctaxi.trips") |> filter(trip_distance > 10) |> select( tpep_pickup_datetime, tpep_dropoff_datetime, trip_distance, fare_amount ) # View the generated SQL (without executing) show_query(long_trips) # Execute and collect results long_trips |> collect() ``` As a general reminder, call `collect()` at the latest point possible in your analysis to take reduce the required computation locally. ```{r} # Customer summary statistics trips_summary <- tbl(con, "samples.nyctaxi.trips") |> group_by(pickup_zip) %>% summarise( trip_count = n(), total_fare_amount = sum(fare_amount, na.rm = TRUE), total_trip_distance = sum(trip_distance, na.rm = TRUE), avg_fare_amount = mean(fare_amount, na.rm = TRUE) ) |> arrange(desc(avg_fare_amount)) # Execute to get the 20 most expensive pickip zip codes with more than 30 trips top_zipz <- trips_summary |> filter(trip_count > 20) |> head(20) |> collect() ``` ## Writing Data A key difference is that temporary tables are not supported - this makes functions like `copy_to` only usable when specifying `temporary` as `FALSE` which will use `dbWriteTable` to create a table. ```{r} iris_remote <- copy_to(con, iris, "iris_table", temporary = FALSE, overwrite = TRUE) ``` ## Connection Management The connection for the `DatabricksSQL` driver is different to other `{DBI}` backends as it doesn't have a persistent session (it's all just API calls). This means calling `dbDisconnect` serves no purpose when it comes to freeing resources on the SQL warehouse. ## `{WebR}` Support `{bricksters}` core dependencies are all [`{WebR}`](https://docs.r-wasm.org/webr/latest/) compatible. The backend uses `{nanoarrow}` as a fallback when `{arrow}` is unavailable (currently `{arrow}` is a `Suggests`). It's recommended to always have `{arrow}` installed to improve the performance of data loading.