---
title: "regextable"
author: "Shirlyn Dong"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{regextable}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  cache = FALSE,
  warning = FALSE,
  message = FALSE,
  tidy = FALSE,
  fig.align = 'center',
  fig.path = "man/figures/vignette-",
  R.options = list(width = 120)
)

# Format kable tables
kable <- function(x){
  if (knitr::is_latex_output()) {
    head(x, 5) |>
      knitr::kable(booktabs = TRUE, format = 'latex') |>
      kableExtra::kable_styling(latex_options = c("striped", "scale_down", "HOLD_position"))
  } else {
    head(x, 5) |>
      knitr::kable(escape = TRUE) |>
      kableExtra::kable_styling()
  }
}
```

## Introduction

`regextable` extracts regex-based pattern matches from a data frame or character vector using a pattern lookup table. For each input row, every matching pattern is returned together with the matched substring, an internal row identifier, and any additional columns requested through `data_return_cols` (columns of the input data) and `regex_return_cols` (metadata columns of the pattern table). A single text produces multiple output rows if it matches multiple patterns.

Load the package, along with `kableExtra` for table formatting:

```{r, message=FALSE}
library(regextable)
library(kableExtra)
```

## Data

For demonstration, we use two included datasets:

- `members`: A lookup table of regex patterns for member names.
- `cr2007_03_01`: A sample text dataset to search.

```{r}
data("members")
kable(members)

data("cr2007_03_01")
kable(subset(cr2007_03_01, select = -c(url, url_txt)))
```

## Text Cleaning

`extract()` cleans text by default, so you do not need to call `clean_text()` yourself. Cleaning standardizes spacing, punctuation, and capitalization, which makes regex pattern matching more reliable.

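
To see why this kind of normalization helps, consider a minimal base-R sketch. It is not the package's `clean_text()` implementation: `normalize_sketch()` is a hypothetical helper, and the specific steps (lowercasing, replacing punctuation with spaces, collapsing whitespace) are assumptions chosen only to illustrate the idea.

```{r}
# Hypothetical helper for illustration only; not part of regextable.
normalize_sketch <- function(x) {
  x <- tolower(x)                    # unify capitalization
  x <- gsub("[[:punct:]]+", " ", x)  # replace runs of punctuation with a space
  x <- gsub("[[:space:]]+", " ", x)  # collapse repeated whitespace
  trimws(x)                          # drop leading and trailing whitespace
}

raw_example <- "  Mr.  SMITH--of   Texas "
pattern_example <- "smith of texas"

# The raw string differs in case, punctuation, and spacing, so the pattern fails.
grepl(pattern_example, raw_example)

# After normalization the same pattern matches.
grepl(pattern_example, normalize_sketch(raw_example))
```

With normalized input, a pattern only has to describe one spelling of each name instead of every combination of case and punctuation.
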
Example of `clean_text()`:

```{r}
text <- " HELLO---WORLD "
clean_text(text)
```

## Basic Extraction

The simplest use of `extract()`:

```{r}
result <- extract(
  data = cr2007_03_01,
  regex_table = members,
  data_return_cols = c("text"),
  regex_return_cols = c("icpsr")
)
kable(head(result))
```

Explanation:

- `data`: the text dataset to search.
- `col_name`: the column of `data` that contains the text. It is not set in the call above, so its default is used.
- `regex_table`: the lookup table of patterns.
- `data_return_cols`: additional columns from `data` to include in the result.
- `regex_return_cols`: additional columns from the pattern table to attach.

Each row in the output corresponds to one detected match and includes both the original text and the matching pattern.

---

## Advanced Usage

`extract()` can also filter rows by date, skip acronym patterns (all-uppercase patterns of two or more characters), and restrict the output to specific columns. These options give finer control over what is matched and returned.

```{r, eval = FALSE}
result_advanced <- extract(
  data = cr2007_03_01,
  regex_table = members,
  date_col = "date",
  date_start = "2007-01-01",
  date_end = "2007-12-31",
  remove_acronyms = TRUE,
  data_return_cols = c("text"),
  regex_return_cols = c("icpsr")
)
kable(head(result_advanced))
```

Explanation:

- `date_col`, `date_start`, `date_end`: filter rows by date.
- `remove_acronyms`: skip patterns like "NASA" or "USA".

These filters can be combined with any choice of `data_return_cols` and `regex_return_cols`.

---

### Parallel Matching

`extract()` supports parallel processing via the `cl` parameter:

```{r, eval=FALSE}
library(parallel)

# Start a cluster with two workers
clust <- makeCluster(2)

result_parallel <- extract(
  data = cr2007_03_01,
  regex_table = members,
  cl = clust,
  data_return_cols = c("text"),
  regex_return_cols = c("icpsr")
)

# Release the workers once matching is done
stopCluster(clust)

head(result_parallel)
```

## Summary

- `regextable` matches text against a lookup table of regex patterns and returns one row per match.
- Use the included datasets (`members` and `cr2007_03_01`) to get started, or supply your own lookup table.
- `extract()` cleans text by default and performs the matching in a single call; pass a cluster via `cl` to match in parallel.
- Optional parameters allow advanced control over filtering and output.
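
As a closing illustration, the one-row-per-match output can be mimicked with a short base-R sketch. This is not how `extract()` is implemented; the `toy_patterns` table, the `toy_texts` vector, and all column names below are made up for illustration.

```{r}
# Toy lookup table: one regex per row plus a metadata column (made-up values).
toy_patterns <- data.frame(
  pattern = c("smith", "jones|johns"),
  id      = c(1L, 2L),
  stringsAsFactors = FALSE
)
toy_texts <- c("mr smith yields to mr jones", "no members here")

# For each pattern, record which texts it matches and the matched substring.
toy_matches <- do.call(rbind, lapply(seq_len(nrow(toy_patterns)), function(i) {
  hit  <- regexpr(toy_patterns$pattern[i], toy_texts, perl = TRUE)
  rows <- which(hit > 0)                     # indices of texts that matched
  data.frame(
    row_id  = rows,
    match   = regmatches(toy_texts, hit),    # matched substrings, matched texts only
    pattern = rep(toy_patterns$pattern[i], length(rows)),
    id      = rep(toy_patterns$id[i], length(rows)),
    stringsAsFactors = FALSE
  )
}))

# The first text appears once per matching pattern; the second text has no rows.
toy_matches
```
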