---
title: "regextable"
author: "Shirlyn Dong"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{regextable}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  cache = FALSE,
  warning = FALSE,
  message = FALSE,
  tidy = FALSE,
  fig.align = 'center',
  fig.path = "man/figures/vignette-",
  R.options = list(width = 120)
)

# Format kable tables
kable <- function(x){
  if (knitr::is_latex_output()) {
    head(x, 5) |>
      knitr::kable(booktabs = TRUE, format = 'latex') |>
      kableExtra::kable_styling(latex_options = c("striped", "scale_down", "HOLD_position"))
  } else {
    head(x, 5) |>
      knitr::kable(escape = TRUE) |>
      kableExtra::kable_styling()
  }
}
```


## Introduction

`regextable` extracts regex matches from a data frame or character vector using a lookup table of patterns. For each input row, `extract()` returns every pattern that matches, along with the matched substring, an internal row identifier, and any columns requested through `data_return_cols` (from the input data) and `regex_return_cols` (from the pattern table, for example metadata attached to each pattern). A single text yields multiple output rows if it matches multiple patterns.
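
To make the output shape concrete, here is a minimal base-R sketch of lookup-table matching. It is not the package's implementation; the toy `texts` and `patterns` objects and their column names (`pattern`, `id`, `row_id`, `match`) are assumptions made for illustration only.

```{r}
# Toy inputs (hypothetical; not part of regextable)
texts <- c("mr. smith met ms. jones", "no names here", "senator jones spoke")
patterns <- data.frame(
  pattern = c("smith", "jones"),
  id      = c(101, 202),
  stringsAsFactors = FALSE
)

# For each pattern, find the texts it matches and record the matched substring
matches <- do.call(rbind, lapply(seq_len(nrow(patterns)), function(i) {
  hit  <- regexpr(patterns$pattern[i], texts, perl = TRUE)
  rows <- which(hit > 0)
  data.frame(
    row_id  = rows,
    match   = regmatches(texts, hit),
    pattern = rep(patterns$pattern[i], length(rows)),
    id      = rep(patterns$id[i], length(rows)),
    stringsAsFactors = FALSE
  )
}))

# One row per (text, pattern) match, ordered by input row
matches <- matches[order(matches$row_id), ]
matches
```

The first text appears twice because it matches both patterns, and the second text does not appear at all because it matches none.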

Load the package, along with `kableExtra`, which this vignette uses to format tables:
```{r, message=FALSE}
library(regextable)
library(kableExtra)
```

## Data

For demonstration, we use two included datasets:

- `members`: A lookup table of regex patterns for member names.
- `cr2007_03_01`: A sample text dataset to search.

```{r}
data("members")
kable(members)

data("cr2007_03_01")
kable(subset(cr2007_03_01, select = -c(url, url_txt)))
```

## Text Cleaning

`extract()` cleans text by default, so you do not need to call `clean_text()` yourself. Cleaning standardizes spacing, punctuation, and capitalization, which makes regex matching more reliable.

Example of `clean_text()`:
```{r}
text <- "  HELLO---WORLD  "
clean_text(text)
```
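
To see why this kind of normalization helps, the sketch below uses a hypothetical base-R helper, `clean_sketch()`, that lowercases the text, replaces punctuation with spaces, and collapses whitespace. It only approximates the idea; the actual rules applied by `clean_text()` may differ.

```{r}
# Hypothetical stand-in for a cleaning step (not regextable's clean_text())
clean_sketch <- function(x) {
  x <- tolower(x)                    # standardize capitalization
  x <- gsub("[[:punct:]]+", " ", x)  # replace runs of punctuation with a space
  x <- gsub("\\s+", " ", x)          # collapse repeated whitespace
  trimws(x)                          # drop leading and trailing whitespace
}

raw <- "  HELLO---WORLD  "
cleaned <- clean_sketch(raw)

# A simple pattern misses the raw text but matches the cleaned text
grepl("hello world", raw)
grepl("hello world", cleaned)
```

Without a cleaning step, the pattern itself would have to account for case, punctuation, and irregular spacing.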

## Basic Extraction

The simplest use of `extract()`:

```{r}
result <- extract(
  data = cr2007_03_01,
  regex_table = members,
  data_return_cols = c("text"),
  regex_return_cols = c("icpsr")
)

kable(head(result))
```

Explanation:

- `data`: the text dataset to search.
- `col_name`: the column of `data` that contains the text (not set in the call above, so the default is used).
- `regex_table`: the lookup table of patterns.
- `data_return_cols`: additional columns from `data` to include in the result.
- `regex_return_cols`: additional columns from the pattern table to attach.

Each row in the output corresponds to one detected match and includes both the original text and the matching pattern.

---

## Advanced Usage

`extract()` can also filter the data by date, skip acronym patterns (all-uppercase patterns of two or more characters), and return only selected columns.

```{r, eval = FALSE}
result_advanced <- extract(
  data = cr2007_03_01,
  regex_table = members,
  date_col = "date",
  date_start = "2007-01-01",
  date_end = "2007-12-31",
  remove_acronyms = TRUE,
  data_return_cols = c("text"),
  regex_return_cols = c("icpsr")
)

kable(head(result_advanced))
```

Explanation:

- `date_col`, `date_start`, `date_end`: keep only rows whose date in `date_col` falls between `date_start` and `date_end`.
- `remove_acronyms`: skip acronym patterns such as "NASA" or "USA".

You can combine these filters with any subset of return columns.

---
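
The date and acronym filters described above can be pictured with plain base R. This is a sketch of the logic only, under the assumptions that dates are ISO-formatted strings, that the date range is inclusive, and that an acronym is a pattern made entirely of two or more uppercase letters. The toy objects `docs` and `pats` are hypothetical, and `extract()` may implement these steps differently.

```{r}
# Toy data (hypothetical)
docs <- data.frame(
  date = c("2006-12-31", "2007-03-01", "2008-01-01"),
  text = c("old record", "nasa budget hearing", "later record"),
  stringsAsFactors = FALSE
)
pats <- c("NASA", "USA", "Smith", "budget hearing", "A")

# Date filter: keep rows whose date falls within the (assumed inclusive) range
d <- as.Date(docs$date)
in_range  <- d >= as.Date("2007-01-01") & d <= as.Date("2007-12-31")
docs_kept <- docs[in_range, ]

# Acronym filter: drop patterns consisting only of 2+ uppercase letters
is_acronym <- grepl("^[A-Z]{2,}$", pats)
pats_kept  <- pats[!is_acronym]

docs_kept
pats_kept
```

Under this definition the single-letter pattern "A" is kept, because it is shorter than two characters.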

### Parallel Matching

`extract()` supports parallel processing via the `cl` parameter:

```{r, eval=FALSE}
library(parallel)
clust <- makeCluster(2)
result_parallel <- extract(
  data = cr2007_03_01,
  regex_table = members,
  cl = clust,
  data_return_cols = c("text"),
  regex_return_cols = c("icpsr")
)
stopCluster(clust)
head(result_parallel)
```
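
For readers new to the `parallel` package, the sketch below shows the general idea of distributing regex matching across a cluster with `parLapply()`. It is a base-R illustration, not a description of how `extract()` uses `cl` internally; the toy vectors `par_texts` and `par_pats` are hypothetical.

```{r, eval=FALSE}
library(parallel)

# Toy inputs (hypothetical)
par_texts <- c("smith and jones", "jones alone", "nobody")
par_pats  <- c("smith", "jones")

clust <- makeCluster(2)

# Each task checks one pattern against all texts; extra arguments to
# parLapply() are passed on to the worker function. The cluster is
# stopped even if matching fails.
hits <- tryCatch(
  parLapply(clust, par_pats, function(p, x) which(grepl(p, x)), x = par_texts),
  finally = stopCluster(clust)
)
names(hits) <- par_pats
hits
```

Wrapping the call in `tryCatch(..., finally = stopCluster(clust))` avoids leaving worker processes running if an error occurs.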

## Summary

- `regextable` matches text against a lookup table of regex patterns and returns one row per match.
- Use the included datasets to get started, or supply your own lookup tables.
- By default, `extract()` cleans the text before matching.
- Optional parameters control date filtering, acronym removal, returned columns, and parallel execution.