| Type: | Package |
| Title: | A Lightweight and Versatile NLP Toolkit |
| Version: | 1.1.0 |
| Maintainer: | Jason Timm <JaTimm@salud.unm.edu> |
| Description: | A lightweight toolkit for text retrieval and NLP with a consistent and predictable API organized around four actions: fetching, reading, processing, and searching. Functions cover the full pipeline from web data acquisition to text processing and indexing. Multiple search strategies are supported including regex, BM25 keyword ranking, cosine similarity, and dictionary matching. Pipe-friendly, with no heavy dependencies; all outputs are plain data frames. Also useful as a building block for retrieval-augmented generation pipelines and autonomous agent workflows. |
| License: | MIT + file LICENSE |
| Encoding: | UTF-8 |
| Depends: | R (≥ 3.5) |
| Imports: | data.table, httr, Matrix, rvest, stringi, stringr, xml2, pbapply, jsonlite, lubridate |
| Suggests: | SnowballC (≥ 0.7.0) |
| RoxygenNote: | 7.3.3 |
| URL: | https://github.com/jaytimm/textpress, https://jaytimm.github.io/textpress/ |
| BugReports: | https://github.com/jaytimm/textpress/issues |
| NeedsCompilation: | no |
| Packaged: | 2026-02-23 15:20:38 UTC; jtimm |
| Author: | Jason Timm [aut, cre] (year: 2026) |
| Repository: | CRAN |
| Date/Publication: | 2026-02-23 15:40:02 UTC |
textpress: A Lightweight and Versatile NLP Toolkit
Description
A lightweight NLP toolkit for R organized as a four-stage pipeline: fetch (URLs from search/Wikipedia), read (content from URLs), process (split, tokenize, index), and search (regex, BM25, vector similarity, dictionary). Uses verb_noun naming for discoverability. Minimal dependencies; embeddings are built elsewhere and passed in for semantic search.
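A minimal end-to-end sketch of the four stages (the query, the regex, and the per-node doc_id assignment are illustrative assumptions, not package defaults):
## Not run:
library(textpress)
urls_dt <- fetch_urls("quantum computing", n_pages = 1)    # fetch: get locations
nodes <- read_urls(urls_dt$url[1:3])                       # read: get text nodes
nodes$doc_id <- as.character(seq_len(nrow(nodes)))         # illustrative: one id per node
sentences <- nlp_split_sentences(nodes)                    # process: split into sentences
hits <- search_regex(sentences, query = "\\bquantum\\b",
                     by = c("doc_id", "sentence_id"))      # search
## End(Not run)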
Author(s)
Maintainer: Jason Timm JaTimm@salud.unm.edu (2026)
See Also
Useful links:
Report bugs at https://github.com/jaytimm/textpress/issues
Common Abbreviations for Linguistic Processing
Description
A named list containing common abbreviations used in text analysis.
Usage
abbreviations
Format
A named list with the following components:
abbreviations: A character vector of common abbreviations, including titles, months, and standard abbreviations.
Source
Internally compiled linguistic resource.
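A quick way to inspect the resource (structure as described under Format):
str(abbreviations, max.level = 1)     # named list with one component
head(abbreviations$abbreviations)     # first few abbreviation strings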
Demo dictionary of generation-name variants for NER
Description
A small dictionary of generational cohort terms (Greatest, Silent, Boomers,
Gen X, Millennials, Gen Z, Alpha, etc.) and spelling/variant forms, for use
with search_dict. Built into the package; no data() call is needed.
Usage
dict_generations
Format
A data frame with the following columns:
variant: surface form to match.
TermName: standardized label.
is_cusp: logical.
start, end: birth year range (Pew definitions where applicable; see https://github.com/jaytimm/AmericanGenerations/blob/main/data/pew-generations.csv).
Examples
head(dict_generations)
# use as term list: search_dict(corpus, by = "doc_id", terms = dict_generations$variant)
Demo dictionary of political / partisan term variants for NER
Description
A small dictionary of political party and ideology terms (Democrat, Republican,
MAGA, Liberal, Conservative, Christian Nationalist, White Supremacist, etc.)
and spelling/variant forms, for use with search_dict. Built into the package; no data() call is needed.
Usage
dict_political
Format
A data frame with columns variant (surface form to match) and TermName (standardized label).
Examples
head(dict_political)
# search_dict(corpus, by = "doc_id", terms = dict_political$variant)
Fetch URLs from a search engine
Description
Queries DuckDuckGo Lite and returns result URLs (no local text search).
Use read_urls to get content from these URLs.
Usage
fetch_urls(query, n_pages = 1, date_filter = "w")
Arguments
query: Search query string.
n_pages: Number of DDG Lite pages to fetch (default 1); roughly 30 results per page.
date_filter: Recency filter (default "w").
Value
A data.table with columns search_engine, url, is_excluded.
Examples
## Not run:
urls_dt <- fetch_urls("R programming nlp", n_pages = 1)
urls_dt$url
## End(Not run)
Fetch external citation URLs from Wikipedia
Description
Searches Wikipedia for a topic, then returns external citation URLs from
the first matching page's references section. Use read_urls
to scrape content from those URLs.
Usage
fetch_wiki_refs(query, n = 10)
Arguments
query: Search phrase (e.g. "January 6 Capitol attack").
n: Number of citation URLs to return (default 10).
Value
A character vector of external citation URLs (archived versions are preferred when present).
Examples
## Not run:
ref_urls <- fetch_wiki_refs("January 6 Capitol attack", n = 10)
articles <- read_urls(ref_urls)
## End(Not run)
Fetch Wikipedia page URLs by search query
Description
Uses the MediaWiki API to get Wikipedia article URLs matching a keyword.
Does not search your local corpus; it retrieves links from Wikipedia.
Use read_urls to get article content from these URLs.
Usage
fetch_wiki_urls(query, limit = 10)
Arguments
query: Search phrase (e.g. "117th Congress").
limit: Number of page URLs to return (default 10).
Value
A character vector of full Wikipedia article URLs.
Examples
## Not run:
wiki_urls <- fetch_wiki_urls("January 6 Capitol attack")
corpus <- read_urls(wiki_urls[1])
## End(Not run)
Get the search URL(s) used by fetch_urls (for debugging or browser use)
Description
Pages 2 and beyond require a POST request; only page 1 is a direct browser URL.
Usage
get_search_urls(query, n_pages = 1, date_filter = "w")
Arguments
query |
Search query string. |
n_pages |
Number of pages (informational for page 2+). |
date_filter |
Recency filter: |
Value
Named character vector of URLs.
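For example, to grab the page-1 URL for use in a browser (the query is illustrative):
urls <- get_search_urls("R programming nlp", n_pages = 2)
urls[1]  # page 1 is a direct browser URL; later pages require POST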
Convert Token List to Data Frame
Description
This function converts a list of tokens into a data frame, extracting and separating document and sentence identifiers if needed.
Usage
nlp_cast_tokens(tok)
Arguments
tok: A list where each element contains tokens corresponding to a document or a sentence.
Value
A data frame with one column for the unit identifier (the token list's element name) and one column for the token.
Examples
tok <- list(
tokens = list(
"1.1" = c("Hello", "world", "."),
"1.2" = c("This", "is", "an", "example", "."),
"2.1" = c("This", "is", "a", "party", "!")
)
)
token_df <- nlp_cast_tokens(tok)
Create a BM25 Search Index
Description
Create a BM25 Search Index
Usage
nlp_index_tokens(tokens, k1 = 1.2, b = 0.75, stem = FALSE)
Arguments
tokens |
A named list of character vectors. |
k1 |
BM25 saturation parameter. |
b |
BM25 length normalization. |
stem |
Logical; if TRUE, stems tokens. |
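A minimal sketch of building an index from tokenized text (the corpus is illustrative):
corpus <- data.frame(doc_id = c('1', '2'),
                     text = c("BM25 ranks documents by term frequency.",
                              "Cosine similarity compares embedding vectors."))
toks <- nlp_tokenize_text(corpus, by = 'doc_id', include_spans = FALSE)
index <- nlp_index_tokens(toks, k1 = 1.2, b = 0.75)
Pass the resulting index to search_index to rank units against a query.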
Roll units into fixed-size chunks with optional context
Description
Groups consecutive rows at the finest by level (e.g. sentences) into
fixed-size chunks and optionally adds surrounding context. Like a rolling
window over the leaf units.
Usage
nlp_roll_chunks(corpus, by, chunk_size, context_size)
Arguments
corpus: A data frame or data.table containing a text column.
by: A character vector of column names used as unique identifiers. The last column determines the search unit and is the level rolled into chunks (e.g., if by = c('doc_id', 'sentence_id'), sentences are rolled into chunks).
chunk_size: Integer. Number of units per chunk.
context_size: Integer. Number of units of context around each chunk.
Value
A data.table with chunk_id, chunk (concatenated text), and chunk_plus_context.
Examples
corpus <- data.frame(doc_id = c('1', '1', '2'),
sentence_id = c('1', '2', '1'),
text = c("Hello world.",
"This is an example.",
"This is a party!"))
chunks <- nlp_roll_chunks(corpus, by = c('doc_id', 'sentence_id'),
chunk_size = 2, context_size = 1)
Split Text into Paragraphs
Description
Splits text from the 'text' column of a data frame into individual paragraphs, based on a specified paragraph delimiter.
Usage
nlp_split_paragraphs(corpus, paragraph_delim = "\\n+")
Arguments
corpus: A data frame or data.table containing a text column.
paragraph_delim: A regular expression pattern used to split text into paragraphs.
Value
A data.table with columns: 'doc_id', 'paragraph_id', and 'text'. Each row represents a paragraph, along with its associated document and paragraph identifiers.
Examples
corpus <- data.frame(doc_id = c('1', '2'),
text = c("Hello world.\n\nMind your business!",
"This is an example.n\nThis is a party!"))
paragraphs <- nlp_split_paragraphs(corpus)
Split Text into Sentences
Description
This function splits text from a data frame into individual sentences based on specified columns and handles abbreviations effectively.
Usage
nlp_split_sentences(
corpus,
by = c("doc_id"),
abbreviations = textpress::abbreviations
)
Arguments
corpus: A data frame or data.table containing a text column.
by: A character vector of column names used as unique identifiers. The last column determines the search unit (e.g., if by = c('doc_id'), sentences are split within each document).
abbreviations: A character vector of abbreviations to handle during sentence splitting; defaults to textpress::abbreviations.
Value
A data.table with columns from by, plus sentence_id, text, start, end.
Examples
corpus <- data.frame(doc_id = c('1'),
text = c("Hello world. This is an example. No, this is a party!"))
sentences <- nlp_split_sentences(corpus)
Tokenize Text Data (mostly) Non-Destructively
Description
Tokenizes text from a corpus data frame, preserving structure like capitalization and punctuation.
Usage
nlp_tokenize_text(
corpus,
by = c("doc_id", "paragraph_id", "sentence_id"),
include_spans = TRUE,
method = "word"
)
Arguments
corpus: A data frame or data.table containing a text column.
by: A character vector of column names used as unique identifiers. The last column determines the search unit (e.g., if by = c('doc_id', 'sentence_id'), the search unit is the sentence).
include_spans: Logical. Include start/end character spans for each token.
method: Character. Tokenization method (default "word").
Value
A named list of tokens (or list of tokens and spans if include_spans = TRUE).
Examples
corpus <- data.frame(doc_id = c('1', '1', '2'),
sentence_id = c('1', '2', '1'),
text = c("Hello world.",
"This is an example.",
"This is a party!"))
tokens <- nlp_tokenize_text(corpus, by = c('doc_id', 'sentence_id'))
Read content from URLs
Description
Fetches each URL and returns a structured data frame (one row per node:
headings, paragraphs, lists). Like read_csv or read_html: bring
an external resource into R. Follows fetch_urls() or fetch_wiki_urls()
in the pipeline: fetch = get locations, read = get text.
Usage
read_urls(x, cores = 1, detect_boilerplate = TRUE, remove_boilerplate = TRUE)
Arguments
x: A character vector of URLs.
cores: Number of cores for parallel requests (default 1).
detect_boilerplate: Logical. Detect boilerplate (e.g. sign-up prompts, related links).
remove_boilerplate: Logical. If TRUE, detected boilerplate nodes are dropped from the output; otherwise they are flagged in the is_boilerplate column.
Details
Wikipedia is handled with high-fidelity selectors: div.mw-parser-output
and h2/h3/h4 hierarchy. Use parent_heading to see
which section each node belongs to. The “External links” section and
rows with empty text are omitted.
Value
A data frame with url, h1_title, date, type, node_id, parent_heading, text, and optionally is_boilerplate.
Examples
## Not run:
urls <- fetch_urls("R programming", n_pages = 1)$url
nodes <- read_urls(urls[1:3], cores = 1)
## End(Not run)
Exact n-gram matcher (vector of terms)
Description
Find a long list of multi-word expressions (MWEs) or terms without regex
overhead or partial-match risks. Tokenize corpus, build n-grams, then exact
join against terms. Word boundaries are respected by design. For
categories (e.g. term = "R Project", category = "Software"), left_join your
metadata onto the result using ngram or term as key.
Usage
search_dict(corpus, by = c("doc_id"), terms, n_min = 1, n_max = 5)
Arguments
corpus: The text data (a data frame or data.table with a text column).
by: Identifier columns (e.g. "doc_id").
terms: A character vector of terms/variants to find (e.g. dict_generations$variant).
n_min: Integer. Minimum n-gram size (default 1).
n_max: Integer. Maximum n-gram size (default 5).
Value
A data.table with id, start, end, n, ngram, term (the matched term from terms).
Examples
corpus <- data.frame(doc_id = "1", text = "Gen Z and Millennials use social media.")
search_dict(corpus, by = "doc_id", terms = c("Gen Z", "Millennials", "social media"))
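To attach categories as described above, merge metadata onto the result by the matched term (the metadata frame is illustrative):
meta <- data.frame(term = c("Gen Z", "Millennials"), category = "Generation")
hits <- search_dict(corpus, by = "doc_id", terms = meta$term)
merge(hits, meta, by = "term", all.x = TRUE)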
Search the BM25 Index
Description
Search the BM25 Index
Usage
search_index(index, query, n = 10, stem = FALSE)
Arguments
index: A data.table created by nlp_index_tokens.
query: A character string.
n: Number of results to return.
stem: Logical; must match the setting used during indexing.
Value
A data.table of results ranked by score.
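A minimal sketch chaining tokenization, indexing, and search (the corpus and query are illustrative):
corpus <- data.frame(doc_id = c('1', '2'),
                     text = c("BM25 is a keyword ranking function.",
                              "Vector search relies on cosine similarity."))
index <- nlp_index_tokens(nlp_tokenize_text(corpus, by = 'doc_id',
                                            include_spans = FALSE))
search_index(index, query = "keyword ranking", n = 5)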
Search corpus via regex
Description
Search corpus via regex
Usage
search_regex(corpus, query, by, highlight = c("<b>", "</b>"))
Arguments
corpus: A data frame or data.table with a text column.
query: The search pattern (regex).
by: Character vector of identifier columns.
highlight: Length-two character vector used to wrap matches (default c("<b>", "</b>")).
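Example (the corpus and pattern are illustrative):
corpus <- data.frame(doc_id = c('1', '2'),
                     text = c("Hello world.", "This is a party!"))
search_regex(corpus, query = "\\bparty\\b", by = "doc_id")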
Vector search by cosine similarity
Description
Returns the top-n matches from an embedding matrix for one or more query vectors.
Subject-first: embeddings (haystack) then query (needle), pipe-friendly.
Usage
search_vector(embeddings, query, n = 10)
Arguments
embeddings: A numeric or sparse matrix of embeddings (rows = searchable units).
query: A character string (a row name in embeddings) or a numeric vector/matrix of query embeddings.
n: Number of results to return per query (default 10).
Value
A data frame (or list of data frames if multiple queries are provided) containing the match identifiers and similarity scores.
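A minimal sketch with a toy embedding matrix; in practice the rows come from util_fetch_embeddings or another embedding source:
set.seed(1)
emb <- matrix(rnorm(20), nrow = 5,
              dimnames = list(paste0("sent_", 1:5), NULL))
search_vector(emb, query = "sent_1", n = 3)  # query by row name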
Fetch embeddings (Hugging Face utility)
Description
Fetch embeddings (Hugging Face utility)
Usage
util_fetch_embeddings(
corpus,
by,
api_token,
api_url = "https://router.huggingface.co/hf-inference/models/BAAI/bge-small-en-v1.5"
)
Arguments
corpus: A data frame or data.table with a text column.
by: Character vector of column names that identify each text unit.
api_token: Your Hugging Face API token.
api_url: The inference endpoint URL.
Value
A numeric matrix with row names derived from by.
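Example (requires a Hugging Face API token; the HF_API_TOKEN environment variable name is an assumption):
## Not run:
corpus <- data.frame(doc_id = c('1', '2'),
                     text = c("Hello world.", "This is an example."))
emb <- util_fetch_embeddings(corpus, by = "doc_id",
                             api_token = Sys.getenv("HF_API_TOKEN"))
search_vector(emb, query = emb[1, ], n = 2)  # semantic neighbors of the first unit
## End(Not run)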