--- title: "Getting Started with spell.replacer" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Getting Started with spell.replacer} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(spell.replacer) ``` # Introduction The `spell.replacer` package provides probabilistic spelling correction for character vectors in R. It uses the Jaro-Winkler string distance metric combined with word frequency data from the Corpus of Contemporary American English (COCA) to automatically correct misspelled words. ## Basic Usage The main function is `spell_replace()`, which takes a character vector and returns it with corrected spellings: ```{r basic_example} # Example text with misspellings text <- c("This is a smple text with some mispelled words.", "We can corect them automaticaly.") # Apply spell correction corrected_text <- spell_replace(text) print(corrected_text) ``` ## How It Works The package uses a two-step process: 1. **Identify misspelled words**: Uses the `hunspell` package to identify words not found in standard dictionaries 2. **Find corrections**: For each misspelled word, calculates Jaro-Winkler distance to words in the COCA frequency list and selects the best match ## Customizing Correction You can adjust the correction behavior with several parameters: ```{r custom_example} # More restrictive threshold (fewer corrections) conservative <- spell_replace(text, threshold = 0.08) # Ignore potential proper names text_with_names <- "John went to Bostan yesterday." corrected_names <- spell_replace(text_with_names, ignore_names = TRUE) print(corrected_names) ``` ## Single Word Correction You can also correct individual words using the `correct()` function: ```{r single_word} # Correct a single word corrected_word <- correct("recieve", coca_list) print(corrected_word) ``` ## Working with Dataframes One of the main benefits of `spell.replacer` is that it integrates seamlessly with tidyverse workflows. You can easily apply spell correction to entire columns of text data: ```{r dataframe_example, eval = FALSE} library(dplyr) # Example dataframe with text column docs <- data.frame( id = 1:3, text = c("This docment has misspellings.", "Anothr exmple with erors.", "The finl text sampel.") ) # Apply spell correction using tidy syntax docs %>% mutate(text = spell_replace(text)) ``` ### Performance The package processes approximately **1,000 words per second**, making it suitable for large-scale text processing tasks. For example: - A 100,000 word corpus would take about 1.7 minutes - A 1,000,000 word corpus would take about 16 minutes This makes `spell.replacer` practical for preprocessing large text datasets before analysis. ## Word Frequency Data The package includes the `coca_list` dataset with the 100,000 most frequent words from COCA: ```{r coca_data} # Most frequent words head(coca_list, 10) # Check if a word is in the list "hello" %in% coca_list # Find the frequency rank of a word which(coca_list == "hello") ```