README

This R package is designed to perform record linkage (also known as entity resolution) in unsupervised or supervised settings. It compares pairs of records from two datasets using selected comparison functions to estimate the probability or density ratio between matched and non-matched records. Based on these estimates, it predicts a set of matches that maximizes entropy.

Installation

install.packages("automatedRecLin")

# install.packages("pak") # uncomment if needed
pak::pkg_install("ncn-foreigners/automatedRecLin")

Basic usage

library(automatedRecLin)

Unsupervised maximum entropy classifier for record linkage

Generate two simple datasets that contain some common records, with typos in some cases.

df_1 <- data.frame(
  name = c("Emma", "Liam", "Olivia", "Noah", "Ava",
           "Ethan", "Sophia", "Mason", "Isabella", "James"),
  surname = c("Smith", "Johnson", "Williams", "Brown", "Jones",
              "Garcia", "Miller", "Davis", "Rodriguez", "Wilson"),
  city = c("New York", "Los Angeles", "Chicago", "Houston", "Phoenix",
           "Philadelphia", "San Antonio", "San Diego", "Dallas", "San Jose")
)
df_2 <- data.frame(
  name = c(
    "Emma", "Liam", "Olivia", "Noah",
    "Ava", "Ehtan", "Sopia", "Mson",
    "Charlotte", "Benjamin", "Amelia", "Lucas"
  ),
  surname = c(
    "Smith", "Johnson", "Williams", "Brown",
    "Jnes", "Garca", "Miler", "Dvis",
    "Martinez", "Lee", "Hernandez", "Clark"
  ),
  city = c(
    "New York", "Los Angeles", "Chicago", "Houston",
    "Phonix", "Philadelpia", "San Antnio", "San Dieg",
    "Seattle", "Miami", "Boston", "Denver"
  )
)
df_1
#>        name   surname         city
#> 1      Emma     Smith     New York
#> 2      Liam   Johnson  Los Angeles
#> 3    Olivia  Williams      Chicago
#> 4      Noah     Brown      Houston
#> 5       Ava     Jones      Phoenix
#> 6     Ethan    Garcia Philadelphia
#> 7    Sophia    Miller  San Antonio
#> 8     Mason     Davis    San Diego
#> 9  Isabella Rodriguez       Dallas
#> 10    James    Wilson     San Jose
df_2
#>         name   surname        city
#> 1       Emma     Smith    New York
#> 2       Liam   Johnson Los Angeles
#> 3     Olivia  Williams     Chicago
#> 4       Noah     Brown     Houston
#> 5        Ava      Jnes      Phonix
#> 6      Ehtan     Garca Philadelpia
#> 7      Sopia     Miler  San Antnio
#> 8       Mson      Dvis    San Dieg
#> 9  Charlotte  Martinez     Seattle
#> 10  Benjamin       Lee       Miami
#> 11    Amelia Hernandez      Boston
#> 12     Lucas     Clark      Denver

Specify the key variables used for record linkage. Select a comparison function (i.e., a function to compare pairs of records) for each variable. For example, use jarowinkler_complement() from the automatedRecLin package (the Jaro-Winkler distance, i.e., 1 - Jaro-Winkler similarity). Choose a method for estimating the probability or density ratio for each variable. The available methods are: "binary", "continuous_parametric", "continuous_nonparametric", and "hit_miss" (only for unsupervised learning).

variables <- c("name", "surname", "city")
comparators <- list(
  "name" = jarowinkler_complement(),
  "surname" = jarowinkler_complement(),
  "city" = jarowinkler_complement()
)
methods <- list(
  "name" = "continuous_parametric",
  "surname" = "continuous_parametric",
  "city" = "continuous_parametric"
)

Perform record linkage using mec(). The output contains the following information:

set.seed(1)
unsup_result <- mec(A = df_1, B = df_2,
                    variables = variables,
                    comparators = comparators,
                    methods = methods)
unsup_result
#> Record linkage based on the following variables: name, surname, city.
#> ========================================================
#> The algorithm predicted 8 matches.
#> The first 6 predicted matches are:
#>        a     b ratio / 1000
#>    <int> <int>        <num>
#> 1:     6     6 1.433031e+08
#> 2:     8     8 3.198692e+07
#> 3:     7     7 9.673745e+05
#> 4:     5     5 2.813745e+04
#> 5:     1     1 3.375000e+00
#> 6:     2     2 3.375000e+00
#> ========================================================
#> The construction of the classification set was based on estimates of its size.
#> Estimated false link rate (FLR): 0.2066 %.
#> Estimated missing match rate (MMR): 0.2066 %.
#> ========================================================
#> Variables selected for the continuous parametric method: name, surname, city.
#> Estimated parameters for the continuous parametric method:
#>         variable p_0_M    alpha_M   beta_M      p_0_U  alpha_U    beta_U
#>           <char> <num>      <num>    <num>      <num>    <num>     <num>
#> 1:    gamma_name 0.625 138.462279 2199.107 0.04166667 6.516736 11.173089
#> 2: gamma_surname 0.500 120.665706 1974.530 0.03333333 4.622775  7.167261
#> 3:    gamma_city 0.500   6.512723  135.163 0.03333333 5.233194  9.313035

Supervised maximum entropy classifier for record linkage

Generate two simple training datasets that contain some common records, with typos in some cases.

df_1_train <- data.frame(
        name = c(
          "James", "Emma", "William", "Olivia", "Thomas",
          "Sophie", "Harry", "Amelia", "George", "Isabella"
          ),
        surname = c(
          "Smith", "Johnson", "Brown", "Taylor", "Wilson",
          "Davis", "Clark", "Harris", "Lewis", "Walker"
          )
)
df_2_train <- data.frame(
        name = c(
          "James", "Ema", "Wimliam", "Olivia", "Charlotte",
          "Henry", "Lucy", "Edward", "Alice", "Jack"
          ),
        surname = c(
          "Smith", "Johnson", "Bron", "Tailor", "Moore",
          "Evans", "Hall", "Wright", "Green", "King"
          )
)
df_1_train
#>        name surname
#> 1     James   Smith
#> 2      Emma Johnson
#> 3   William   Brown
#> 4    Olivia  Taylor
#> 5    Thomas  Wilson
#> 6    Sophie   Davis
#> 7     Harry   Clark
#> 8    Amelia  Harris
#> 9    George   Lewis
#> 10 Isabella  Walker
df_2_train
#>         name surname
#> 1      James   Smith
#> 2        Ema Johnson
#> 3    Wimliam    Bron
#> 4     Olivia  Tailor
#> 5  Charlotte   Moore
#> 6      Henry   Evans
#> 7       Lucy    Hall
#> 8     Edward  Wright
#> 9      Alice   Green
#> 10      Jack    King

Specify the key variables, select comparison functions and choose methods for estimating the probability or density ratio. Additionally, provide a data.frame or data.table indicating known matches.

variables_train <- c("name", "surname")
comparators_train <- list("name" = jarowinkler_complement(),
                          "surname" = jarowinkler_complement())
methods_train <- list("name" = "continuous_nonparametric",
                      "surname" = "continuous_nonparametric")
matches_train <- data.frame("a" = 1:4, "b" = 1:4)

model <- train_rec_lin(A = df_1_train, B = df_2_train,
                       matches = matches_train,
                       variables = variables_train,
                       comparators = comparators_train,
                       methods = methods_train)
#> Warning in check.sigma(nsigma, sigma_quantile, sigma, dist_nu): There are duplicate values in 'sigma', only the unique values are used.
#> Warning in check.sigma(nsigma, sigma_quantile, sigma, dist_nu): There are duplicate values in 'sigma', only the unique values are used.
model
#> Record linkage model based on the following variables: name, surname.
#> The prior probability of matching is 0.04.
#> ========================================================
#> Variables selected for the continuous nonparametric method: name, surname.
#> ========================================================
#> Probability/density ratio type: 1.

df_1_new <- data.frame(
  "name" = c("John", "Emily", "Mark", "Anna", "David"),
  "surname" = c("Smith", "Johnson", "Taylor", "Williams", "Brown")
)
df_2_new <- data.frame(
  "name" = c("John", "Emely", "Mark", "Michael"),
  "surname" = c("Smitth", "Johnson", "Tailor", "Henders")
)
df_1_new
#>    name  surname
#> 1  John    Smith
#> 2 Emily  Johnson
#> 3  Mark   Taylor
#> 4  Anna Williams
#> 5 David    Brown
df_2_new
#>      name surname
#> 1    John  Smitth
#> 2   Emely Johnson
#> 3    Mark  Tailor
#> 4 Michael Henders

Predict matches using the predict() method for rec_lin_model objects. The output has a similar structure to that of mec().

result_sup <- predict(model, df_1_new, df_2_new)
result_sup
#> The algorithm predicted 3 matches.
#> The first 3 predicted matches are:
#>        a     b ratio / 1000
#>    <int> <int>        <num>
#> 1:     2     2   0.04761403
#> 2:     3     3   0.02929836
#> 3:     1     1   0.02665718
#> ========================================================
#> The construction of the classification set was based on estimates of its size.
#> Estimated false link rate (FLR): 0.0000 %.
#> Estimated missing match rate (MMR): 0.0000 %.

Integration with a custom machine learning model

The automatedRecLin package supports supervised record linkage using a custom machine learning (ML) model that predicts the probability of matching based on comparison vectors (e.g., XGBoost, logistic regression). For example, install and load the xgboost package.

# install.packages("xgboost") # uncomment if needed
library(xgboost)

Use the same data, variables, and comparators as in the previous example. First, use comparison_vectors() to create comparison vectors that the model will be trained on.

vectors <- comparison_vectors(A = df_1_train, B = df_2_train,
                              variables = variables_train,
                              comparators = comparators_train,
                              matches = matches_train)
vectors
#> Comparison based on the following variables: name, surname.
#> ========================================================
#>        a     b gamma_name gamma_surname match
#>    <int> <int>      <num>         <num> <num>
#> 1:     1     1  0.0000000     0.0000000     1
#> 2:     1     2  0.4777778     0.5523810     0
#> 3:     1     3  0.5523810     1.0000000     0
#> 4:     1     4  1.0000000     0.5444444     0
#> 5:     1     5  0.5629630     1.0000000     0
#> 6:     1     6  1.0000000     1.0000000     0

model_xgb <- xgboost(x = as.matrix(vectors$Omega[, c("gamma_name", "gamma_surname")]),
                     y = factor(vectors$Omega$match),
                     objective = "binary:logistic", eval_metric = "logloss",
                     nrounds = 100, verbosity = 0, nthread = 1)

custom_xgb_model <- custom_rec_lin_model(model_xgb, vectors)
custom_xgb_model
#> Record linkage model based on the following variables: name, surname.
#> A custom ML model was used.
#> The prior probability of matching is 0.04.
#> ========================================================
#> Probability/density ratio type: 2.

Use the model for predictions. Note that the xgboost package requires a matrix as input for the predict() method for rec_lin_model objects and that needs to be specified in the data_type argument. Set type = "response" to ensure the XGBoost model predicts the probability of matching (this argument may vary depending on the model or library used).

result_xgb <- predict(custom_xgb_model, df_1_new, df_2_new,
                      data_type = "matrix", type = "response")
result_xgb
#> The algorithm predicted 3 matches.
#> The first 3 predicted matches are:
#>        a     b ratio / 1000
#>    <int> <int>        <num>
#> 1:     1     1    0.0299477
#> 2:     2     2    0.0299477
#> 3:     3     3    0.0299477
#> ========================================================
#> The construction of the classification set was based on estimates of its size.
#> Estimated false link rate (FLR): 15.9112 %.
#> Estimated missing match rate (MMR): 15.9112 %.

Integration with the blocking package

The automatedRecLin R package can be integrated with the blocking package. For details, see the vignette entitled MEC with blocking.

Funding

Work on this package is supported by the National Science Centre, OPUS 20 grant no. 2020/39/B/HS4/00941 (Towards census-like statistics for foreign-born populations – quality, data integration and estimation).

The `automatedRecLin` Package

Description

Installation

Basic usage

Unsupervised maximum entropy classifier for record linkage

Supervised maximum entropy classifier for record linkage

Integration with a custom machine learning model

Integration with the `blocking` package

Funding

References

The automatedRecLin Package

Description

Installation

Basic usage

Unsupervised maximum entropy classifier for record linkage

Supervised maximum entropy classifier for record linkage

Integration with a custom machine learning model

Integration with the blocking package

Funding

References

The `automatedRecLin` Package

Integration with the `blocking` package