| Type: | Package | 
| Title: | Persistent Data Anonymization Pipeline | 
| Version: | 1.0.0 | 
| Description: | A framework for the replicable removal of personally identifiable data (PID) in data sets. The package implements a suite of methods to suit different data types based on the suggestions of Garfinkel (2015) <doi:10.6028/NIST.IR.8053> and the ICO "Guidelines on Anonymization" (2012) https://ico.org.uk/media/1061/anonymisation-code.pdf. | 
| License: | MIT + file LICENSE | 
| Encoding: | UTF-8 | 
| LazyData: | true | 
| RoxygenNote: | 7.2.1 | 
| Suggests: | testthat (≥ 3.0.0), checkmate, knitr, rmarkdown, roxygen2 | 
| Imports: | R6, dplyr, openssl, tidyselect, rlang (≥ 0.4.11), glue, purrr, stringr, yaml, readr, openxlsx, lemon, withr, fs | 
| Depends: | R (≥ 2.10) | 
| VignetteBuilder: | knitr | 
| Language: | en-GB | 
| NeedsCompilation: | no | 
| Packaged: | 2024-11-18 11:36:38 UTC; rzc1 | 
| Author: | Robert Cook | 
| Maintainer: | Robert Cook <robert.cook@staffs.ac.uk> | 
| Repository: | CRAN | 
| Date/Publication: | 2024-11-19 09:50:06 UTC | 
Base class for all De-identifier classes
Description
Create new Deidentifier object
Setter for 'method' field
Save 'Deidentifier' to serialized object.
Apply 'method' to a vector of values
Apply 'method' to variables in a data frame
Apply 'mutate' method to an aggregated data frame.
Aggregate a data frame and apply 'mutate' to each.
Convert self to a list
String representation of self
Check if parameters are in allowed fields
Arguments
| method | New function to be used as the method. | 
| location | File path to save to. | 
| keys | Vector of values to be processed | 
| force | Perform transformation on all variables even if some given are not in the data. | 
| grouped_data | a 'grouped_df' object | 
| data | A data frame to be manipulated | 
| grp_cols | Vector of variables in 'data' to group on. | 
| mutate_cols | Vector of variables in 'data' to transform. | 
| type | character vector describing the object. Defaults to class. | 
| ... | Options to check exist | 
Fields
- method
- Function to call for data transform. 
Deidentifier class for applying 'blur' transform
Description
Convert self to a list.
Arguments
| blur | Look-up list to define aggregation. | 
| keys | Vector of values to be processed | 
| ... | Values to be concatenated to keys | 
Details
'Blurring' refers to aggregation of data, e.g. converting city to country, or post code to IMD. The level of blurring is defined by the list given at initialization, which maps key to value, e.g. list(London = "England", Paris = "France").
Value
Blurer
Apply blur to a vector of values
Fields
- blur
- List of aggregations to be applied.
Create new Blurer object
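As an illustration (not taken from the package manual), a Blurer can be used directly; the constructor argument and the transform() method are assumed from the Arguments and Fields listed above.
# Hedged sketch: exact signatures may differ.
blurer <- Blurer$new(blur = list(London = "England", Paris = "France"))
blurer$transform(c("London", "Paris", "London"))
# expected to return "England", "France", "England"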
R6 class for the removal of variables from a pipeline
Description
A Deident class dealing with the exclusion of variables.
Deidentifier class for applying 'encryption' transform
Description
Create new Encrypter object
Convert self to a list.
Arguments
| hash_key | An alphanumeric key for use in encryption | 
| seed | An alphanumeric value concatenated to the raw data to mitigate brute-force attacks | 
| keys | Vector of values to be processed | 
| ... | Values to be concatenated to keys | 
Details
'Encrypting' refers to the cryptographic hashing of data, e.g. an md5 checksum. Encryption is more powerful if a random hash_key and seed are supplied and kept secret.
Value
Encrypter
Apply encryption to a vector of values
Fields
- hash_key
- Alpha-numeric secret key for encryption 
- seed
- String for concatenation to raw value 
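A minimal sketch (assumed usage, mirroring the Perturber example below): construct an Encrypter with a secret hash_key and seed, then hash a vector of values.
# Hedged sketch: argument names follow the Arguments table above.
enc <- Encrypter$new(hash_key = "hash1", seed = "Seed2")
enc$transform(c("Alice", "Bob"))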
GroupedShuffler class for applying 'shuffling' transform with data aggregated
Description
Convert self to a list.
Character representation of the class
Arguments
| limit | Minimum number of rows required to shuffle data | 
| data | A data frame to be manipulated | 
| ... | Vector of variables in 'data' to transform. | 
Details
'Shuffling' refers to the random sampling of a variable without replacement, e.g. [A, B, C] becoming [B, A, C] but not [A, A, B]. "Grouped shuffling" refers to aggregating the data by another feature before applying the shuffling process. Grouped shuffling preserves aggregate-level metrics (e.g. mean, median, mode) but removes ordinal properties, i.e. correlations and auto-correlations.
Fields
- group_on
- Symbolic representation of grouping variables 
- limit
- Minimum number of rows required to shuffle data
Create new GroupedShuffler object
Group numeric data into baskets
Description
Group numeric data into baskets
R6 class for deidentification via random noise
Description
A Deident class dealing with the addition of random noise to a
numeric variable.
Create new Perturber object
Apply noise to a vector of values
Convert self to a list.
Character representation of the class
Arguments
| noise | a single-argument function that applies randomness. | 
| keys | Vector of values to be processed | 
| ... | Values to be concatenated to keys | 
Fields
- noise.str
- character representation of noise
- method
- random noise function 
Examples
  pert <- Perturber$new()
  pert$transform(1:10)
R6 class for deidentification via replacement
Description
A Deident class dealing with the (repeatable) random replacement of
string for deidentification.
Create new Pseudonymizer object
Check if a key exists in lookup
Check if a key exists in lookup
Retrieve a value from lookup
Returns self$lookup formatted as a tibble
Convert self to a list.
Apply the deidentification method to the supplied keys
Arguments
| lookup | a pre-existing name-value pair to define intended pseudonymizations. Instances of 'name' will be replaced with 'value' on transformation. | 
| keys | value to be checked | 
| ... | Values to be concatenated to keys | 
| parse_numerics | If TRUE, force columns to characters. NB: only character vectors will be parsed. | 
Fields
- lookup
- list mapping keys to values on transform. 
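A minimal sketch (assumed usage; the 'lookup' argument follows the Arguments table above):
ps <- Pseudonymizer$new(lookup = list("Kyle Wilson" = "Kyle"))
ps$transform(c("Kyle Wilson", "Ana Smith", "Kyle Wilson"))
# "Kyle Wilson" is always mapped to "Kyle"; unseen keys receive a random
# string that is reused whenever the same key appears again.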
Synthetic data set listing daily shift pattern for fictitious employees
Description
A synthetic data set intended to demonstrate the design and application of a
deidentification pipeline.  Employee names are entirely fictitious and constructed
from the
FiveThirtyEight Most Common Name Dataset.
Usage
ShiftsWorked
Format
A data frame with 3,100 rows and 7 columns:
- Record ID
- Table primary key (integer) 
- Employee
- Name of listed employee 
- Date
- The date being considered 
- Shift
- The shift-type done by employee on date. One of 'Day', 'Night' or 'Rest'. 
- Shift Start
- Shift start time (missing if on 'Rest' shift) 
- Shift End
- Shift end time (missing if on 'Rest' shift) 
- Daily Pay
- Pay received for the shift 
Shuffler class for applying 'shuffling' transform
Description
Create new Shuffler object
Update minimum vector size for shuffling
Apply the deidentification method to the supplied keys
Convert self to a list.
Arguments
| method | [optional] A function representing the method of re-sampling to be used. By default uses exhaustive sampling without replacement. | 
| keys | Value(s) to be transformed. | 
| ... | Value(s) to be concatenated to keys | 
| limit | integer - the minimum number of observations a variable needs to have for shuffling to be performed. If the variable has length less than 'limit', the data is replaced with NAs. | 
Details
'Shuffling' refers to the random sampling of a variable without replacement, e.g. [A, B, C] becoming [B, A, C] but not [A, A, B]. Shuffling preserves top-level metrics (e.g. mean, median, mode) but removes ordinal properties, i.e. correlations and auto-correlations.
Fields
- limit
- minimum vector length to be shuffled. If vector to be transformed has length < limit, the data is replaced with NAs 
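A minimal sketch (assumed usage; 'limit' is assumed to be accepted at construction):
sh <- Shuffler$new()
sh$transform(LETTERS[1:5])          # a random permutation of A-E
sh.short <- Shuffler$new(limit = 10)
sh.short$transform(LETTERS[1:5])    # fewer than 10 values, so NAs are returned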
Function factory to apply white noise to a vector proportional to the spread of the data
Description
Function factory to apply white noise to a vector proportional to the spread of the data
Usage
adaptive_noise(sd.ratio = 1/10)
Arguments
| sd.ratio | the level of noise to apply relative to the vector's standard deviation. | 
Value
a function
Examples
f <- adaptive_noise(0.2)
f(1:10)
De-identification via categorical aggregation
Description
add_blur() adds a blurring step to a transformation pipeline
(NB: intended for categorical data). When run as a transformation, values
are recoded to a lower cardinality as defined by blur.
Usage
add_blur(object, ..., blur = c())
Arguments
| object | Either a data frame or an existing 'deident' pipeline. | 
| ... | variables to be transformed. | 
| blur | a key-value pair such that 'key' is replaced by 'value' on transformation. | 
Value
A 'DeidentList' representing the untrained transformation pipeline. The object contains fields:
- deident_methods: a list of each step in the pipeline (consisting of variables and method)
and methods:
- mutate: apply the pipeline to a new data set
- to_yaml: serialize the pipeline to a '.yml' file
See Also
category_blur() is provided to aid in defining the blur
Examples
.blur <- category_blur(ShiftsWorked$Shift, `Working` = "Day|Night")
pipe.blur <- add_blur(ShiftsWorked, `Shift`, blur = .blur)
pipe.blur$mutate(ShiftsWorked)
 
De-identification via hash encryption
Description
add_encrypt() adds an encryption step to a transformation pipeline.
When run as a transformation, each specified variable undergoes replacement
via a cryptographic hashing function, depending on the hash_key and seed set.
Usage
add_encrypt(object, ..., hash_key = "", seed = NA)
Arguments
| object | Either a data frame or an existing 'deident' pipeline. | 
| ... | variables to be transformed. | 
| hash_key | a random alphanumeric key to control encryption | 
| seed | a random alphanumeric string to concatenate to the value being encrypted | 
Value
A 'DeidentList' representing the untrained transformation pipeline. The object contains fields:
- deident_methods: a list of each step in the pipeline (consisting of variables and method)
and methods:
- mutate: apply the pipeline to a new data set
- to_yaml: serialize the pipeline to a '.yml' file
Examples
# Basic usage; without setting a `hash_key` or `seed` encryption is poor.
pipe.encrypt <- add_encrypt(ShiftsWorked, Employee)
pipe.encrypt$mutate(ShiftsWorked)
# Once set the encryption is more secure assuming `hash_key` and `seed` are 
# not exposed.
pipe.encrypt.secure <- add_encrypt(ShiftsWorked, Employee, hash_key="hash1", seed="Seed2")
pipe.encrypt.secure$mutate(ShiftsWorked)
Add aggregation to pipelines
Description
add_group() allows for the injection of aggregation into the transformation
pipeline.  Should you need to apply a transformation under aggregation (e.g.
add_shuffle) this helper creates a grouped data.frame as would be done
with dplyr::group_by().
The function add_ungroup() is supplied to perform the inverse operation.
Usage
add_group(object, ...)
add_ungroup(object, ...)
Arguments
| object | Either a data frame or an existing 'deident' pipeline. | 
| ... | Variables on which data is to be grouped. | 
Value
A 'DeidentList' representing the untrained transformation pipeline. The object contains fields:
- deident_methods: a list of each step in the pipeline (consisting of variables and method)
and methods:
- mutate: apply the pipeline to a new data set
- to_yaml: serialize the pipeline to a '.yml' file
Examples
pipe.grouped <- add_group(ShiftsWorked, Date, Shift)
pipe.grouped_shuffle <- add_shuffle(pipe.grouped, `Daily Pay`)
add_ungroup(pipe.grouped_shuffle, `Daily Pay`)
De-identification via numeric aggregation
Description
add_numeric_blur() adds a blurring step to a transformation pipeline
(NB: intended for numeric data). When run as a transformation, the data is
split into intervals according to the supplied cuts, forming the series
[-Inf, cut.1), [cut.1, cut.2), ..., [cut.n, Inf] where
cuts = c(cut.1, cut.2, ..., cut.n).
Usage
add_numeric_blur(object, ..., cuts = 0)
Arguments
| object | Either a data frame or an existing 'deident' pipeline. | 
| ... | variables to be transformed. | 
| cuts | The positions at which the data is to be divided. | 
Value
A 'DeidentList' representing the untrained transformation pipeline. The object contains fields:
- deident_methods: a list of each step in the pipeline (consisting of variables and method)
and methods:
- mutate: apply the pipeline to a new data set
- to_yaml: serialize the pipeline to a '.yml' file
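Examples
The sketch below is illustrative rather than taken from the package manual; the cut points 50 and 100 are arbitrary.
# Hypothetical cuts: `Daily Pay` is bucketed into [-Inf, 50), [50, 100), [100, Inf]
pipe.numeric_blur <- add_numeric_blur(ShiftsWorked, `Daily Pay`, cuts = c(50, 100))
pipe.numeric_blur$mutate(ShiftsWorked)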
De-identification via random noise
Description
add_perturb() adds a perturbation step to a transformation pipeline
(NB: intended for numeric data). When run as a transformation, each
specified variable is transformed by the noise function.
Usage
add_perturb(object, ..., noise = adaptive_noise(0.1))
Arguments
| object | Either a data frame or an existing 'deident' pipeline. | 
| ... | variables to be transformed. | 
| noise | a single-argument function that applies randomness. | 
Value
A 'DeidentList' representing the untrained transformation pipeline. The object contains fields:
- deident_methods: a list of each step in the pipeline (consisting of variables and method)
and methods:
- mutate: apply the pipeline to a new data set
- to_yaml: serialize the pipeline to a '.yml' file
See Also
adaptive_noise(), white_noise(), and lognorm_noise()
Examples
pipe.perturb <- add_perturb(ShiftsWorked, `Daily Pay`)
pipe.perturb$mutate(ShiftsWorked)
pipe.perturb.white_noise <- add_perturb(ShiftsWorked, `Daily Pay`, noise=white_noise(0.1))
pipe.perturb.white_noise$mutate(ShiftsWorked)
pipe.perturb.noisy_adaptive <- add_perturb(ShiftsWorked, `Daily Pay`, noise=adaptive_noise(1))
pipe.perturb.noisy_adaptive$mutate(ShiftsWorked)
De-identification via replacement
Description
add_pseudonymize() adds a pseudonymization step to a transformation pipeline.
When run as a transformation, terms that have not been seen before are given a new
random alphanumeric string, while terms that have been previously transformed
reuse the same replacement.
Usage
add_pseudonymize(object, ..., lookup = list())
Arguments
| object | Either a data frame or an existing 'deident' pipeline. | 
| ... | variables to be transformed. | 
| lookup | a pre-existing name-value pair to define intended pseudonymizations. Instances of 'name' will be replaced with 'value' on transformation. | 
Value
A 'DeidentList' representing the untrained transformation pipeline. The object contains fields:
- deident_methods: a list of each step in the pipeline (consisting of variables and method)
and methods:
- mutate: apply the pipeline to a new data set
- to_yaml: serialize the pipeline to a '.yml' file
Examples
# Basic usage; 
pipe.pseudonymize <- add_pseudonymize(ShiftsWorked, Employee)
pipe.pseudonymize$mutate(ShiftsWorked)
pipe.pseudonymize2 <- add_pseudonymize(ShiftsWorked, Employee, 
                                    lookup=list("Kyle Wilson" = "Kyle"))
pipe.pseudonymize2$mutate(ShiftsWorked)
De-identification via random sampling
Description
add_shuffle() adds a shuffling step to a transformation pipeline.
When run as a transformation, each specified variable undergoes a random sample without
replacement so that summary metrics on a single variable are unchanged, but
inter-variable metrics are rendered spurious.
Usage
add_shuffle(object, ..., limit = 0)
Arguments
| object | Either a data frame or an existing 'deident' pipeline. | 
| ... | variables to be transformed. | 
| limit | integer - the minimum number of observations a variable needs to have for shuffling to be performed. If the variable has length less than 'limit', the data is replaced with NAs. | 
Value
A 'DeidentList' representing the untrained transformation pipeline. The object contains fields:
- deident_methods: a list of each step in the pipeline (consisting of variables and method)
and methods:
- mutate: apply the pipeline to a new data set
- to_yaml: serialize the pipeline to a '.yml' file
See Also
add_group() for usage under aggregation
Examples
# Basic usage; 
pipe.shuffle <- add_shuffle(ShiftsWorked, Employee)
pipe.shuffle$mutate(ShiftsWorked)
pipe.shuffle.limit <- add_shuffle(ShiftsWorked, Employee, limit=1)
pipe.shuffle.limit$mutate(ShiftsWorked)
Apply a 'deident' pipeline
Description
Applies a pipeline as defined by deident to a data frame, tibble, or file.
Usage
apply_deident(object, deident, ...)
Arguments
| object | The data to be deidentified | 
| deident | A deidentification pipeline to be used. | 
| ... | Terms to be passed to other methods | 
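Examples
The sketch below simply mirrors the deident() example given later in this manual.
pipe <- deident(ShiftsWorked, Pseudonymizer, Employee)
apply_deident(ShiftsWorked, pipe)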
Apply a 'deident' pipeline to a new data frame
Description
Apply a 'deident' pipeline to a new data frame
Usage
apply_to_data_frame(data, transformer, ...)
Arguments
| data | The data set to be converted | 
| transformer | The pipeline to be used | 
| ... | To be passed on to other methods | 
Utility for producing 'blur'
Description
Utility for producing 'blur'
Usage
category_blur(vec, ...)
Arguments
| vec | The vector of values to be used | 
| ... | Named regular expressions defining the blur; values of 'vec' matching a pattern are recoded to that pattern's name (see the add_blur() example). | 
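Examples
The sketch below reuses the call shown under add_blur(); the named pattern `Working` = "Day|Night" is assumed to recode matching shift values to "Working".
.blur <- category_blur(ShiftsWorked$Shift, `Working` = "Day|Night")
.blur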
Create a deident pipeline
Description
Create a deident pipeline
Usage
create_deident(method, ...)
Arguments
| method | A deidentifier to initialize. | 
| ... | Variables to be deidentified. NB: keyword arguments will be passed to 'method' at initialization. | 
Define a transformation pipeline
Description
deident() creates a transformation pipeline of 'deidentifiers' for
the repeated application of  anonymization transformations.
Usage
deident(data, deidentifier, ...)
Arguments
| data | A data frame, existing pipeline, or a 'deidentifier' (as either initialized object, class generator, or character string) | 
| deidentifier | A 'deidentifier' (as either an initialized object, class generator, or character string) to be appended to the current pipeline | 
| ... | Positional arguments are variables of 'data' to be transformed and key-word arguments are passed to 'deidentifier' at creation | 
Value
A 'DeidentList' representing the untrained transformation pipeline. The object contains fields:
- deident_methods: a list of each step in the pipeline (consisting of variables and method)
and methods:
- mutate: apply the pipeline to a new data set
- to_yaml: serialize the pipeline to a '.yml' file
Examples
# 
pipe <- deident(ShiftsWorked, Pseudonymizer, Employee)
print(pipe)
apply_deident(ShiftsWorked, pipe)
  
Apply a pipeline to files on disk.
Description
Apply a deident pipeline to a set of files and save them back to disk
Usage
deident_job_from_folder(
  deident_pipeline,
  data_dir,
  result_dir = "Deident_results"
)
Arguments
| deident_pipeline | The deident list to be used. | 
| data_dir | a path to the files to be transformed. | 
| result_dir | a path to where files are to be saved. | 
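Examples
A hypothetical sketch: the directory paths are placeholders, and the input folder is assumed to hold tabular files the pipeline can read.
pipe <- deident(ShiftsWorked, Pseudonymizer, Employee)
deident_job_from_folder(pipe, data_dir = "path/to/input", result_dir = "Deident_results")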
Restore a serialized deident from file
Description
Restore a serialized deident from file
Usage
from_yaml(path)
Arguments
| path | Path to serialized deident. | 
Examples
deident <- deident(ShiftsWorked, Pseudonymizer, Employee)
.tempfile <- tempfile(fileext = ".yml")
deident$to_yaml(.tempfile)
deident.yaml <- from_yaml(.tempfile)
deident.yaml$mutate(ShiftsWorked)
Function factory to apply log-normal noise to a vector
Description
Function factory to apply log-normal noise to a vector
Usage
lognorm_noise(sd = 0.1)
Arguments
| sd | the standard deviation of noise to apply. | 
Value
a function
Examples
f <- lognorm_noise(1)
f(1:10)
Deidentification API root
Description
A general function for defining a deident function.
Usage
new_deident(object, ..., encrypter)
Arguments
| object | Either a data frame or an existing 'deident' pipeline. | 
| ... | variables to be transformed. | 
Value
A 'DeidentList' representing the untrained transformation pipeline. The object contains fields:
- deident_methods: a list of each step in the pipeline (consisting of variables and method)
and methods:
- mutate: apply the pipeline to a new data set
- to_yaml: serialize the pipeline to a '.yml' file
Starwars characters
Description
The original data, from SWAPI, the Star Wars API, https://swapi.py4e.com/, has been revised
to reflect additional research into gender and sex determinations of characters.  NB: taken from dplyr
Usage
starwars
Format
A tibble with 87 rows and 14 variables:
- name
- Name of the character 
- height
- Height (cm) 
- mass
- Weight (kg) 
- hair_color, skin_color, eye_color
- Hair, skin, and eye colors 
- birth_year
- Year born (BBY = Before Battle of Yavin) 
- sex
- The biological sex of the character, namely male, female, hermaphroditic, or none (as in the case for Droids). 
- gender
- The gender role or gender identity of the character as determined by their personality or the way they were programmed (as in the case for Droids). 
- homeworld
- Name of homeworld 
- species
- Name of species 
- films
- List of films the character appeared in 
- vehicles
- List of vehicles the character has piloted 
- starships
- List of starships the character has piloted 
Examples
starwars
Tidy eval helpers
Description
This page lists the tidy eval tools reexported in this package from rlang. To learn about using tidy eval in scripts and packages at a high level, see the dplyr programming vignette and the ggplot2 in packages vignette. The Metaprogramming section of Advanced R may also be useful for a deeper dive.
- The tidy eval operators {{, !!, and !!! are syntactic constructs which are specially interpreted by tidy eval functions. You will mostly need {{, as !! and !!! are more advanced operators which you should not have to use in simple cases. The curly-curly operator {{ allows you to tunnel data-variables passed from function arguments inside other tidy eval functions. {{ is designed for individual arguments. To pass multiple arguments contained in dots, use ... in the normal way.
  my_function <- function(data, var, ...) {
    data %>% group_by(...) %>% summarise(mean = mean({{ var }}))
  }
- enquo() and enquos() delay the execution of one or several function arguments. The former returns a single expression, the latter returns a list of expressions. Once defused, expressions will no longer evaluate on their own. They must be injected back into an evaluation context with !! (for a single expression) and !!! (for a list of expressions).
  my_function <- function(data, var, ...) {
    # Defuse
    var <- enquo(var)
    dots <- enquos(...)
    # Inject
    data %>% group_by(!!!dots) %>% summarise(mean = mean(!!var))
  }
  In this simple case, the code is equivalent to the usage of {{ and ... above. Defusing with enquo() or enquos() is only needed in more complex cases, for instance if you need to inspect or modify the expressions in some way.
- The .data pronoun is an object that represents the current slice of data. If you have a variable name in a string, use the .data pronoun to subset that variable with [[.
  my_var <- "disp"
  mtcars %>% summarise(mean = mean(.data[[my_var]]))
- Another tidy eval operator is :=. It makes it possible to use glue and curly-curly syntax on the LHS of =. For technical reasons, the R language doesn't support complex expressions on the left of =, so we use := as a workaround.
  my_function <- function(data, var, suffix = "foo") {
    # Use `{{` to tunnel function arguments and the usual glue
    # operator `{` to interpolate plain strings.
    data %>% summarise("{{ var }}_mean_{suffix}" := mean({{ var }}))
  }
- Many tidy eval functions like dplyr::mutate() or dplyr::summarise() give an automatic name to unnamed inputs. If you need to create the same sort of automatic names by yourself, use as_label(). For instance, the glue-tunnelling syntax above can be reproduced manually with:
  my_function <- function(data, var, suffix = "foo") {
    var <- enquo(var)
    prefix <- as_label(var)
    data %>% summarise("{prefix}_mean_{suffix}" := mean(!!var))
  }
  Expressions defused with enquo() (or tunnelled with {{) need not be simple column names, they can be arbitrarily complex. as_label() handles those cases gracefully. If your code assumes a simple column name, use as_name() instead. This is safer because it throws an error if the input is not a name as expected.
Function factory to apply white noise to a vector
Description
Function factory to apply white noise to a vector
Usage
white_noise(sd = 0.1)
Arguments
| sd | the standard deviation of noise to apply. | 
Value
a function
Examples
f <- white_noise(1)
f(1:10)