library(vazul)
library(dplyr)
The vazul package provides functions for data blinding in research contexts. Data blinding helps prevent researcher bias by anonymizing data while preserving analytical validity. This vignette introduces the main functions and demonstrates their usage with practical examples.
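There are two primary approaches to data blinding:
- Masking - replaces categorical values with anonymous labels
- Scrambling - randomizes the order of values while preserving the original data content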
Each approach is available at three levels:
- mask_labels() and scramble_values() - operate on single vectors
- mask_variables() and scramble_variables() - operate on columns in a data frame
- mask_variables_rowwise() and scramble_variables_rowwise() - operate within rows across columns
Masking functions replace categorical values with anonymous labels. This is useful when you want to completely hide the original information, such as treatment conditions or group assignments.
mask_labels() - Mask Vector Values
The mask_labels() function takes a character or factor vector and replaces each unique value with a randomly assigned masked label.
- x: A character or factor vector to mask
- prefix: Character string to use as prefix for masked labels (default: "masked_group_")
# Create a simple treatment vector
treatment <- c("control", "treatment", "control", "treatment", "control")
# Mask the labels
set.seed(123)
masked_treatment <- mask_labels(treatment)
masked_treatment
#> control treatment control treatment
#> "masked_group_01" "masked_group_02" "masked_group_01" "masked_group_02"
#> control
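#> "masked_group_01"
Notice that each unique value receives its own masked label, and every occurrence of a value receives the same label.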
You can customize the prefix used for masked labels:
set.seed(456)
mask_labels(treatment, prefix = "group_")
#> control treatment control treatment control
#> "group_01" "group_02" "group_01" "group_02" "group_01"
set.seed(789)
mask_labels(treatment, prefix = "condition_")
#> control treatment control treatment control
#> "condition_01" "condition_02" "condition_01" "condition_02" "condition_01"
The function preserves factor structure when the input is a factor:
# Create a factor vector
ecology <- factor(c("Desperate", "Hopeful", "Desperate", "Hopeful"))
set.seed(123)
masked_ecology <- mask_labels(ecology)
masked_ecology
#> Desperate Hopeful Desperate Hopeful
#> masked_group_01 masked_group_02 masked_group_01 masked_group_02
#> Levels: masked_group_01 masked_group_02
class(masked_ecology)
#> [1] "factor"
Let’s use the williams dataset to mask the ecology condition:
data(williams)
set.seed(42)
williams$ecology_masked <- mask_labels(williams$ecology)
# Compare original and masked values
head(williams[c("subject", "ecology", "ecology_masked")], 10)
#> # A tibble: 10 × 3
#> subject ecology ecology_masked
#> <chr> <chr> <chr>
#> 1 A30MP4LXV4MIFD Hopeful masked_group_01
#> 2 A16X5FB3HAFCKN Desperate masked_group_02
#> 3 A1E9D1OT9VJYDZ Desperate masked_group_02
#> 4 A16FPOYD7566WI Hopeful masked_group_01
#> 5 A11NOTVHWST7Y3 Desperate masked_group_02
#> 6 A3TDR6MXS6UO5Z Desperate masked_group_02
#> 7 A3OD4F0SA7EBCL Desperate masked_group_02
#> 8 A123PBQDU71I5O Hopeful masked_group_01
#> 9 A25NGIY591U3DK Hopeful masked_group_01
#> 10 A11WCFPJSR5VZP Desperate masked_group_02
Now researchers can analyze the data without knowing which condition is “Desperate” and which is “Hopeful”.
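The idea behind this kind of label masking can be sketched in a few lines of base R. This is a minimal illustration, not vazul's implementation: the helper name mask_sketch() is hypothetical, and the sketch assumes two-digit label numbers and that NA values stay NA.

```r
# Hypothetical helper for illustration only; not part of vazul.
mask_sketch <- function(x, prefix = "masked_group_") {
  values <- unique(x[!is.na(x)])            # distinct non-missing values
  ids <- sample.int(length(values))         # random label number for each value
  labels <- sprintf("%s%02d", prefix, ids)  # e.g. "masked_group_01"
  labels[match(x, values)]                  # look up each element; NA stays NA
}

set.seed(1)
condition <- c("Hopeful", "Desperate", NA, "Hopeful")
masked <- mask_sketch(condition)
masked
```

Because the label numbers are drawn at random, the label alone gives no hint about which original value it replaced.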
mask_variables() - Mask Data Frame Columns
The mask_variables() function applies masking to multiple columns in a data frame simultaneously.
- data: A data frame
- ...: Columns to mask (supports tidyselect helpers)
- across_variables: If TRUE, all selected variables share the same masked labels; if FALSE (default), each variable gets independent masked labels
By default, each column gets its own set of masked labels with the column name as prefix:
df <- data.frame(
  treatment = c("control", "intervention", "control", "intervention"),
  outcome = c("success", "failure", "success", "failure"),
  score = c(85, 92, 78, 88)
)
set.seed(123)
result <- mask_variables(df, c("treatment", "outcome"))
result
#> treatment outcome score
#> 1 treatment_group_01 outcome_group_01 85
#> 2 treatment_group_02 outcome_group_02 92
#> 3 treatment_group_01 outcome_group_01 78
#> 4 treatment_group_02 outcome_group_02 88
Notice that each column now has its own prefix (treatment_group_, outcome_group_).
When across_variables = TRUE, all selected columns share the same mapping:
df2 <- data.frame(
  pre_condition = c("A", "B", "C", "A"),
  post_condition = c("B", "A", "A", "C"),
  score = c(1, 2, 3, 4)
)
set.seed(456)
result_shared <- mask_variables(df2, c("pre_condition", "post_condition"),
                                across_variables = TRUE)
result_shared
#> pre_condition post_condition score
#> 1 masked_group_01 masked_group_03 1
#> 2 masked_group_03 masked_group_01 2
#> 3 masked_group_02 masked_group_01 3
#> 4 masked_group_01 masked_group_02 4
With shared masking, value “A” maps to the same label in both columns.
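A shared mapping of this kind can be sketched in base R by building one lookup from the pooled values of all selected columns. This is an illustrative sketch, not vazul's implementation; the object names key and shared are hypothetical.

```r
df2 <- data.frame(
  pre_condition = c("A", "B", "C", "A"),
  post_condition = c("B", "A", "A", "C"),
  score = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)
cols <- c("pre_condition", "post_condition")

set.seed(1)
# One lookup built from the pooled values of all selected columns
values <- unique(unlist(df2[cols], use.names = FALSE))
key <- setNames(
  sprintf("masked_group_%02d", sample.int(length(values))),
  values
)

# Apply the same lookup to every selected column
shared <- df2
shared[cols] <- lapply(df2[cols], function(col) unname(key[col]))
shared
```

Building a separate lookup per column instead would give the independent behaviour described for across_variables = FALSE.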
You can use tidyselect helpers to select columns:
set.seed(789)
mask_variables(df, where(is.character))
#> treatment outcome score
#> 1 treatment_group_01 outcome_group_02 85
#> 2 treatment_group_02 outcome_group_01 92
#> 3 treatment_group_01 outcome_group_02 78
#> 4 treatment_group_02 outcome_group_01 88
mask_variables_rowwise() - Row-Level Masking
The mask_variables_rowwise() function applies consistent masking within each row across multiple columns. This is useful when you have repeated measures or matched conditions.
- data: A data frame
- ...: Column sets to mask (supports tidyselect helpers)
- prefix: Character string to use as prefix for masked labels (default: "masked_group_")
df <- data.frame(
  treat_1 = c("control", "treatment", "placebo"),
  treat_2 = c("treatment", "placebo", "control"),
  treat_3 = c("placebo", "control", "treatment"),
  id = 1:3
)
set.seed(123)
result <- mask_variables_rowwise(df, starts_with("treat_"))
result
#> treat_1 treat_2 treat_3 id
#> 1 masked_group_03 masked_group_01 masked_group_02 1
#> 2 masked_group_01 masked_group_02 masked_group_03 2
#> 3 masked_group_02 masked_group_03 masked_group_01 3
Within each row, the original values are consistently mapped to masked labels, but the mapping is independent across rows.
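One way to picture row-level masking is to draw a fresh value-to-label mapping for every row. The base R sketch below does that; it is an assumption about the general idea, not vazul's implementation, and the helper mask_row() is hypothetical.

```r
df <- data.frame(
  treat_1 = c("control", "treatment", "placebo"),
  treat_2 = c("treatment", "placebo", "control"),
  treat_3 = c("placebo", "control", "treatment"),
  id = 1:3,
  stringsAsFactors = FALSE
)
cols <- c("treat_1", "treat_2", "treat_3")

# Hypothetical helper: mask the values of a single row with its own mapping
mask_row <- function(row_values, prefix = "masked_group_") {
  values <- unique(row_values)
  labels <- sprintf("%s%02d", prefix, sample.int(length(values)))
  labels[match(row_values, values)]
}

set.seed(1)
# apply() returns one column per row, so transpose back to rows
masked_matrix <- t(apply(as.matrix(df[cols]), 1, mask_row))
masked <- df
masked[cols] <- as.data.frame(masked_matrix, stringsAsFactors = FALSE)
masked
```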
Scrambling functions randomize the order of values while preserving all original data content. This approach maintains the data distribution while breaking the connection between observations and their original values.
scramble_values() - Scramble Vector Order
The scramble_values() function randomly reorders the elements of a vector.
- x: A vector to scramble
# Numeric data
set.seed(123)
numbers <- 1:10
scramble_values(numbers)
#> [1] 3 10 2 8 6 9 1 7 5 4
# Character data
set.seed(456)
letters_vec <- letters[1:5]
scramble_values(letters_vec)
#> [1] "e" "a" "c" "b" "d"
# Factor data
set.seed(789)
conditions <- factor(c("A", "B", "C", "A", "B"))
scramble_values(conditions)
#> [1] B A B C A
#> Levels: A B C
Scrambling preserves the values themselves and their frequency distribution; only the order changes:
set.seed(100)
original <- c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4)
scrambled <- scramble_values(original)
# Same values, different order
sort(original) == sort(scrambled)
#> [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# Same frequency distribution
table(original)
#> original
#> 1 2 3 4
#> 1 2 3 4
table(scrambled)
#> scrambled
#> 1 2 3 4
#> 1 2 3 4
data(williams)
set.seed(42)
williams$age_scrambled <- scramble_values(williams$age)
# The values are the same, just reordered
summary(williams$age)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 21.00 26.00 32.00 34.04 38.00 71.00
summary(williams$age_scrambled)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 21.00 26.00 32.00 34.04 38.00 71.00
# But individual correspondences are broken
head(williams[c("subject", "age", "age_scrambled")], 10)
#> # A tibble: 10 × 3
#> subject age age_scrambled
#> <chr> <dbl> <dbl>
#> 1 A30MP4LXV4MIFD 34 25
#> 2 A16X5FB3HAFCKN 30 26
#> 3 A1E9D1OT9VJYDZ 40 25
#> 4 A16FPOYD7566WI 35 38
#> 5 A11NOTVHWST7Y3 26 25
#> 6 A3TDR6MXS6UO5Z 33 28
#> 7 A3OD4F0SA7EBCL 33 57
#> 8 A123PBQDU71I5O 30 32
#> 9 A25NGIY591U3DK 48 25
#> 10 A11WCFPJSR5VZP 33 43
scramble_variables() - Scramble Data Frame Columns
The scramble_variables() function scrambles the values of specified columns in a data frame.
- data: A data frame
- ...: Columns to scramble (supports tidyselect helpers)
- together: If TRUE, variables are scrambled together as a unit per row; if FALSE (default), each variable is scrambled independently
- .groups: Optional grouping columns for within-group scrambling. Grouping columns must not overlap with the columns selected in .... If data is already grouped (a dplyr grouped data frame), existing grouping is ignored unless .groups is explicitly provided.
Each column is scrambled independently:
df <- data.frame(
  x = 1:6,
  y = letters[1:6],
  group = c("A", "A", "A", "B", "B", "B")
)
set.seed(123)
scramble_variables(df, c("x", "y"))
#> x y group
#> 1 3 e A
#> 2 6 d A
#> 3 2 b A
#> 4 4 f B
#> 5 5 a B
#> 6 1 c B
Notice that x and y are scrambled independently of each other.
When together = TRUE, the selected columns are scrambled as a unit, preserving row-level relationships:
set.seed(456)
scramble_variables(df, c("x", "y"), together = TRUE)
#> x y group
#> 1 5 e A
#> 2 6 f A
#> 3 3 c A
#> 4 2 b B
#> 5 1 a B
#> 6 4 d B
Notice that the pairs (1, “a”), (2, “b”), etc., remain intact but are assigned to different rows.
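The difference between the two modes comes down to how many permutations are drawn. The base R sketch below illustrates the idea and is not vazul's implementation; the object names independent, together, and perm are chosen for illustration.

```r
df <- data.frame(x = 1:6, y = letters[1:6], stringsAsFactors = FALSE)

set.seed(1)
# Independent: each column gets its own permutation, so (x, y) pairs break up
independent <- df
independent$x <- df$x[sample.int(nrow(df))]
independent$y <- df$y[sample.int(nrow(df))]

# Together: one permutation of row positions, reused for both columns,
# so each (x, y) pair moves to a new row as a unit
perm <- sample.int(nrow(df))
together <- df
together$x <- df$x[perm]
together$y <- df$y[perm]
together
```

Indexing with sample.int(n) rather than calling sample(x) directly avoids the base R behaviour where sample() treats a single number as the range 1:x.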
Use the .groups parameter to scramble within groups:
set.seed(2)
scramble_variables(df, "x", .groups = "group")
#> # A tibble: 6 × 3
#> x y group
#> <int> <chr> <chr>
#> 1 1 a A
#> 2 3 b A
#> 3 2 c A
#> 4 5 d B
#> 5 6 e B
#> 6 4 f B
Values of x are only swapped within their original group (A or B).
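Within-group scrambling can be sketched in base R with ave(), which applies a function separately to each group's values. This is an illustration of the concept rather than vazul's implementation; the helper shuffle() and the column x_scrambled are hypothetical names.

```r
df <- data.frame(
  x = 1:6,
  group = c("A", "A", "A", "B", "B", "B"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: permute a vector without the sample() single-number pitfall
shuffle <- function(v) v[sample.int(length(v))]

set.seed(1)
# ave() splits x by group, shuffles each piece, and puts the pieces back in place
df$x_scrambled <- ave(df$x, df$group, FUN = shuffle)
df
```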
You can combine both parameters:
set.seed(100)
scramble_variables(df, c("x", "y"), .groups = "group", together = TRUE)
#> # A tibble: 6 × 3
#> x y group
#> <int> <chr> <chr>
#> 1 2 b A
#> 2 1 a A
#> 3 3 c A
#> 4 6 f B
#> 5 4 d B
#> 6 5 e B
data(williams)
# Scramble age and ecology within gender groups
set.seed(42)
williams_scrambled <- williams |>
  scramble_variables(c("age", "ecology"), .groups = "gender")
# Check that values are preserved within groups
williams |>
  group_by(gender) |>
  summarise(mean_age = mean(age, na.rm = TRUE))
#> # A tibble: 2 × 2
#> gender mean_age
#> <dbl> <dbl>
#> 1 1 33.8
#> 2 2 34.6
williams_scrambled |>
  group_by(gender) |>
  summarise(mean_age = mean(age, na.rm = TRUE))
#> # A tibble: 2 × 2
#> gender mean_age
#> <dbl> <dbl>
#> 1 1 33.8
#> 2 2 34.6
scramble_variables_rowwise() - Row-Level Scrambling
The scramble_variables_rowwise() function scrambles values within each row across specified columns. This is useful for scrambling repeated measures or item responses.
- data: A data frame
- ...: Columns to scramble (supports tidyselect helpers). All selections are combined into a single set and scrambled together. If you want to scramble separate groups of columns independently, call the function multiple times.
Rowwise scrambling moves values between columns, so selected columns must be type-compatible. This function requires all selected columns to have the same class (or be an integer/double mix). For factors, the selected columns must also have identical levels.
df <- data.frame(
  item1 = c(1, 4, 7),
  item2 = c(2, 5, 8),
  item3 = c(3, 6, 9),
  id = 1:3
)
set.seed(123)
result <- scramble_variables_rowwise(df, c("item1", "item2", "item3"))
result
#> item1 item2 item3 id
#> 1 3 1 2 1
#> 2 5 4 6 2
#> 3 8 9 7 3
Within each row, the values are shuffled among the item columns.
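Row-level scrambling can be pictured as permuting each row of the selected columns on its own. The following base R sketch shows that idea under the assumption that all selected columns are numeric; it is not vazul's implementation.

```r
df <- data.frame(
  item1 = c(1, 4, 7),
  item2 = c(2, 5, 8),
  item3 = c(3, 6, 9),
  id = 1:3
)
cols <- c("item1", "item2", "item3")

set.seed(1)
# Permute the values of each row; apply() returns one column per row,
# so the result is transposed back to the original orientation
shuffled <- t(apply(
  as.matrix(df[cols]), 1,
  function(r) unname(r)[sample.int(length(r))]
))
result <- df
result[cols] <- as.data.frame(shuffled)
result
```

Each row keeps exactly the same set of values, which is why the selected columns need compatible types.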
Multiple selectors are combined into one set, so values can move between all selected columns:
df2 <- data.frame(
  day_1 = c(1, 4, 7),
  day_2 = c(2, 5, 8),
  day_3 = c(3, 6, 9),
  score_a = c(10, 40, 70),
  score_b = c(20, 50, 80),
  id = 1:3
)
set.seed(2)
result2 <- scramble_variables_rowwise(df2, starts_with("day_"), starts_with("score_"))
result2
#> day_1 day_2 day_3 score_a score_b id
#> 1 1 3 2 20 10 1
#> 2 5 6 4 40 50 2
#> 3 7 9 8 70 80 3
To scramble different groups of columns independently, call the function multiple times:
set.seed(42)
result3 <- df2 |>
  scramble_variables_rowwise(starts_with("day_")) |>
  scramble_variables_rowwise(starts_with("score_"))
result3
#> day_1 day_2 day_3 score_a score_b id
#> 1 1 3 2 20 10 1
#> 2 4 5 6 50 40 2
#> 3 8 9 7 80 70 3
All masking functions preserve NA values in their original positions:
# Vector with NA values
x <- c("A", "B", NA, "A", NA, "C")
set.seed(123)
masked_x <- mask_labels(x)
masked_x
#> A B <NA> A
#> "masked_group_03" "masked_group_04" NA "masked_group_03"
#> <NA> C
#> NA "masked_group_02"
# NA positions are preserved
which(is.na(masked_x))
#> <NA> <NA>
#> 3 5
If all values in a vector are NA, the function will issue a warning and return the vector unchanged:
x_all_na <- c(NA_character_, NA_character_, NA_character_)
mask_labels(x_all_na)
#> <NA> <NA> <NA>
#> NA NA NA
Empty strings ("") are treated as valid categorical values and will be masked like any other value:
x_with_empty <- c("A", "", "B", "", "C")
set.seed(456)
masked_with_empty <- mask_labels(x_with_empty)
masked_with_empty
#> A <NA> B <NA>
#> "masked_group_01" NA "masked_group_03" NA
#> C
#> "masked_group_02"
# Empty strings get their own masked label
unique(masked_with_empty)
#> [1] "masked_group_01" NA "masked_group_03" "masked_group_02"
This is different from NA values: empty strings are actual data values, not missing data.
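The distinction between an empty string and a missing value is a property of R itself and can be checked without any package. The short sketch below uses only base R; the variable names are chosen for illustration.

```r
x <- c("A", "", NA)

# NA marks missing data; "" is an ordinary character value of length zero
is_missing <- is.na(x)
is_empty <- !is.na(x) & x == ""

is_missing
is_empty
```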
| Aspect | Masking | Scrambling |
|---|---|---|
| Original values | Hidden (replaced) | Preserved (reordered) |
| Distribution | Changed (new labels) | Unchanged |
| Best for | Categorical variables | Numeric or categorical |
| Use case | Hide treatment conditions | Break individual links |
| Reversibility | Requires mapping key | Irreversible |
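The reversibility row can be illustrated with a base R sketch. It assumes that whoever performs the masking keeps the value-to-label key; the source does not say whether vazul returns or stores such a key, so the key and inverse objects here are hypothetical.

```r
set.seed(1)
treatment <- c("control", "treatment", "control", "treatment")

# Build and keep a key from original values to masked labels
values <- unique(treatment)
key <- setNames(
  sprintf("masked_group_%02d", sample.int(length(values))),
  values
)
masked <- unname(key[treatment])

# Unblinding: invert the key and look the masked labels up again
inverse <- setNames(names(key), key)
restored <- unname(inverse[masked])
identical(restored, treatment)
```

A scrambled vector has no comparable key unless the permutation itself is saved, which is why the table lists scrambling as irreversible.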
The vazul package includes two research datasets for demonstration and practice.
The Many Analysts Religion Project (MARP) dataset contains 10,535 participants from 24 countries:
data(marp)
dim(marp)
#> [1] 10535 46
# Example: Scramble religiosity scores within countries
set.seed(42)
marp_blinded <- marp |>
  scramble_variables(starts_with("rel_"), .groups = "country")
# Compare country-level means of the original and scrambled data
original_means <- marp |>
  group_by(country) |>
  summarise(rel_1_mean = mean(rel_1, na.rm = TRUE), .groups = "drop")
scrambled_means <- marp_blinded |>
  group_by(country) |>
  summarise(rel_1_mean = mean(rel_1, na.rm = TRUE), .groups = "drop")
all.equal(original_means$rel_1_mean, scrambled_means$rel_1_mean)
#> [1] "Mean relative difference: 0.2219757"
The Williams study dataset contains 112 participants from a stereotyping study:
data(williams)
dim(williams)
#> [1] 112 25
# Example: Mask the ecology condition for blind analysis
set.seed(42)
williams_blinded <- williams |>
  mask_variables("ecology")
# Analysts can work with masked conditions
williams_blinded |>
  group_by(ecology) |>
  summarise(
    n = n(),
    mean_impulsivity = mean(Impuls_1, na.rm = TRUE),
    .groups = "drop"
  )
#> # A tibble: 2 × 3
#> ecology n mean_impulsivity
#> <chr> <int> <dbl>
#> 1 ecology_group_01 56 4.32
#> 2 ecology_group_02 56 4.61
The vazul package provides a comprehensive toolkit for data blinding:
| Function | Level | Purpose |
|---|---|---|
| mask_labels() | Vector | Replace categorical values with anonymous labels |
| mask_variables() | Data frame | Mask multiple columns |
| mask_variables_rowwise() | Row-wise | Consistent masking within rows |
| scramble_values() | Vector | Randomize value order |
| scramble_variables() | Data frame | Scramble multiple columns |
| scramble_variables_rowwise() | Row-wise | Scramble values within rows |
These functions help researchers conduct unbiased analyses by separating the analyst from knowledge about treatment conditions, group assignments, or individual data points.