Data Blinding with vazul

Introduction

The vazul package provides functions for data blinding in research contexts. Data blinding helps prevent researcher bias by anonymizing data while preserving analytical validity. This vignette introduces the main functions and demonstrates their usage with practical examples.

There are two primary approaches to data blinding:

Masking: Replaces original values with anonymous labels, completely hiding the original information.
Scrambling: Randomizes the order of existing values while preserving all original data content.
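
The difference between the two approaches can be sketched in a few lines of base R. This is a conceptual illustration only and not vazul's internal implementation; the variable names and the label format are assumptions chosen to resemble the examples later in this vignette.

```r
# Conceptual base R sketch; NOT vazul's implementation.
set.seed(1)
condition <- c("control", "treatment", "control", "treatment", "control")

# Masking: give each unique value a randomly assigned anonymous label.
values <- unique(condition)
labels <- sample(sprintf("masked_group_%02d", seq_along(values)))
masked <- labels[match(condition, values)]

# Scrambling: keep every original value, but shuffle the order.
scrambled <- condition[sample.int(length(condition))]
```

After masking, none of the original strings remain in the vector. After scrambling, all of them remain, only in different positions.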

Each approach is available at three levels:

Vector level: mask_labels() and scramble_values() - operate on single vectors
Data frame level: mask_variables() and scramble_variables() - operate on columns in a data frame
Row-wise level: scramble_variables(..., .byrow = TRUE) - operates within rows across columns

library(vazul)
library(dplyr)

Masking Functions

Masking functions replace categorical values with anonymous labels. This is useful when you want to completely hide the original information, such as treatment conditions or group assignments.

`mask_labels()` - Mask Vector Values

The mask_labels() function takes a character or factor vector and replaces each unique value with a randomly assigned masked label.

Parameters

x: A character or factor vector to mask
prefix: Character string to use as prefix for masked labels (default: "masked_group_")

Basic Usage

# Create a simple treatment vector
treatment <- c("control", "treatment", "control", "treatment", "control")

# Mask the labels
set.seed(1037)
masked_treatment <- mask_labels(treatment)
masked_treatment
#> [1] "masked_group_01" "masked_group_02" "masked_group_01" "masked_group_02"
#> [5] "masked_group_01"

Notice that:

Each unique value receives a unique masked label
The same original value always maps to the same masked label
The assignment of masked labels to original values is randomized
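
One way to check the first two properties is to cross-tabulate the original and masked vectors. The sketch below hard-codes the output shown above so that it runs without the package; `one_to_one` is an illustrative name, not a vazul object.

```r
# Values copied from the example above.
treatment <- c("control", "treatment", "control", "treatment", "control")
masked_treatment <- c("masked_group_01", "masked_group_02", "masked_group_01",
                      "masked_group_02", "masked_group_01")

# In a one-to-one mapping, every row and every column of the
# cross-table has exactly one non-zero cell.
xtab <- table(treatment, masked_treatment)
one_to_one <- all(rowSums(xtab > 0) == 1) && all(colSums(xtab > 0) == 1)
one_to_one
```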

Custom Prefix

You can customize the prefix used for masked labels:

set.seed(456)
mask_labels(treatment, prefix = "group_")
#> [1] "group_01" "group_02" "group_01" "group_02" "group_01"

set.seed(789)
mask_labels(treatment, prefix = "condition_")
#> [1] "condition_01" "condition_02" "condition_01" "condition_02" "condition_01"

Working with Factors

The function preserves factor structure when the input is a factor:

# Create a factor vector
ecology <- factor(c("Desperate", "Hopeful", "Desperate", "Hopeful"))

set.seed(123)
masked_ecology <- mask_labels(ecology)
masked_ecology
#> [1] masked_group_01 masked_group_02 masked_group_01 masked_group_02
#> Levels: masked_group_01 masked_group_02
class(masked_ecology)
#> [1] "factor"
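
A factor stores integer codes plus a set of levels, so relabeling the levels is enough to mask it while keeping the factor class. The following base R sketch shows that idea under the assumption of the same label format as above; it is not necessarily how mask_labels() works internally.

```r
# Conceptual sketch of masking a factor by relabeling its levels.
ecology <- factor(c("Desperate", "Hopeful", "Desperate", "Hopeful"))

set.seed(123)
new_levels <- sample(sprintf("masked_group_%02d", seq_along(levels(ecology))))

masked <- ecology
levels(masked) <- new_levels  # integer codes stay the same, labels change

class(masked)
```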

Practical Example with Dataset

Let’s use the williams dataset to mask the ecology condition:

data(williams)

set.seed(42)
williams$ecology_masked <- mask_labels(williams$ecology)

# Compare original and masked values
head(williams[c("subject", "ecology", "ecology_masked")], 10)
#> # A tibble: 10 × 3
#>    subject        ecology   ecology_masked 
#>    <chr>          <chr>     <chr>          
#>  1 A30MP4LXV4MIFD Hopeful   masked_group_01
#>  2 A16X5FB3HAFCKN Desperate masked_group_02
#>  3 A1E9D1OT9VJYDZ Desperate masked_group_02
#>  4 A16FPOYD7566WI Hopeful   masked_group_01
#>  5 A11NOTVHWST7Y3 Desperate masked_group_02
#>  6 A3TDR6MXS6UO5Z Desperate masked_group_02
#>  7 A3OD4F0SA7EBCL Desperate masked_group_02
#>  8 A123PBQDU71I5O Hopeful   masked_group_01
#>  9 A25NGIY591U3DK Hopeful   masked_group_01
#> 10 A11WCFPJSR5VZP Desperate masked_group_02

Now researchers can analyze the data without knowing which condition is “Desperate” vs “Hopeful”.

`mask_variables()` - Mask Data Frame Columns

The mask_variables() function applies masking to multiple columns in a data frame simultaneously.

Parameters

data: A data frame
...: Columns to mask (supports tidyselect helpers)
.across_variables: If TRUE, all selected variables share the same masked labels; if FALSE (default), each variable gets independent masked labels

Independent Masking (Default)

By default, each column gets its own set of masked labels with the column name as prefix:

df <- data.frame(
  treatment = c("control", "intervention", "control", "intervention"),
  outcome = c("success", "failure", "success", "failure"),
  score = c(85, 92, 78, 88)
)

set.seed(123)
result <- mask_variables(df, c("treatment", "outcome"))
result
#>            treatment          outcome score
#> 1 treatment_group_01 outcome_group_01    85
#> 2 treatment_group_02 outcome_group_02    92
#> 3 treatment_group_01 outcome_group_01    78
#> 4 treatment_group_02 outcome_group_02    88

Notice that each column now has its own prefix (treatment_group_, outcome_group_).

Shared Masking Across Variables

When .across_variables = TRUE, all selected columns share the same mapping:

df2 <- data.frame(
  pre_condition = c("A", "B", "C", "A"),
  post_condition = c("B", "A", "A", "C"),
  score = c(1, 2, 3, 4)
)

set.seed(456)
result_shared <- mask_variables(df2, c("pre_condition", "post_condition"),
                                .across_variables = TRUE)
result_shared
#>     pre_condition  post_condition score
#> 1 masked_group_01 masked_group_03     1
#> 2 masked_group_03 masked_group_01     2
#> 3 masked_group_02 masked_group_01     3
#> 4 masked_group_01 masked_group_02     4

With shared masking, value “A” maps to the same label in both columns.
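
The idea behind a shared mapping can be sketched in base R: build a single lookup key from the values of all selected columns, then apply that key to each column. This is an illustration under assumed names (`key`, `shared`), not vazul's implementation.

```r
df2 <- data.frame(
  pre_condition = c("A", "B", "C", "A"),
  post_condition = c("B", "A", "A", "C"),
  score = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)
cols <- c("pre_condition", "post_condition")

set.seed(456)
# One key built from the pooled values of both columns
values <- unique(unlist(df2[cols], use.names = FALSE))
key <- setNames(sample(sprintf("masked_group_%02d", seq_along(values))), values)

# Apply the same key to every selected column
shared <- df2
shared[cols] <- lapply(df2[cols], function(col) unname(key[col]))
shared
```

Because both columns are looked up in the same key, equal original values receive equal masked labels regardless of the column they appear in.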

Using tidyselect Helpers

You can use tidyselect helpers to select columns:

set.seed(789)
mask_variables(df, where(is.character))
#>            treatment          outcome score
#> 1 treatment_group_01 outcome_group_02    85
#> 2 treatment_group_02 outcome_group_01    92
#> 3 treatment_group_01 outcome_group_02    78
#> 4 treatment_group_02 outcome_group_01    88

Scrambling Functions

Scrambling functions randomize the order of values while preserving all original data content. This approach maintains the data distribution while breaking the connection between observations and their original values.

`scramble_values()` - Scramble Vector Order

The scramble_values() function randomly reorders the elements of a vector.

Parameters

x: A vector to scramble

Basic Usage with Different Data Types

# Numeric data
set.seed(123)
numbers <- 1:10
scramble_values(numbers)
#>  [1]  3 10  2  8  6  9  1  7  5  4

# Character data
set.seed(456)
letters_vec <- letters[1:5]
scramble_values(letters_vec)
#> [1] "e" "a" "c" "b" "d"

# Factor data
set.seed(789)
conditions <- factor(c("A", "B", "C", "A", "B"))
scramble_values(conditions)
#> [1] B A B C A
#> Levels: A B C

Key Properties

Scrambling preserves:

All original values (nothing is lost or changed)
The data type
The distribution of values

set.seed(1037)
original <- c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4)
scrambled <- scramble_values(original)

# Inspect the scrambled vector
scrambled
#>  [1] 4 3 2 1 4 2 3 3 4 4

# Same values, different order
sort(original) == sort(scrambled)
#>  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

# Same frequency distribution
table(original)
#> original
#> 1 2 3 4 
#> 1 2 3 4
table(scrambled)
#> scrambled
#> 1 2 3 4 
#> 1 2 3 4

`scramble_variables()` - Scramble Data Frame Columns

The scramble_variables() function scrambles the values of specified columns in a data frame.

Parameters

data: A data frame
...: Columns to scramble (supports tidyselect helpers)
.together: If TRUE, variables are scrambled together as a unit per row; if FALSE (default), each variable is scrambled independently
.groups: Optional grouping columns for within-group scrambling. Grouping columns must not overlap with the columns selected in .... If data is already grouped (a dplyr grouped data frame), existing grouping is ignored unless .groups is explicitly provided.

Independent Scrambling (Default)

Each column is scrambled independently:

df <- data.frame(
  x = 1:6,
  y = letters[1:6],
  group = c("A", "A", "A", "B", "B", "B")
)

set.seed(123)
scramble_variables(df, c("x", "y"))
#>   x y group
#> 1 3 e     A
#> 2 6 d     A
#> 3 2 b     A
#> 4 4 f     B
#> 5 5 a     B
#> 6 1 c     B

Notice that x and y are scrambled independently of each other.

Scrambling Together

When .together = TRUE, the selected columns are scrambled as a unit, preserving row-level relationships:

set.seed(456)
scramble_variables(df, c("x", "y"), .together = TRUE)
#>   x y group
#> 1 5 e     A
#> 2 6 f     A
#> 3 3 c     A
#> 4 2 b     B
#> 5 1 a     B
#> 6 4 d     B

Notice that the pairs (1, “a”), (2, “b”), etc., remain intact but are assigned to different rows.
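
Conceptually, scrambling together amounts to drawing one permutation of the row indices and applying it to all selected columns, whereas independent scrambling draws a separate permutation per column. A base R sketch of the "together" case, with illustrative names and no claim about vazul's internals:

```r
df <- data.frame(
  x = 1:6,
  y = letters[1:6],
  group = c("A", "A", "A", "B", "B", "B"),
  stringsAsFactors = FALSE
)

set.seed(456)
idx <- sample.int(nrow(df))  # a single permutation of the rows

together <- df
together[c("x", "y")] <- df[idx, c("x", "y")]  # same permutation for both columns

# Each x is still paired with its original y
all(together$y == letters[together$x])
```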

Within-Group Scrambling

Use the .groups parameter to scramble within groups:

set.seed(2)
scramble_variables(df, "x", .groups = "group")
#> # A tibble: 6 × 3
#>       x y     group
#>   <int> <chr> <chr>
#> 1     1 a     A    
#> 2     3 b     A    
#> 3     2 c     A    
#> 4     5 d     B    
#> 5     6 e     B    
#> 6     4 f     B

Values of x are only swapped within their original group (A or B).
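
Within-group shuffling can be approximated in base R with ave(), which applies a function separately to each group. The helper `shuffle` below is a hypothetical name; it indexes with sample.int() rather than calling sample() directly, because sample(v) treats a single number v as the range 1:v, which would misbehave for groups of size one.

```r
df <- data.frame(
  x = 1:6,
  y = letters[1:6],
  group = c("A", "A", "A", "B", "B", "B"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: permute a vector safely, even when it has length 1
shuffle <- function(v) v[sample.int(length(v))]

set.seed(2)
x_within <- ave(df$x, df$group, FUN = shuffle)

# Values never leave their group
split(x_within, df$group)
```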

Combining Grouping and Together

You can combine both parameters:

set.seed(100)
scramble_variables(df, c("x", "y"), .groups = "group", .together = TRUE)
#> # A tibble: 6 × 3
#>       x y     group
#>   <int> <chr> <chr>
#> 1     2 b     A    
#> 2     1 a     A    
#> 3     3 c     A    
#> 4     6 f     B    
#> 5     4 d     B    
#> 6     5 e     B

Practical Example with Dataset

data(williams)

# Scramble age and ecology within gender groups
set.seed(42)
williams_scrambled <- williams |>
  scramble_variables(c("age", "ecology"), .groups = "gender")

# Group means are unchanged because values only move within each gender group
williams |>
  group_by(gender) |>
  summarise(mean_age = mean(age, na.rm = TRUE))
#> # A tibble: 2 × 2
#>   gender mean_age
#>    <dbl>    <dbl>
#> 1      1     33.8
#> 2      2     34.6

williams_scrambled |>
  group_by(gender) |>
  summarise(mean_age = mean(age, na.rm = TRUE))
#> # A tibble: 2 × 2
#>   gender mean_age
#>    <dbl>    <dbl>
#> 1      1     33.8
#> 2      2     34.6

`scramble_variables(..., .byrow = TRUE)` - Row-Level Scrambling

Calling scramble_variables() with .byrow = TRUE scrambles values within each row across the specified columns. This is useful for scrambling repeated measures or item responses.

Parameters

data: A data frame
...: Columns to scramble (supports tidyselect helpers). All selections are combined into a single set and scrambled together. If you want to scramble separate groups of columns independently, call the function multiple times.
.byrow: Must be set to TRUE for row-wise scrambling.

Row-wise scrambling moves values between columns, so the selected columns must be type-compatible. With .byrow = TRUE, scramble_variables() requires all selected columns to have the same class (or to be an integer/double mix). For factors, the selected columns must also have identical levels.
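
The compatibility rule can be expressed as a small pre-check. `columns_compatible()` below is a hypothetical helper written for this illustration, not a vazul function, and the package's own validation may differ in detail.

```r
# Hypothetical helper, not part of vazul: TRUE when the columns could
# exchange values without changing type.
columns_compatible <- function(data, cols) {
  classes <- vapply(data[cols], function(col) class(col)[1], character(1))
  if (all(classes %in% c("integer", "numeric"))) {
    return(TRUE)  # integer/double mixes are allowed
  }
  if (length(unique(classes)) != 1) {
    return(FALSE)
  }
  if (is.factor(data[[cols[1]]])) {
    first_levels <- levels(data[[cols[1]]])
    same_levels <- vapply(
      data[cols],
      function(col) identical(levels(col), first_levels),
      logical(1)
    )
    return(all(same_levels))
  }
  TRUE
}

df <- data.frame(
  item1 = c(1, 4, 7),
  item2 = c(2L, 5L, 8L),
  label = c("a", "b", "c"),
  stringsAsFactors = FALSE
)
columns_compatible(df, c("item1", "item2"))  # double and integer
columns_compatible(df, c("item1", "label"))  # numeric and character
```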

Example: Scrambling Item Responses

df <- data.frame(
  item1 = c(1, 4, 7),
  item2 = c(2, 5, 8),
  item3 = c(3, 6, 9),
  id = 1:3
)

set.seed(123)
result <- scramble_variables(df, c("item1", "item2", "item3"), .byrow = TRUE)
result
#>   item1 item2 item3 id
#> 1     3     1     2  1
#> 2     5     4     6  2
#> 3     8     9     7  3

Within each row, the values are shuffled among the item columns.
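
For intuition, the same kind of row-wise shuffle can be written in base R with apply(). This is a sketch under assumed names and is not expected to reproduce the exact output above, since vazul may draw its random numbers differently.

```r
df <- data.frame(
  item1 = c(1, 4, 7),
  item2 = c(2, 5, 8),
  item3 = c(3, 6, 9),
  id = 1:3
)
items <- c("item1", "item2", "item3")

set.seed(123)
# Permute the values of each row, then transpose back to rows x columns
shuffled <- t(apply(df[items], 1, function(r) unname(r)[sample.int(length(r))]))

rowwise <- df
rowwise[items] <- as.data.frame(shuffled)
rowwise
```

Each row keeps its own set of values, and the id column is untouched because it was not selected.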

Combining Multiple Selectors (Single Combined Set)

Multiple selectors are combined into one set, so values can move between all selected columns:

df2 <- data.frame(
  day_1 = c(1, 4, 7),
  day_2 = c(2, 5, 8),
  day_3 = c(3, 6, 9),
  score_a = c(10, 40, 70),
  score_b = c(20, 50, 80),
  id = 1:3
)

set.seed(2)
result2 <- scramble_variables(df2, starts_with("day_"), starts_with("score_"), .byrow = TRUE)
result2
#>   day_1 day_2 day_3 score_a score_b id
#> 1    20     3     2      10       1  1
#> 2     4    50    40       5       6  2
#> 3     7     8     9      80      70  3

Scrambling Separate Groups Independently (Call Multiple Times)

To scramble different groups of columns independently, call the function multiple times:

set.seed(42)
result3 <- df2 |>
  scramble_variables(starts_with("day_"), .byrow = TRUE) |>
  scramble_variables(starts_with("score_"), .byrow = TRUE)
result3
#>   day_1 day_2 day_3 score_a score_b id
#> 1     1     3     2      20      10  1
#> 2     4     5     6      50      40  2
#> 3     8     9     7      80      70  3

Handling Special Values

Missing Values (NA)

All masking functions preserve NA values in their original positions:

# Vector with NA values
x <- c("A", "B", NA, "A", NA, "C")

set.seed(123)
masked_x <- mask_labels(x)
masked_x
#> [1] "masked_group_03" "masked_group_01" NA                "masked_group_03"
#> [5] NA                "masked_group_02"

# NA positions are preserved
which(is.na(masked_x))
#> [1] 3 5
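
A lookup-based masking scheme preserves NA positions naturally when the key is built from the non-missing values only. The base R sketch below illustrates this; it is an assumption about how such behavior can arise, not a description of vazul's code.

```r
x <- c("A", "B", NA, "A", NA, "C")

set.seed(123)
values <- unique(x[!is.na(x)])  # NA is left out of the key
labels <- sample(sprintf("masked_group_%02d", seq_along(values)))

# match() returns NA for values that are not in the key, so NA stays NA
masked <- labels[match(x, values)]
which(is.na(masked))
```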

If all values in a vector are NA, the function will issue a warning and return the vector unchanged:

x_all_na <- c(NA_character_, NA_character_, NA_character_)
mask_labels(x_all_na)
#> Warning: All values in input are NA. Returning unchanged.
#> [1] NA NA NA

Empty Strings

Empty strings ("") are treated as valid categorical values and will be masked like any other value:

x_with_empty <- c("A", "", "B", "", "C")

set.seed(456)
masked_with_empty <- mask_labels(x_with_empty)
masked_with_empty
#> [1] "masked_group_01" "masked_group_04" "masked_group_03" "masked_group_04"
#> [5] "masked_group_02"

# Empty strings get their own masked label
unique(masked_with_empty)
#> [1] "masked_group_01" "masked_group_04" "masked_group_03" "masked_group_02"

This is different from NA values - empty strings are actual data values, not missing data.
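
If empty strings in a particular dataset actually stand for missing responses (an assumption about your data, not something the package decides for you), one option is to recode them to NA before masking so that they are left unmasked. A base R sketch:

```r
x_with_empty <- c("A", "", "B", "", "C")

# Recode empty strings to NA; %in% is used so existing NAs cause no trouble
x_clean <- x_with_empty
x_clean[x_clean %in% ""] <- NA_character_
x_clean
```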

Choosing Between Masking and Scrambling

| Aspect | Masking | Scrambling |
|---|---|---|
| Original values | Hidden (replaced) | Preserved (reordered) |
| Distribution | Changed (new labels) | Unchanged |
| Best for | Categorical variables | Numeric or categorical |
| Use case | Hide treatment conditions | Break individual links |
| Reversibility | Requires mapping key | Irreversible |
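
The "Requires mapping key" entry can be illustrated with a small base R sketch. It assumes that someone outside the analysis team still has access to both the original and the masked column; the vectors are copied from the first rows of the williams example above, and `key` and `unblinded` are illustrative names.

```r
original <- c("Hopeful", "Desperate", "Desperate", "Hopeful")
masked <- c("masked_group_01", "masked_group_02",
            "masked_group_02", "masked_group_01")

# The key is the set of distinct (original, masked) pairs
key <- unique(data.frame(original, masked, stringsAsFactors = FALSE))

# Unblinding looks each masked label up in the key
unblinded <- key$original[match(masked, key$masked)]
identical(unblinded, original)
```

A scrambled vector has no comparable key unless the permutation itself was stored, which is why the table lists scrambling as irreversible.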

When to Use Masking

When you need to hide categorical labels (e.g., treatment conditions, group names)
When analysts should not know the meaning of categories
When you want different prefixes for different variables

When to Use Scrambling

When you want to preserve the original data distribution
When you need to break the link between observations and values
When working with numeric data that shouldn’t be categorically relabeled

Working with Included Datasets

The vazul package includes two research datasets for demonstration and practice.

MARP Dataset

The Many Analysts Religion Project (MARP) dataset contains data from 10,535 participants in 24 countries:

data(marp)
dim(marp)
#> [1] 10535    46

# Example: Scramble religiosity scores within countries
set.seed(42)
marp_blinded <- marp |>
  scramble_variables(starts_with("rel_"), .groups = "country")

# Original and scrambled data have the same country-level means
original_means <- marp |>
  group_by(country) |>
  summarise(rel_1_mean = mean(rel_1, na.rm = TRUE), .groups = "drop")

scrambled_means <- marp_blinded |>
  group_by(country) |>
  summarise(rel_1_mean = mean(rel_1, na.rm = TRUE), .groups = "drop")

all.equal(original_means$rel_1_mean, scrambled_means$rel_1_mean)
#> [1] TRUE

Williams Dataset

The Williams study dataset contains data from 112 participants in a stereotyping study:

data(williams)
dim(williams)
#> [1] 112  25

# Example: Mask the ecology condition for blind analysis
set.seed(42)
williams_blinded <- williams |>
  mask_variables("ecology")

# Analysts can work with masked conditions
williams_blinded |>
  group_by(ecology) |>
  summarise(
    n = n(),
    mean_impulsivity = mean(Impuls_1, na.rm = TRUE),
    .groups = "drop"
  )
#> # A tibble: 2 × 3
#>   ecology              n mean_impulsivity
#>   <chr>            <int>            <dbl>
#> 1 ecology_group_01    56             4.32
#> 2 ecology_group_02    56             4.61

Summary

The vazul package provides a comprehensive toolkit for data blinding:

| Function | Level | Purpose |
|---|---|---|
| `mask_labels()` | Vector | Replace categorical values with anonymous labels |
| `mask_variables()` | Data frame | Mask multiple columns |
| `scramble_values()` | Vector | Randomize value order |
| `scramble_variables()` | Data frame | Scramble multiple columns |
| `scramble_variables(..., .byrow = TRUE)` | Row-wise | Scramble values within rows |

These functions help researchers conduct unbiased analyses by separating the analyst from knowledge about treatment conditions, group assignments, or individual data points.
