Chunked eyerisdb Database Export for Large Datasets

Introduction

When working with large eyeris databases containing millions of eye-tracking data points, traditional export methods can run into memory limitations or produce unwieldy files. The chunked database export functionality in eyeris provides an out-of-the-box solution for handling very large eyerisdb databases by:

  - Processing data in configurable chunks (1 million rows at a time by default)
  - Splitting output into size-limited files (500MB each by default)
  - Supporting both CSV and Parquet output formats
  - Letting you export only the data types and subjects you need

This vignette walks through how to use these features after you’ve created an eyerisdb database using bidsify(db_enabled = TRUE).

Prerequisites

Before using the chunked export functions, you need:

  1. An eyerisdb database created with bidsify(db_enabled = TRUE)
  2. The arrow package (for Parquet support), installed via install.packages("arrow"); note that arrow is included when you install eyeris from CRAN
  3. Sufficient disk space for the exported files
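
If you are unsure whether arrow is available in your environment, a quick check like the following (a minimal base R sketch) installs it only if it is missing:

# install arrow only if it is not already available
if (!requireNamespace("arrow", quietly = TRUE)) {
  install.packages("arrow")
}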

Basic Usage

Simple Export with Default Settings

The easiest way to export your entire database is with eyeris_db_to_chunked_files():

result <- eyeris_db_to_chunked_files(
  bids_dir = "/path/to/your/bids/directory",
  db_path = "my-project"  # your database name
)

# view what was exported
print(result)

Using the eyeris_db_to_chunked_files() function defaults, this will:

  - Process 1 million rows at a time (i.e., the default chunk size)
  - Create files up to 500MB each (i.e., the default max file size)
  - Export all data types found in your database
  - Save files to bids_dir/derivatives/eyerisdb_export/my-project/
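
For reference, the call above is equivalent to spelling out the key defaults explicitly (a sketch based on the parameters shown throughout this vignette; consult ?eyeris_db_to_chunked_files for the authoritative defaults):

result <- eyeris_db_to_chunked_files(
  bids_dir = "/path/to/your/bids/directory",
  db_path = "my-project",
  chunk_size = 1000000,     # 1 million rows per chunk (default)
  max_file_size_mb = 500,   # split output files larger than 500MB (default)
  file_format = "csv"       # CSV output (assumed default; Parquet also supported)
)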

Understanding the Output Structure

The function creates organized output files:

derivatives/eyerisdb_export/my-project/
├── my-project_timeseries_chunked_01.csv        # Single file (< 500MB)
├── my-project_events_chunked_01-of-02.csv      # Multiple files due to size
├── my-project_events_chunked_02-of-02.csv
├── my-project_confounds_summary_goal_chunked_01.csv   # Grouped by schema
├── my-project_confounds_summary_stim_chunked_01.csv   # Different column structure
├── my-project_confounds_events_chunked_01.csv
├── my-project_epoch_summary_chunked_01.csv
└── my-project_epochs_pregoal_chunked_01-of-03.csv     # Epoch-specific data
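
Once an export has finished, you can confirm what was written by listing the export directory (a minimal base R sketch; the path assumes the default output location shown above):

# list all exported files in the default output location
export_dir <- file.path(
  "/path/to/your/bids/directory",
  "derivatives", "eyerisdb_export", "my-project"
)
list.files(export_dir)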

Advanced Configuration

Controlling File Sizes

You can customize the maximum file size to create smaller, more manageable files:

# Create smaller files for easy distribution
result <- eyeris_db_to_chunked_files(
  bids_dir = "/path/to/bids",
  db_path = "large-project",
  max_file_size_mb = 100,    # 100MB files instead of 500MB
  chunk_size = 500000        # Process 500k rows at a time
)

This is particularly useful when:

  - Uploading to cloud storage with size/transfer bandwidth limits
  - Sharing data via email or file transfer services
  - Working with limited storage space
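
To confirm that no exported file exceeds your chosen limit, you can check the sizes on disk (a small base R sketch, assuming the default output location):

# report exported file sizes in MB
export_dir <- file.path("/path/to/bids", "derivatives", "eyerisdb_export", "large-project")
files <- list.files(export_dir, full.names = TRUE)
data.frame(
  file = basename(files),
  size_mb = round(file.size(files) / 1024^2, 1)
)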

Exporting Specific Data Types

For large databases, you may only need certain types of data:

# Export only pupil timeseries and events
result <- eyeris_db_to_chunked_files(
  bids_dir = "/path/to/bids", 
  db_path = "large-project",
  data_types = c("timeseries", "events"),
  subjects = c("sub-001", "sub-002", "sub-003")  # Specific subjects only
)

Available data types typically include:

  - timeseries - Preprocessed eye-tracking pupil data
  - events - Experimental events
  - epochs - Epoched data around events
  - confounds_summary - Confound variables by epoch
  - blinks - Detected blinks
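
To see which tables (and therefore data types) your own database actually contains, you can connect to it and list its tables. This sketch assumes the connection returned by eyeris_db_connect() is a standard DBI connection, so DBI::dbListTables() applies:

# connect to the database and inspect the available table names
con <- eyeris_db_connect("/path/to/bids", "large-project")
DBI::dbListTables(con)
eyeris_db_disconnect(con)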

Using Parquet Format

For better performance and compression, use Parquet format:

result <- eyeris_db_to_chunked_files(
  bids_dir = "/path/to/bids",
  db_path = "large-project",
  file_format = "parquet",
  max_file_size_mb = 200
)

Parquet advantages:

  - Smaller file sizes (often 50-80% smaller than CSV)
  - Faster reading with arrow::read_parquet()
  - Better data types (preserves numeric precision)
  - Column-oriented storage for analytics

Working with the Exported Files

Reading Single Files Back into R

# Read a single CSV file
data <- read.csv("path/to/timeseries_chunked.csv")

# Read a single Parquet file (requires arrow package)
if (requireNamespace("arrow", quietly = TRUE)) {
  data <- arrow::read_parquet("path/to/timeseries_chunked.parquet")
}

Combining Multiple Split Files

When files are split due to size limits, you can recombine them:

# Find all parts of a split dataset
files <- list.files(
  "path/to/eyerisdb_export/my-project/", 
  pattern = "timeseries_chunked_.*\\.csv$", 
  full.names = TRUE
)

# Read and combine all parts
combined_data <- do.call(rbind, lapply(files, read.csv))

# Or, for Parquet exports, use the built-in helper function
combined_data <- read_eyeris_parquet(
  parquet_dir = "path/to/eyerisdb_export/my-project/",
  data_type = "timeseries"
)
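
For Parquet exports that are too large to load all at once, the arrow package can also treat the chunked files as a single lazy dataset and only read the rows and columns you request. This is a sketch assuming the export directory contains Parquet timeseries chunks and that the columns named below exist in your data:

# find all timeseries Parquet chunks
pq_files <- list.files(
  "path/to/eyerisdb_export/my-project/",
  pattern = "timeseries_chunked_.*\\.parquet$",
  full.names = TRUE
)

# open them lazily as one dataset (nothing is read into memory yet)
ds <- arrow::open_dataset(pq_files, format = "parquet")

# pull only the rows/columns you need into memory
sub1 <- ds |>
  dplyr::filter(subject_id == "sub-001") |>
  dplyr::select(subject_id, time_secs, pupil_raw) |>
  dplyr::collect()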

Advanced Use Cases

Custom Chunk Processing

For specialized analysis, you can process chunks with custom functions:

# Connect to database directly
con <- eyeris_db_connect("/path/to/bids", "large-project")

# Define custom analysis function for pupil data
analyze_chunk <- function(chunk) {
  # Calculate summary statistics for this chunk
  stats <- data.frame(
    n_rows = nrow(chunk),
    subjects = length(unique(chunk$subject_id)),
    mean_eye_x = mean(chunk$eye_x, na.rm = TRUE),
    mean_eye_y = mean(chunk$eye_y, na.rm = TRUE), 
    mean_pupil_raw = mean(chunk$pupil_raw, na.rm = TRUE),
    mean_pupil_processed = mean(chunk$pupil_raw_deblink_detransient_interpolate_lpfilt_z, na.rm = TRUE),
    missing_pupil_pct = sum(is.na(chunk$pupil_raw)) / nrow(chunk) * 100,
    hz_modes = paste(unique(chunk$hz), collapse = ",")
  )
  
  # Save chunk summary (append to a growing file; write.csv() ignores append,
  # so use write.table() with comma separators instead)
  out_file <- "chunk_summaries.csv"
  write.table(
    stats, out_file, sep = ",", row.names = FALSE,
    col.names = !file.exists(out_file), append = file.exists(out_file)
  )
  
  return(TRUE)  # Indicate success
}

# Hypothetical example: process large timeseries dataset in chunks
result <- process_chunked_query(
  con = con,
  query = "
    SELECT subject_id, session_id, time_secs, eye_x, eye_y, 
           pupil_raw, pupil_raw_deblink_detransient_interpolate_lpfilt_z, hz
    FROM timeseries_01_enc_clamp_run01 
    WHERE pupil_raw > 0 AND eye_x IS NOT NULL 
    ORDER BY time_secs
  ",
  chunk_size = 100000,
  process_chunk = analyze_chunk
)

eyeris_db_disconnect(con)
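
Once every chunk has been processed, the per-chunk summaries can be rolled up into overall statistics. The sketch below reads back the chunk_summaries.csv file written above and weights each chunk's mean by its row count (column names match the analyze_chunk() function defined earlier):

# combine per-chunk summaries into overall statistics
chunk_stats <- read.csv("chunk_summaries.csv")

overall <- data.frame(
  total_rows = sum(chunk_stats$n_rows),
  # weight each chunk's mean by the number of rows it contributed
  mean_pupil_raw = weighted.mean(chunk_stats$mean_pupil_raw, w = chunk_stats$n_rows),
  mean_missing_pupil_pct = weighted.mean(chunk_stats$missing_pupil_pct, w = chunk_stats$n_rows)
)
print(overall)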

Handling Very Large Databases

For databases with hundreds of millions of rows:

# Optimize for very large datasets
result <- eyeris_db_to_chunked_files(
  bids_dir = "/path/to/bids",
  db_path = "massive-project", 
  chunk_size = 2000000,        # 2M rows per chunk for efficiency
  max_file_size_mb = 1000,     # 1GB files (larger but fewer files)
  file_format = "parquet",     # Better compression
  data_types = "timeseries"    # Focus on primary data type for analysis
)
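
If a single full export is still impractical, one option is to export subjects in smaller batches with repeated calls, since the subjects argument accepts a character vector. This is a sketch with placeholder subject IDs; depending on how output files are named, you may want to move each batch's files into a separate folder before starting the next batch to avoid overwriting:

# placeholder subject IDs; replace with your own
all_subjects <- sprintf("sub-%03d", 1:100)
batches <- split(all_subjects, ceiling(seq_along(all_subjects) / 25))

# export each batch of 25 subjects separately
for (batch in batches) {
  eyeris_db_to_chunked_files(
    bids_dir = "/path/to/bids",
    db_path = "massive-project",
    subjects = batch,
    data_types = "timeseries",
    file_format = "parquet"
  )
}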

Performance Tips

Optimizing Chunk Size

Larger chunks (e.g., 2 million rows) reduce per-chunk overhead and speed up processing but use more memory, while smaller chunks (e.g., 250,000 rows) keep memory usage low at the cost of speed. The 1 million row default is a reasonable starting point for most systems.

Choosing Output Format

CSV files are universally readable and easy to inspect, whereas Parquet files are typically 50-80% smaller, faster to read back with arrow, and preserve column data types. For large exports, Parquet is usually the better choice.

File Size Considerations

Smaller max_file_size_mb values produce more, smaller files that are easier to share or upload; larger values produce fewer files that are simpler to manage locally. Match the limit to wherever the files will ultimately live (e.g., cloud storage or file transfer size limits).

Troubleshooting

Memory Issues

If you encounter out-of-memory errors:

# Reduce chunk size
result <- eyeris_db_to_chunked_files(
  bids_dir = "/path/to/bids",
  db_path = "project",
  chunk_size = 250000,  # Smaller chunks
  verbose = TRUE        # Monitor progress
)

SQL Query Length Errors

The function automatically handles this by processing tables in batches. If you still encounter query-length errors, try narrowing the export, for example by requesting fewer data_types per call or passing a smaller set of subjects.

Column Structure Mismatches

The message “Set operations can only apply to expressions with the same number of result columns” indicates that tables with different column structures were combined in a single query. The export avoids this by grouping tables with matching schemas into separate output files (as with the confounds_summary_goal and confounds_summary_stim files shown above).

File Access Issues

If files are locked or in use, make sure any open database connections have been closed with eyeris_db_disconnect() and that no other program (for example, a spreadsheet application with an exported CSV open) is holding the files, then re-run the export.

Getting Help

For additional help, consult the eyeris package documentation (e.g., ?eyeris_db_to_chunked_files) and the other package vignettes.

Summary

The built-in chunked eyerisdb database export functionality provides a robust solution for working with large eyerisdb databases. Key benefits include:

  - Memory-efficient processing of millions of rows in configurable chunks
  - Automatic splitting of output into size-limited files
  - Support for both CSV and Parquet output formats
  - The ability to export only selected data types and subjects

Together, these features make it possible to work with even the largest eye-tracking and pupillometry datasets while maintaining performance and reliability, without sacrificing the ability to share high-quality, reproducible datasets that support collaborative and open research.