Chunked eyerisdb Database Export for Large Datasets

Introduction

When working with large eyeris databases containing millions of eye-tracking data points, traditional export methods can run into memory limitations or produce unwieldy files. The chunked database export functionality in eyeris provides an out-of-the-box solution for handling very large eyerisdb databases by:

  - Processing data in configurable chunks (1 million rows at a time by default)
  - Splitting output into size-limited files (500MB each by default)
  - Supporting both CSV and Parquet output formats
  - Letting you export only the data types and subjects you need

This vignette walks through how to use these features after you’ve created an eyerisdb database using bidsify(db_enabled = TRUE).

Prerequisites

Before using the chunked export functions, you need:

  1. An eyerisdb database created with bidsify(db_enabled = TRUE)
  2. The arrow package (for Parquet support), installed via install.packages("arrow"); note that arrow is included when you install eyeris from CRAN
  3. Sufficient disk space for the exported files
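
If you are unsure whether arrow is available in your environment, a quick check like the following (a minimal base R sketch) installs it only if it is missing:

# install arrow only if it is not already available
if (!requireNamespace("arrow", quietly = TRUE)) {
  install.packages("arrow")
}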

Basic Usage

Simple Export with Default Settings

The easiest way to export your entire database is with eyeris_db_to_chunked_files():

result <- eyeris_db_to_chunked_files(
  bids_dir = "/path/to/your/bids/directory",
  db_path = "my-project"  # your database name
)

# view what was exported
print(result)

Using the eyeris_db_to_chunked_files() function defaults, this will:

  - Process 1 million rows at a time (i.e., the default chunk size)
  - Create files up to 500MB each (i.e., the default max file size)
  - Export all data types found in your database
  - Save files to bids_dir/derivatives/eyerisdb_export/my-project/
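
For reference, the call above is equivalent to spelling out the key defaults explicitly (a sketch based on the parameters shown throughout this vignette; consult ?eyeris_db_to_chunked_files for the authoritative defaults):

result <- eyeris_db_to_chunked_files(
  bids_dir = "/path/to/your/bids/directory",
  db_path = "my-project",
  chunk_size = 1000000,     # 1 million rows per chunk (default)
  max_file_size_mb = 500,   # split output files larger than 500MB (default)
  file_format = "csv"       # CSV output (assumed default; Parquet also supported)
)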

Understanding the Output Structure

The function creates organized output files:

derivatives/eyerisdb_export/my-project/
├── my-project_timeseries_chunked_01.csv        # Single file (< 500MB)
├── my-project_events_chunked_01-of-02.csv      # Multiple files due to size
├── my-project_events_chunked_02-of-02.csv
├── my-project_confounds_summary_goal_chunked_01.csv   # Grouped by schema
├── my-project_confounds_summary_stim_chunked_01.csv   # Different column structure
├── my-project_confounds_events_chunked_01.csv
├── my-project_epoch_summary_chunked_01.csv
└── my-project_epochs_pregoal_chunked_01-of-03.csv     # Epoch-specific data
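
Once an export has finished, you can confirm what was written by listing the export directory (a minimal base R sketch; the path assumes the default output location shown above):

# list all exported files in the default output location
export_dir <- file.path(
  "/path/to/your/bids/directory",
  "derivatives", "eyerisdb_export", "my-project"
)
list.files(export_dir)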

Advanced Configuration

Controlling File Sizes

You can customize the maximum file size to create smaller, more manageable files:

# Create smaller files for easy distribution
result <- eyeris_db_to_chunked_files(
  bids_dir = "/path/to/bids",
  db_path = "large-project",
  max_file_size_mb = 100,    # 100MB files instead of 500MB
  chunk_size = 500000        # Process 500k rows at a time
)

This is particularly useful when:

  - Uploading to cloud storage with size/transfer bandwidth limits
  - Sharing data via email or file transfer services
  - Working with limited storage space
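
To confirm that no exported file exceeds your chosen limit, you can check the sizes on disk (a small base R sketch, assuming the default output location):

# report exported file sizes in MB
export_dir <- file.path("/path/to/bids", "derivatives", "eyerisdb_export", "large-project")
files <- list.files(export_dir, full.names = TRUE)
data.frame(
  file = basename(files),
  size_mb = round(file.size(files) / 1024^2, 1)
)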

Exporting Specific Data Types

For large databases, you may only need certain types of data:

# Export only pupil timeseries and events
result <- eyeris_db_to_chunked_files(
  bids_dir = "/path/to/bids", 
  db_path = "large-project",
  data_types = c("timeseries", "events"),
  subjects = c("sub-001", "sub-002", "sub-003")  # Specific subjects only
)

Available data types typically include:

  - timeseries - Preprocessed eye-tracking pupil data
  - events - Experimental events
  - epochs - Epoched data around events
  - confounds_summary - Confound variables by epoch
  - blinks - Detected blinks
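
To see which tables (and therefore data types) your own database actually contains, you can connect to it and list its tables. This sketch assumes the connection returned by eyeris_db_connect() is a standard DBI connection, so DBI::dbListTables() applies:

# connect to the database and inspect the available table names
con <- eyeris_db_connect("/path/to/bids", "large-project")
DBI::dbListTables(con)
eyeris_db_disconnect(con)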

Using Parquet Format

For better performance and compression, use Parquet format:

result <- eyeris_db_to_chunked_files(
  bids_dir = "/path/to/bids",
  db_path = "large-project",
  file_format = "parquet",
  max_file_size_mb = 200
)

Parquet advantages:

  - Smaller file sizes (often 50-80% smaller than CSV)
  - Faster reading with arrow::read_parquet()
  - Better data types (preserves numeric precision)
  - Column-oriented storage for analytics

Working with the Exported Files

Reading Single Files Back into R

# Read a single CSV file
data <- read.csv("path/to/timeseries_chunked.csv")

# Read a single Parquet file (requires arrow package)
if (requireNamespace("arrow", quietly = TRUE)) {
  data <- arrow::read_parquet("path/to/timeseries_chunked.parquet")
}

Combining Multiple Split Files

When files are split due to size limits, you can recombine them:

# Find all parts of a split dataset
files <- list.files(
  "path/to/eyerisdb_export/my-project/", 
  pattern = "timeseries_chunked_.*\\.csv$", 
  full.names = TRUE
)

# Read and combine all parts
combined_data <- do.call(rbind, lapply(files, read.csv))

# Or, for Parquet exports, use the built-in helper function
combined_data <- read_eyeris_parquet(
  parquet_dir = "path/to/eyerisdb_export/my-project/",
  data_type = "timeseries"
)
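
For Parquet exports that are too large to load all at once, the arrow package can also treat the chunked files as a single lazy dataset and only read the rows and columns you request. This is a sketch assuming the export directory contains Parquet timeseries chunks and that the columns named below exist in your data:

# find all timeseries Parquet chunks
pq_files <- list.files(
  "path/to/eyerisdb_export/my-project/",
  pattern = "timeseries_chunked_.*\\.parquet$",
  full.names = TRUE
)

# open them lazily as one dataset (nothing is read into memory yet)
ds <- arrow::open_dataset(pq_files, format = "parquet")

# pull only the rows/columns you need into memory
sub1 <- ds |>
  dplyr::filter(subject_id == "sub-001") |>
  dplyr::select(subject_id, time_secs, pupil_raw) |>
  dplyr::collect()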

Advanced Use Cases

Custom Chunk Processing

For specialized analysis, you can process chunks with custom functions:

# Connect to database directly
con <- eyeris_db_connect("/path/to/bids", "large-project")

# Define custom analysis function for pupil data
analyze_chunk <- function(chunk) {
  # Calculate summary statistics for this chunk
  stats <- data.frame(
    n_rows = nrow(chunk),
    subjects = length(unique(chunk$subject_id)),
    mean_eye_x = mean(chunk$eye_x, na.rm = TRUE),
    mean_eye_y = mean(chunk$eye_y, na.rm = TRUE), 
    mean_pupil_raw = mean(chunk$pupil_raw, na.rm = TRUE),
    mean_pupil_processed = mean(chunk$pupil_raw_deblink_detransient_interpolate_lpfilt_z, na.rm = TRUE),
    missing_pupil_pct = sum(is.na(chunk$pupil_raw)) / nrow(chunk) * 100,
    hz_modes = paste(unique(chunk$hz), collapse = ",")
  )
  
  # Save chunk summary (append to a growing file; write.csv() ignores append,
  # so use write.table() with comma separators instead)
  out_file <- "chunk_summaries.csv"
  write.table(
    stats, out_file, sep = ",", row.names = FALSE,
    col.names = !file.exists(out_file), append = file.exists(out_file)
  )
  
  return(TRUE)  # Indicate success
}

# Hypothetical example: process large timeseries dataset in chunks
result <- process_chunked_query(
  con = con,
  query = "
    SELECT subject_id, session_id, time_secs, eye_x, eye_y, 
           pupil_raw, pupil_raw_deblink_detransient_interpolate_lpfilt_z, hz
    FROM timeseries_01_enc_clamp_run01 
    WHERE pupil_raw > 0 AND eye_x IS NOT NULL 
    ORDER BY time_secs
  ",
  chunk_size = 100000,
  process_chunk = analyze_chunk
)

eyeris_db_disconnect(con)
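
Once every chunk has been processed, the per-chunk summaries can be rolled up into overall statistics. The sketch below reads back the chunk_summaries.csv file written above and weights each chunk's mean by its row count (column names match the analyze_chunk() function defined earlier):

# combine per-chunk summaries into overall statistics
chunk_stats <- read.csv("chunk_summaries.csv")

overall <- data.frame(
  total_rows = sum(chunk_stats$n_rows),
  # weight each chunk's mean by the number of rows it contributed
  mean_pupil_raw = weighted.mean(chunk_stats$mean_pupil_raw, w = chunk_stats$n_rows),
  mean_missing_pupil_pct = weighted.mean(chunk_stats$missing_pupil_pct, w = chunk_stats$n_rows)
)
print(overall)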

Handling Very Large Databases

For databases with hundreds of millions of rows:

# Optimize for very large datasets
result <- eyeris_db_to_chunked_files(
  bids_dir = "/path/to/bids",
  db_path = "massive-project", 
  chunk_size = 2000000,        # 2M rows per chunk for efficiency
  max_file_size_mb = 1000,     # 1GB files (larger but fewer files)
  file_format = "parquet",     # Better compression
  data_types = "timeseries"    # Focus on primary data type for analysis
)
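
If a single full export is still impractical, one option is to export subjects in smaller batches with repeated calls, since the subjects argument accepts a character vector. This is a sketch with placeholder subject IDs; depending on how output files are named, you may want to move each batch's files into a separate folder before starting the next batch to avoid overwriting:

# placeholder subject IDs; replace with your own
all_subjects <- sprintf("sub-%03d", 1:100)
batches <- split(all_subjects, ceiling(seq_along(all_subjects) / 25))

# export each batch of 25 subjects separately
for (batch in batches) {
  eyeris_db_to_chunked_files(
    bids_dir = "/path/to/bids",
    db_path = "massive-project",
    subjects = batch,
    data_types = "timeseries",
    file_format = "parquet"
  )
}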

Performance Tips

Optimizing Chunk Size

Larger chunks (e.g., 2 million rows) reduce per-chunk overhead and speed up processing but use more memory, while smaller chunks (e.g., 250,000 rows) keep memory usage low at the cost of speed. The 1 million row default is a reasonable starting point for most systems.

Choosing Output Format

CSV files are universally readable and easy to inspect, whereas Parquet files are typically 50-80% smaller, faster to read back with arrow, and preserve column data types. For large exports, Parquet is usually the better choice.

File Size Considerations

Smaller max_file_size_mb values produce more, smaller files that are easier to share or upload; larger values produce fewer files that are simpler to manage locally. Match the limit to wherever the files will ultimately live (e.g., cloud storage or file transfer size limits).

Troubleshooting

Memory Issues

If you encounter out-of-memory errors:

# Reduce chunk size
result <- eyeris_db_to_chunked_files(
  bids_dir = "/path/to/bids",
  db_path = "project",
  chunk_size = 250000,  # Smaller chunks
  verbose = TRUE        # Monitor progress
)

SQL Query Length Errors

The function automatically handles this by processing tables in batches. If you still encounter query-length errors, try narrowing the export, for example by requesting fewer data_types per call or passing a smaller set of subjects.

Column Structure Mismatches

The message “Set operations can only apply to expressions with the same number of result columns” indicates that tables with different column structures were combined in a single query. The export avoids this by grouping tables with matching schemas into separate output files (as with the confounds_summary_goal and confounds_summary_stim files shown above).

File Access Issues

If files are locked or in use, make sure any open database connections have been closed with eyeris_db_disconnect() and that no other program (for example, a spreadsheet application with an exported CSV open) is holding the files, then re-run the export.

Getting Help

For additional help, consult the eyeris package documentation (e.g., ?eyeris_db_to_chunked_files) and the other package vignettes.

Summary

The built-in chunked eyerisdb database export functionality provides a robust solution for working with large eyerisdb databases. Key benefits include:

  - Memory-efficient processing of millions of rows in configurable chunks
  - Automatic splitting of output into size-limited files
  - Support for both CSV and Parquet output formats
  - The ability to export only selected data types and subjects

Together, these features make it possible to work with even the largest eye-tracking and pupillometry datasets while maintaining performance and reliability, without sacrificing the ability to share high-quality, reproducible datasets that support collaborative and open research.