--- title: "Chunked eyerisdb Database Export for Large Datasets" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Chunked eyerisdb Database Export for Large Datasets} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = FALSE ) ``` ```{r setup, include=FALSE} library(eyeris) ``` ## Introduction When working with large `eyeris` databases containing millions of eye-tracking data points, traditional export methods can run into memory limitations or create unwieldy files. The chunked database export functionality in `eyeris` provides an out-of-the-box solution for handling really large `eyerisdb` databases by: - **Processing data in chunks** to avoid memory issues - **Automatically splitting large files** based on size limits - **Grouping tables by column structure** to prevent SQL errors - **Supporting both `CSV` and `Parquet` formats** for optimal performance This vignette walks through how to use these features after you've created an `eyerisdb` database using `bidsify(db_enabled = TRUE)`. ## Prerequisites Before using the chunked export functions, you need: 1. An `eyerisdb` database created with `bidsify(db_enabled = TRUE)` 2. The `arrow` package installed (for Parquet support): `install.packages("arrow")` (`arrow` is included when installing `eyeris` from CRAN) 3. Sufficient disk space for the exported files ## Basic Usage ### Simple Export with Default Settings The easiest way to export your entire database is with `eyeris_db_to_chunked_files()`: ```{r basic-export} result <- eyeris_db_to_chunked_files( bids_dir = "/path/to/your/bids/directory", db_path = "my-project" # your database name ) # view what was exported print(result) ``` Using the `eyeris_db_to_chunked_files()` function defaults, this will: - Process `1 million rows` at a time (i.e., the default chunk size) - Create files up to `500MB` each (i.e., the default max file size) - Export all data types found in your database - Save files to `bids_dir/derivatives/eyerisdb_export/my-proj/` ### Understanding the Output Structure The function creates organized output files: ``` derivatives/eyerisdb_export/my-proj/ ├── my-proj_timeseries_chunked_01.csv # Single file (< 500MB) ├── my-proj_events_chunked_01-of-02.csv # Multiple files due to size ├── my-proj_events_chunked_02-of-02.csv ├── my-proj_confounds_summary_goal_chunked_01.csv # Grouped by schema ├── my-proj_confounds_summary_stim_chunked_01.csv # Different column structure ├── my-proj_confounds_events_chunked_01.csv ├── my-proj_epoch_summary_chunked_01.csv └── my-proj_epochs_pregoal_chunked_01-of-03.csv # Epoch-specific data ``` ## Advanced Configuration ### Controlling File Sizes You can customize the maximum file size to create smaller, more manageable files: ```{r file-size-control} # Create smaller files for easy distribution result <- eyeris_db_to_chunked_files( bids_dir = "/path/to/bids", db_path = "large-project", max_file_size_mb = 100, # 100MB files instead of 500MB chunk_size = 500000 # Process 500k rows at a time ) ``` This is particularly useful when: - Uploading to cloud storage with size/transfer bandwidth limits - Sharing data via email or file transfer services - Working with limited storage space ### Exporting Specific Data Types For large databases, you may only need certain types of data: ```{r selective-export} # Export only pupil timeseries and events result <- eyeris_db_to_chunked_files( bids_dir = "/path/to/bids", db_path = "large-project", data_types = c("timeseries", "events"), subjects = c("sub-001", "sub-002", "sub-003") # Specific subjects only ) ``` Available data types typically include: - `timeseries` - Preprocessed eye-tracking pupil data - `events` - Experimental events - `epochs` - Epoched data around events - `confounds_summary` - Confound variables by epoch - `blinks` - Detected blinks ### Using Parquet Format For better performance and compression, use Parquet format: ```{r parquet-export} result <- eyeris_db_to_chunked_files( bids_dir = "/path/to/bids", db_path = "large-project", file_format = "parquet", max_file_size_mb = 200 ) ``` Parquet advantages: - **Smaller file sizes** (often 50-80% smaller than CSV) - **Faster reading** with `arrow::read_parquet()` - **Better data types** (preserves numeric precision) - **Column-oriented** storage for analytics ## Working with the Exported Files ### Reading Single Files Back into R ```{r read-files} # Read a single CSV file data <- read.csv("path/to/timeseries_chunked.csv") # Read a single Parquet file (requires arrow package) if (requireNamespace("arrow", quietly = TRUE)) { data <- arrow::read_parquet("path/to/timeseries_chunked.parquet") } ``` ### Combining Multiple Split Files When files are split due to size limits, you can recombine them: ```{r combine-files} # Find all parts of a split dataset files <- list.files( "path/to/eyerisdb_export/my-project/", pattern = "timeseries_chunked_.*\\.csv$", full.names = TRUE ) # Read and combine all parts combined_data <- do.call(rbind, lapply(files, read.csv)) # Or use the built-in helper function combined_data <- read_eyeris_parquet( parquet_dir = "path/to/eyerisdb_export/my-project/", data_type = "timeseries" ) ``` ## Advanced Use Cases ### Custom Chunk Processing For specialized analysis, you can process chunks with custom functions: ```{r custom-processing} # Connect to database directly con <- eyeris_db_connect("/path/to/bids", "large-project") # Define custom analysis function for pupil data analyze_chunk <- function(chunk) { # Calculate summary statistics for this chunk stats <- data.frame( n_rows = nrow(chunk), subjects = length(unique(chunk$subject_id)), mean_eye_x = mean(chunk$eye_x, na.rm = TRUE), mean_eye_y = mean(chunk$eye_y, na.rm = TRUE), mean_pupil_raw = mean(chunk$pupil_raw, na.rm = TRUE), mean_pupil_processed = mean(chunk$pupil_raw_deblink_detransient_interpolate_lpfilt_z, na.rm = TRUE), missing_pupil_pct = sum(is.na(chunk$pupil_raw)) / nrow(chunk) * 100, hz_modes = paste(unique(chunk$hz), collapse = ",") ) # Save chunk summary (append to growing file) write.csv(stats, "chunk_summaries.csv", append = file.exists("chunk_summaries.csv")) return(TRUE) # Indicate success } # Hypothetical example: process large timeseries dataset in chunks result <- process_chunked_query( con = con, query = " SELECT subject_id, session_id, time_secs, eye_x, eye_y, pupil_raw, pupil_raw_deblink_detransient_interpolate_lpfilt_z, hz FROM timeseries_01_enc_clamp_run01 WHERE pupil_raw > 0 AND eye_x IS NOT NULL ORDER BY time_secs ", chunk_size = 100000, process_chunk = analyze_chunk ) eyeris_db_disconnect(con) ``` ### Handling Very Large Databases For databases with hundreds of millions of rows: ```{r very-large} # Optimize for very large datasets result <- eyeris_db_to_chunked_files( bids_dir = "/path/to/bids", db_path = "massive-project", chunk_size = 2000000, # 2M rows per chunk for efficiency max_file_size_mb = 1000, # 1GB files (larger but fewer files) file_format = "parquet", # Better compression data_types = "timeseries" # Focus on primary data type for analysis ) ``` ## Performance Tips ### Optimizing Chunk Size - **Smaller chunks** (100k-500k rows): Use when memory is limited - **Medium chunks** (1M rows): Good default for most systems - **Larger chunks** (2M+ rows): Use for very large datasets with ample memory ### Choosing Output Format - **CSV**: Human-readable, universal compatibility - **Parquet**: Better performance, smaller files, preserves data types ### File Size Considerations - **Smaller files** (50-100MB): Easier to transfer, upload, email - **Medium files** (200-500MB): Good balance of convenience and efficiency - **Larger files** (1GB+): Fewer files to manage, better for local processing ## Troubleshooting ### Memory Issues If you encounter out-of-memory errors: ```{r memory-fix} # Reduce chunk size result <- eyeris_db_to_chunked_files( bids_dir = "/path/to/bids", db_path = "project", chunk_size = 250000, # Smaller chunks verbose = TRUE # Monitor progress ) ``` ### SQL Query Length Errors The function automatically handles this by processing tables in batches, but if you encounter issues: - The function processes a maximum of 50 tables per SQL query - Tables are automatically grouped by compatible column structures - Complex epoch data is separated into schema-compatible groups ### Column Structure Mismatches When you see "Set operations can only apply to expressions with the same number of result columns": - The function automatically detects and groups tables by column structure - Tables with different schemas (e.g., different epoch types) are exported separately - No manual intervention required - this is handled automatically ### File Access Issues If files are locked or in use: - Ensure no other processes are accessing the database - Check that you have write permissions to the output directory - Close any conflicting programs that might be reading the `eyerisdb` database file ## Getting Help For additional help: - Check function documentation: `?eyeris_db_to_chunked_files` - View database contents: `eyeris_db_summary(bids_dir, db_path)` - List available tables: `eyeris_db_list_tables(con)` - Monitor progress with `verbose = TRUE` ## Summary The built-in chunked `eyerisdb` database export functionality provides a robust solution for working with large `eyerisdb` databases. Key benefits include: - **Scalability**: Handle databases of any size without memory issues - **Flexibility**: Choose file formats, sizes, and data types to export - **Reliability**: Automatic error handling for common database issues - **Efficiency**: Optimized chunk processing with progress tracking This makes it possible to work with even the largest eye-tracking/pupillometry datasets while maintaining performance/reliability without sacrificing the ability to share high-quality, reproducible datasets that support collaborative and open research.