--- title: "Working with eyeris Databases: A Complete Guide" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Working with eyeris Databases: A Complete Guide} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = FALSE ) ``` ```{r setup, include=FALSE} library(eyeris) library(dplyr) library(DBI) ``` # Introduction The `eyeris` package provides powerful database functionality that serves as an alternative to CSV file storage for your pupillometry data. This guide will walk you through the complete database workflow, from creation to data extraction, and explain why you might want to choose database storage over traditional CSV files. ## Why Use Databases Instead of CSV Files? ### The Case for Database Storage **1. Scalability and Performance** - **Massive datasets**: When processing hundreds or thousands of subjects, CSV files can become unwieldy. A single database file can efficiently store data from thousands of participants. - **Faster queries**: Instead of reading multiple CSV files and filtering in `R`, databases can filter and aggregate data at the storage level, dramatically reducing memory usage and processing time. - **Concurrent access**: Multiple researchers can query the same database simultaneously without file locking issues. **2. Cloud Computing Advantages** - **Reduced I/O costs**: On cloud platforms (AWS, GCP, Azure), reading many small CSV files incurs significant I/O costs. A single database file reduces these costs dramatically. - **Bandwidth efficiency**: Transferring one database file is much faster than syncing thousands of CSV files. - **Storage optimization**: Databases compress data more efficiently than individual CSV files. **3. Data Integrity and Management** - **Schema validation**: Built-in data type validation prevents corruption. - **Metadata storage**: Databases automatically track creation timestamps, data types, and relationships. **4. Analysis Efficiency** - **`SQL` queries**: Leverage the power of `SQL` for complex filtering, aggregation, and joins. - **Memory efficiency**: Load only the data you need instead of entire datasets. - **Cross-platform compatibility**: `DuckDB` databases work across `R`, `Python`, and other analysis platforms. 
### When to Use CSV vs a Database

**Use CSV files when:**

- Small datasets (< 10 subjects)
- Simple analysis workflows
- Need human-readable data formats
- Working with external tools that require CSV input

**Use databases when:**

- Large datasets (> 20 subjects)
- Complex queries across multiple data types
- Cloud computing environments
- Collaborative research projects
- Long-term data storage and archival

# Getting Started: Creating Your First `eyeris` Database

## Basic Database Creation

The simplest way to create an `eyeris` database is by enabling database output during the `bidsify()` process:

```{r basic-creation}
# load your EyeLink eye-tracking data ASC file
eyeris_data <- load_asc("path/to/your/data.asc")

# preprocess and epoch your data with eyeris glassbox functions
processed_data <- eyeris_data %>%
  glassbox() %>%
  epoch(
    events = "TRIAL_START_{trial_type}_{trial_number}",
    limits = c(-0.5, 2.0),
    label = "trial_epochs"
  )

# enable the eyeris database alongside CSV files
bidsify(
  processed_data,
  bids_dir = "~/my_eyetracking_study",
  participant_id = "001",
  session_num = "01",
  task_name = "memory_task",
  csv_enabled = TRUE, # still create CSV files
  db_enabled = TRUE, # while also creating your eyeris project database
  db_path = "study_database" # creates study_database.eyerisdb
)
```

## Database-Only Workflow (Cloud Optimized)

For cloud computing or large-scale processing, you can *(and probably should)* consider skipping CSV creation entirely:

```{r cloud-workflow}
bidsify(
  processed_data,
  bids_dir = "~/my_eyetracking_study",
  participant_id = "001",
  session_num = "01",
  task_name = "memory_task",
  csv_enabled = FALSE, # skip CSV creation
  db_enabled = TRUE, # use an eyeris project database only
  db_path = "study_database"
)
```

## Batch Processing Multiple Subjects

Here's how to efficiently process multiple subjects into a single database:

```{r batch-processing}
subjects <- c("001", "002", "003", "004", "005")
data_dir <- "~/raw_eyetracking_data"
bids_dir <- "~/processed_study_data"

for (subject_id in subjects) {
  cat("Processing subject", subject_id, "\n")

  subject_data <- file.path(
    data_dir,
    paste0("sub-", subject_id),
    "eye",
    paste0("sub-", subject_id, ".asc")
  ) %>%
    glassbox() %>%
    epoch(
      events = "STIMULUS_{condition}_{trial}",
      limits = c(-1, 3),
      label = "stimulus_response"
    )

  # then add to the eyeris database (which automatically handles subject cleanup)
  bidsify(
    subject_data,
    bids_dir = bids_dir,
    participant_id = subject_id,
    session_num = "01",
    task_name = "attention_task",
    csv_enabled = FALSE,
    db_enabled = TRUE,
    db_path = "attention_study_db"
  )
}
```

# Connecting to and Exploring `eyeris` Databases

## Basic Database Connection

```{r connection}
con <- eyeris_db_connect(
  bids_dir = "~/processed_study_data",
  db_path = "attention_study_db" # will look for attention_study_db.eyerisdb
)

# be sure to always disconnect when done with eyeris_db_disconnect(con);
# inside a function, wrap the call in on.exit() to guarantee cleanup
```

## `eyeris` Database Overview and Exploration

```{r exploration}
# first get a comprehensive summary of your eyeris project database
summary_info <- eyeris_db_summary(
  "~/processed_study_data",
  "attention_study_db"
)

summary_info$subjects # all subjects in database
summary_info$data_types # available data types
summary_info$sessions # session information
summary_info$tasks # task names
summary_info$total_tables # total number of tables

# list all available tables
all_tables <- eyeris_db_list_tables(con)
print(all_tables)

# filter tables by data type
timeseries_tables <- eyeris_db_list_tables(con, data_type = "timeseries")
confounds_tables <- eyeris_db_list_tables(con, data_type = "run_confounds")

# filter tables by subject
subject_001_tables <- eyeris_db_list_tables(con, subject = "001")
```
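Because table names embed their metadata (data type, subject, session, task, run, epoch label, and eye, as in `epochs_001_01_attention_task_run01_stimulus_response_eyeL`), you can also slice the table list with ordinary string matching when the built-in filters don't cover what you need. A minimal sketch, assuming `all_tables` from the chunk above is a character vector of table names:

```{r table-name-matching}
# all tables from run 1, regardless of data type or subject
run01_tables <- grep("_run01_", all_tables, value = TRUE)

# all epoch tables for subject 001, regardless of eye
sub001_epochs <- grep("^epochs_001_", all_tables, value = TRUE)
```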
data_type = "timeseries") confounds_tables <- eyeris_db_list_tables(con, data_type = "run_confounds") # filter tables by subject subject_001_tables <- eyeris_db_list_tables(con, subject = "001") ``` # Data Extraction: From Simple to Advanced ## Simple Data Extraction The `eyeris_db_collect()` function provides easy one-liner access to your data: ```{r simple-extraction} # extract ALL data for ALL subjects (returns a named list) all_data <- eyeris_db_collect("~/processed_study_data", "attention_study_db") # view available data types names(all_data) # access specific data types timeseries_data <- all_data$timeseries events_data <- all_data$events confounds_data <- all_data$run_confounds ``` ## Targeted Data Extraction ```{r targeted-extraction} # extract data for specific subjects only subset_subjects <- eyeris_db_collect( bids_dir = "~/processed_study_data", db_path = "attention_study_db", subjects = c("001", "002", "003") ) # extract specific data types only behavioral_data <- eyeris_db_collect( bids_dir = "~/processed_study_data", db_path = "attention_study_db", data_types = c("events", "epochs", "confounds_summary") ) # extract data for specific sessions and tasks session_01_data <- eyeris_db_collect( bids_dir = "~/processed_study_data", db_path = "attention_study_db", sessions = "01", tasks = "attention_task" ) ``` ## Working with Binocular Data ```{r binocular-extraction} # extract data from both eyes binocular_data <- eyeris_db_collect( bids_dir = "~/processed_study_data", db_path = "attention_study_db" ) # the function automatically combines left and right eye data # check if you have binocular data unique(binocular_data$timeseries$eye_suffix) # should show "eyeL" and "eyeR" # extract data for specific eye only left_eye_data <- eyeris_db_collect( bids_dir = "~/processed_study_data", db_path = "attention_study_db", eye_suffixes = "eyeL" ) ``` ## Epoch-Specific Extraction ```{r epoch-extraction} # extract specific epoch data trial_epochs <- eyeris_db_collect( bids_dir = "~/processed_study_data", db_path = "attention_study_db", data_types = c("epochs", "confounds_events", "confounds_summary"), epoch_labels = "stimulus_response" # match your epoch label ) # multiple epoch types multiple_epochs <- eyeris_db_collect( bids_dir = "~/processed_study_data", db_path = "attention_study_db", data_types = "epochs", epoch_labels = c("stimulus_response", "baseline_period") ) ``` ## Output Format Options ```{r output-formats} list_format <- eyeris_db_collect("~/processed_study_data") # access individual data types pupil_data <- list_format$timeseries trial_data <- list_format$epochs ``` # Advanced Database Operations ## Direct SQL Queries For maximum flexibility, you can also write and execute custom SQL queries: ```{r sql-queries} # first connect to your eyeris project database con <- eyeris_db_connect("~/processed_study_data", "attention_study_db") # write your custom SQL query custom_query <- " SELECT subject_id, session_id, task_name, AVG(pupil_raw_deblink_detransient_interpolate_lpfilt_z) as mean_pupil, COUNT(*) as n_samples FROM timeseries_001_01_attention_task_run01_eyeL WHERE pupil_clean IS NOT NULL GROUP BY subject_id, session_id, task_name " results <- DBI::dbGetQuery(con, custom_query) print(results) # a complex cross-table query complex_query <- " SELECT e.subject_id, e.matched_event, e.text_unique, AVG(t.pupil_raw_deblink_detransient_interpolate_lpfilt_z) as mean_pupil_in_epoch, c.blink_rate_hz FROM epochs_001_01_attention_task_run01_stimulus_response_eyeL e JOIN 
## Reading Individual Tables

```{r individual-tables}
# read a specific table directly
specific_table <- eyeris_db_read(
  con = con,
  table_name = "timeseries_001_01_attention_task_run01_eyeL"
)

# read from the eyeris database with filters
filtered_data <- eyeris_db_read(
  con = con,
  data_type = "events",
  subject = "001",
  session = "01",
  task = "attention_task"
)

# read epoch data from the eyeris database with a specific epoch label
epoch_data <- eyeris_db_read(
  con = con,
  data_type = "epochs",
  subject = "001",
  epoch_label = "stimulus_response"
)
```

# Real-World Analysis Examples

## Example 1: Pupil Response Analysis Across Subjects

```{r analysis-example-1}
# extract all timeseries data
pupil_data <- eyeris_db_collect(
  "~/processed_study_data",
  "attention_study_db",
  data_types = "timeseries"
)$timeseries

# analyze pupil responses by subject and session
pupil_summary <- pupil_data %>%
  filter(!is.na(pupil_raw_deblink_detransient_interpolate_lpfilt_z)) %>%
  group_by(subject_id, session_id) %>%
  summarise(
    mean_pupil = mean(pupil_raw_deblink_detransient_interpolate_lpfilt_z),
    sd_pupil = sd(pupil_raw_deblink_detransient_interpolate_lpfilt_z),
    samples_per_subject = n(),
    .groups = "drop"
  )

print(pupil_summary)

# compare to loading individual CSV files (which should be much slower!)
# csv_files <- list.files("~/processed_study_data",
#                         pattern = "timeseries.*\\.csv$",
#                         recursive = TRUE, full.names = TRUE)
# csv_data <- purrr::map_dfr(csv_files, readr::read_csv)
```

## Example 2: Quality Control and Confounds Analysis

```{r analysis-example-2}
# extract confounds data for quality control
confounds_data <- eyeris_db_collect(
  "~/processed_study_data",
  "attention_study_db",
  data_types = c("run_confounds", "confounds_summary")
)

# identify subjects with poor data quality
quality_control <- confounds_data$run_confounds %>%
  group_by(subject_id, session_id) %>%
  summarise(
    mean_blink_rate = mean(blink_rate_hz, na.rm = TRUE),
    mean_prop_invalid = mean(prop_invalid, na.rm = TRUE),
    mean_gaze_variance = mean(gaze_x_var_px, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(
    high_blink_rate = mean_blink_rate > 0.5, # arbitrary thresholds
    high_invalid_data = mean_prop_invalid > 0.3,
    high_gaze_variance = mean_gaze_variance > 10000,
    exclude_subject = high_blink_rate | high_invalid_data | high_gaze_variance
  )

# then view the subjects recommended for exclusion
exclude_list <- quality_control %>%
  filter(exclude_subject) %>%
  select(subject_id, session_id, exclude_subject)

print(exclude_list)
```
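Once you have an exclusion list, you can apply it directly to any other extracted data type. A minimal sketch using `dplyr::anti_join()`, assuming `exclude_list` from the chunk above and that the epochs table carries the same `subject_id`/`session_id` columns used throughout this guide:

```{r apply-exclusions}
# pull epoched data for all subjects
epochs_data <- eyeris_db_collect(
  "~/processed_study_data",
  "attention_study_db",
  data_types = "epochs"
)$epochs

# keep only rows whose subject/session pair is NOT in the exclusion list
clean_epochs <- epochs_data %>%
  anti_join(exclude_list, by = c("subject_id", "session_id"))
```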
# Performance Comparison: Database vs CSV

## Speed and Memory Benchmarks

```{r performance-comparison}
# benchmark the database approach
system.time({
  db_data <- eyeris_db_collect(
    "~/processed_study_data",
    "attention_study_db",
    subjects = c("001", "002", "003", "004", "005"),
    data_types = "timeseries"
  )
})

# benchmark the CSV approach
# system.time({
#   csv_files <- list.files("~/processed_study_data",
#                           pattern = "timeseries.*\\.csv$",
#                           recursive = TRUE, full.names = TRUE)
#   csv_data <- purrr::map_dfr(csv_files[1:5], readr::read_csv) # only first 5 subjects
# })

# memory usage comparison
object.size(db_data) # database extraction
# object.size(csv_data) # CSV loading

# file size comparison
db_file_size <- file.size(
  "~/processed_study_data/derivatives/attention_study_db.eyerisdb"
)
csv_total_size <- sum(file.size(list.files("~/processed_study_data",
                                           pattern = "\\.csv$",
                                           recursive = TRUE,
                                           full.names = TRUE)))

cat("Database file size:", round(db_file_size / 1024^2, 2), "MB\n")
cat("Total CSV file size:", round(csv_total_size / 1024^2, 2), "MB\n")
cat(
  "Storage efficiency:",
  round(db_file_size / csv_total_size * 100, 1),
  "% of CSV size\n"
)
```

# Best Practices and Tips

## Database Management

```{r best-practices}
# 1. Always use descriptive database names
bidsify(data, db_path = "study_name_pilot_2024") # good
# bidsify(data, db_path = "my-project") # default, not descriptive

# 2. Use database-only mode for large studies
bidsify(data, csv_enabled = FALSE, db_enabled = TRUE) # efficient

# 3. Create separate databases for different experiments
bidsify(data, db_path = "experiment_1_attention")
bidsify(data, db_path = "experiment_2_memory")

# 4. Always disconnect from databases
con <- eyeris_db_connect("~/data", "study_db")
# ... do your work ...
# then disconnect ...
eyeris_db_disconnect(con)

# or use on.exit for automatic cleanup
process_data <- function() {
  con <- eyeris_db_connect("~/data", "study_db")
  on.exit(eyeris_db_disconnect(con))

  # ... your analysis code here ...
  results <- eyeris_db_collect("~/data", "study_db")

  return(results)
}
```

## Cloud Computing Optimization

```{r cloud-optimization}
# NOTE: download_from_cloud(), upload_to_cloud(), and
# run_statistical_analysis() below are placeholders for your own
# storage-transfer and analysis functions

# 1. Use a database-only workflow to minimize I/O costs
process_cloud_data <- function(subject_list, input_bucket, output_bucket) {
  for (subject in subject_list) {
    # ... for demo purposes only -- download raw data ...
    local_file <- download_from_cloud(subject, input_bucket)

    # ... process and add to the database (no CSV files) ...
    eyeris_data <- glassbox(local_file) %>%
      epoch(...) # epoch settings elided

    bidsify(
      eyeris_data,
      bids_dir = "local_processing",
      participant_id = subject,
      csv_enabled = FALSE, # skip CSV for cloud efficiency
      db_enabled = TRUE,
      db_path = "cloud_study_db"
    )

    # clean up local files
    unlink(local_file)
  }

  # upload the final database back to your cloud
  upload_to_cloud("cloud_study_db.eyerisdb", output_bucket)
}

# 2. Use the database for distributed analysis
analyze_cloud_data <- function(target_subjects) {
  # download only the database file
  download_from_cloud("cloud_study_db.eyerisdb")

  # extract only the data you need
  analysis_data <- eyeris_db_collect(
    "local_processing",
    "cloud_study_db",
    data_types = c("epochs", "confounds_summary"),
    subjects = target_subjects
  )

  # run analysis on the extracted subset
  results <- run_statistical_analysis(analysis_data)
  return(results)
}
```

## Error Handling and Debugging

```{r error-handling}
# safe eyeris project database operations with error handling
safe_extract <- function(bids_dir, db_path, ...) {
  tryCatch({
    data <- eyeris_db_collect(bids_dir, db_path, ...)
    return(data)
  }, error = function(e) {
    cat("Error extracting data:", e$message, "\n")

    # first check if the eyeris project database exists
    db_file <- file.path(bids_dir, "derivatives", paste0(db_path, ".eyerisdb"))
    if (!file.exists(db_file)) {
      cat("eyeris project database file not found:", db_file, "\n")
      return(NULL)
    }

    # try connecting to your eyeris project database
    con <- tryCatch({
      eyeris_db_connect(bids_dir, db_path)
    }, error = function(e2) {
      cat("Cannot connect to eyeris project database:", e2$message, "\n")
      return(NULL)
    })

    if (!is.null(con)) {
      # list available tables for debugging
      tables <- eyeris_db_list_tables(con)
      cat("Available tables:\n")
      print(tables)
      eyeris_db_disconnect(con)
    }

    return(NULL)
  })
}

# example usage
data <- safe_extract("~/my_study", "study_database",
                     subjects = c("001", "002"),
                     data_types = "timeseries")
```
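For quick interactive debugging, two generic `DBI` checks are also handy before running queries against a connection. A short sketch; the table name is the example one used throughout this guide:

```{r connection-checks}
con <- eyeris_db_connect("~/processed_study_data", "attention_study_db")

# is the connection still alive?
DBI::dbIsValid(con)

# does a specific table exist before you query it directly?
DBI::dbExistsTable(con, "timeseries_001_01_attention_task_run01_eyeL")

eyeris_db_disconnect(con)
```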
# Migration and Interoperability

## Converting Existing CSV Data to Database

```{r csv-to-database}
# if you have existing eyeris-derived CSV files and want to migrate to a database
migrate_csv_to_database <- function(bids_dir, db_path) {
  # ... find all CSV files ...
  csv_files <- list.files(bids_dir,
                          pattern = "\\.csv$",
                          recursive = TRUE,
                          full.names = TRUE)

  # ... connect to an eyeris database ...
  con <- eyeris_db_connect(bids_dir, db_path)
  on.exit(eyeris_db_disconnect(con))

  for (csv_file in csv_files) {
    cat("Processing:", basename(csv_file), "\n")

    # parse filename to extract metadata -- a helper you write yourself
    # ... (which of course depends on your CSV naming convention) ...
    filename_parts <- parse_bids_filename(basename(csv_file))

    # ... read CSV data ...
    csv_data <- read.csv(csv_file)

    # ... then write to the eyeris project database ...
    write_eyeris_data_to_db(
      data = csv_data,
      con = con,
      data_type = filename_parts$data_type,
      sub = filename_parts$subject,
      ses = filename_parts$session,
      task = filename_parts$task
      # ... other parameters
    )
  }

  cat("Migration complete!\n")
}
```

## Exporting Database Data Back to CSV

```{r database-to-csv}
# export specific data back to CSV format (if needed)
export_database_subset <- function(bids_dir, db_path, output_dir) {
  # ... extract data from the eyeris project database ...
  data <- eyeris_db_collect(bids_dir, db_path,
                            subjects = c("001", "002"),
                            data_types = c("timeseries", "events"))

  # ... create an output directory ...
  dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

  # ... then export each data type ...
  for (data_type in names(data)) {
    filename <- file.path(output_dir, paste0(data_type, "_subset.csv"))
    write.csv(data[[data_type]], filename, row.names = FALSE)
    cat("Exported:", filename, "\n")
  }
}
```

# Conclusion

The `eyeris` database functionality provides a powerful, scalable alternative to CSV files for eye-tracking data storage and analysis. Key advantages include:

- **Performance**: Faster queries and reduced memory usage
- **Scalability**: Handle data from hundreds of subjects efficiently
- **Cloud optimization**: Reduced I/O costs and bandwidth usage
- **Flexibility**: `SQL` queries and cross-platform compatibility

Whether you're running a small pilot study or a large-scale multi-session experiment, `eyeris` databases can streamline your workflow and improve analysis efficiency. Start with the simple `eyeris_db_collect()` function for most use cases, and leverage direct `SQL` queries when you need maximum flexibility.

For questions or issues with database functionality, please refer to the `eyeris` documentation or open an issue on the GitHub repository.

# Session Information

```{r session-info}
sessionInfo()
```