Input Data Format and File Handling in Aerith

library(Aerith)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

Overview

This vignette provides a comprehensive guide to understanding and working with various input data formats supported by the Aerith package. Aerith represents a state-of-the-art solution for stable isotope probing (SIP) proteomics analysis, offering robust data handling capabilities that surpass traditional approaches in both performance and flexibility.

What Makes Aerith Unique

The Aerith package distinguishes itself from other proteomics analysis tools through several key advantages:

Multi-format Compatibility: Seamlessly handles multiple mass spectrometry data formats including Sipros’ proprietary FT1/FT2 formats, standard mzML files, MGF files, and pepXML outputs
High-Performance Data Processing: Optimized C++ backend for efficient memory management and fast data processing
Integrated Visualization: Built-in plotting functions for quality control and data exploration
SIP-Specific Features: Specialized tools for stable isotope probing analysis, a rapidly growing field in proteomics
Workflow Integration: Designed to work seamlessly with popular proteomics pipelines including Sipros4/5

Data Format Overview

Modern proteomics experiments generate data in various formats depending on the instrument vendor and analysis software. This vignette demonstrates how Aerith handles:

FT1/FT2 files: Sipros’ proprietary formats for MS1 and MS2 data
mzML files: Open standard format supported by most mass spectrometry platforms
MGF files: Mascot Generic Format, widely used for database searching
pepXML files: Standard format for peptide identification results
Sipros output files: Tab-separated files from Sipros proteomics pipeline

File Format Conversion

Converting Raw Files to Supported Formats

Before using Aerith, Thermo or other vendors’ raw files must be converted to supported formats. This preprocessing step is crucial for ensuring optimal performance and compatibility.

Required Tools for Conversion: - Raxport: Converts Thermo raw files to FT1/FT2 formats (optimized for Aerith) - ThermoRawFileParser: Converts raw files to mzML format (open standard)

Recommended Workflow: 1. Use Raxport for FT1/FT2 conversion (preferred for Aerith workflows) 2. Use ThermoRawFileParser for mzML conversion (for broader compatibility)

Resource Links: - Aerith README - File Conversion Guide - Raxport Tool and Documentation - ThermoRawFileParser - Sipros4 Tutorial - Sipros5 Tutorial - Sipros Conda Environment

Working with FT1 and FT2 Files

FT1 and FT2 files represent Thermo Fisher’s optimized format for MS1 and MS2 data, respectively. These formats offer several advantages including faster read times and smaller file sizes compared to mzML.

Reading MS1 and MS2 Scan Data

The following example demonstrates different approaches to reading FT1 (MS1) data. Each function serves a specific purpose depending on your analysis needs:

rds <- system.file("extdata", "demo.FT1.rds", package = "Aerith")
demo_file <- tempfile(fileext = ".FT1")
writeLines(readRDS(rds), demo_file)

# Read all MS1 scans into memory
# This approach is memory-intensive but provides fastest access for subsequent operations
all_scans <- readAllScanMS1(demo_file)

# Read a specific range of scans (scan numbers 1527-1550)
# Recommended when you know the specific scans of interest
scan_range <- readScansMS1(demo_file, 1527, 1550)

# Read a single scan by scan number
# Most memory-efficient for analyzing individual scans
single_scan <- readOneScanMS1(demo_file, 1555)

# Extract real scan data from the list structure
# Converts internal format to user-friendly data structure
processed_scan <- getRealScanFromList(all_scans[[88]])
plot(processed_scan)

Parameter Selection Guidelines: - Use readAllScanMS1() when analyzing the entire dataset or when memory is not a constraint - Use readScansMS1() with specific scan ranges for targeted analysis - Use readOneScanMS1() for individual scan inspection or when building custom workflows

The plot generated shows the mass spectrum with m/z values on the x-axis and intensity on the y-axis. This visualization is crucial for quality assessment and identifying potential issues with the data.

Now let’s examine MS2 data from FT2 files:

demo_file <- system.file("extdata", "demo.FT2", package = "Aerith")

# Read all MS2 scans
all_ms2_scans <- readAllScanMS2(demo_file)

# Read specific scan range for MS2 data
ms2_range <- readScansMS2(demo_file, 1399, 1500)

# Read individual MS2 scan
single_ms2_scan <- readOneScanMS2(demo_file, 1371)

# Process and visualize MS2 spectrum
processed_ms2_scan <- getRealScanFromList(all_ms2_scans[[128]])
plot(processed_ms2_scan)

MS2 Data Interpretation: The MS2 spectrum shows fragment ions resulting from peptide fragmentation. Peak patterns in MS2 spectra are essential for peptide identification and are particularly important in SIP proteomics where isotope incorporation affects fragmentation patterns.

Creating Subset Files for Testing and Development

Creating smaller subset files is invaluable for testing workflows, debugging, and sharing data samples. This approach allows you to work with manageable datasets while preserving the original file structure.

Creating MS1 Subset Files:

rds <- system.file("extdata", "demo.FT1.rds", package = "Aerith")
demo_file <- tempfile(fileext = ".FT1")
writeLines(readRDS(rds), demo_file)

# Read file header information (essential for maintaining file integrity)
header <- readFTheader(demo_file)

# Read all scans from the original file
ft1_data <- readAllScanMS1(demo_file)

# Create output directory
output_dir <- tempdir()

# Write subset containing first 10 scans
# This preserves file format while reducing file size significantly
writeAllScanMS1(header, ft1_data[1:10], file.path(output_dir, "demo10.FT1"))
#> [1] TRUE

# Verify file creation
subset_files <- list.files(output_dir, pattern = "demo10.FT1", full.names = TRUE)
print(paste("Created subset file:", subset_files))
#> [1] "Created subset file: /tmp/RtmpxyJ0VR/demo10.FT1"

Key Advantages of This Approach: - Maintains original file format and structure - Preserves metadata and header information - Creates files suitable for method development and testing - Reduces computational requirements for iterative analysis

Creating MS2 Subset Files:

demo_file <- system.file("extdata", "demo.FT2", package = "Aerith")

# Read header and scan data
header <- readFTheader(demo_file)
ft2_data <- readAllScanMS2(demo_file)

# Create subset with first 10 MS2 scans
output_dir <- tempdir()
writeAllScanMS2(header, ft2_data[1:10], file.path(output_dir, "demo10.FT2"))
#> [1] TRUE

# Confirm successful file creation
subset_files <- list.files(output_dir, pattern = "demo10.FT2", full.names = TRUE)
print(paste("Created MS2 subset file:", subset_files))
#> [1] "Created MS2 subset file: /tmp/RtmpxyJ0VR/demo10.FT2"

Best Practices for Subset Creation: - Always include representative scans from different retention time ranges - Consider including both high and low intensity scans - Document the selection criteria for reproducibility - Verify that the subset maintains the essential characteristics of the original dataset

Working with mzML Files

mzML (mass spectrometry Markup Language) is an open standard format that provides excellent cross-platform compatibility. Aerith’s mzML support leverages the power of specialized libraries while providing a consistent interface.

Reading mzML Data:

# mzML support requires the mzR package

demo_file <- system.file("extdata", "demo.mzML", package = "Aerith")

# Read MS1 data from mzML file
# mzML files can contain both MS1 and MS2 data in a single file
mzml_ms1_data <- readMzmlMS1(demo_file)

# Extract and visualize a specific MS1 scan
ms1_spectrum <- getRealScan(16, mzml_ms1_data)
plot(ms1_spectrum)


# Read MS2 data from the same mzML file
mzml_ms2_data <- readMzmlMS2(demo_file)

# Extract and visualize a specific MS2 scan
ms2_spectrum <- getRealScan(18, mzml_ms2_data)
plot(ms2_spectrum)

mzML Format Advantages: - Universal Compatibility: Supported by virtually all mass spectrometry software - Rich Metadata: Contains comprehensive instrument and acquisition parameters - Standardized Structure: Facilitates data sharing and collaboration - Vendor Independence: Not tied to specific instrument manufacturers

Plot Interpretation: The MS1 plot displays the precursor ion survey scan, showing the overall complexity of the sample at a given retention time. The MS2 plot reveals the fragmentation pattern of a selected precursor, which is essential for peptide identification. In SIP experiments, these spectra may show characteristic isotope patterns that indicate successful labeling.

Parameter Selection for Scan Extraction: - Scan numbers correspond to the chronological order of acquisition - Choose representative scans from different retention time windows - For method development, select scans with different complexity levels - Consider precursor intensity when selecting MS2 scans for analysis

Working with MGF Files

MGF (Mascot Generic Format) files are widely used for database searching and contain MS2 spectra in a simple, text-based format. This format is particularly popular in proteomics workflows due to its simplicity and broad software support.

Reading MGF Files:

# MGF support requires the MSnbase package

demo_file <- system.file("extdata", "demo.mgf", package = "Aerith")

# Read MGF file containing MS2 spectra
# MGF files typically contain only MS2 spectra with associated metadata
mgf_data <- readMgf(demo_file)

# Extract and visualize a specific spectrum
selected_spectrum <- getRealScan(2688, mgf_data)
plot(selected_spectrum)

MGF Format Characteristics: - Database Search Ready: Designed specifically for search engines like Mascot, SEQUEST, and others - Simplified Structure: Contains only essential information for peptide identification - Metadata Rich: Includes precursor m/z, charge state, and retention time information - Text-Based: Human-readable format facilitating troubleshooting and manual inspection

Spectrum Analysis: The plotted spectrum shows the MS2 fragmentation pattern for the selected precursor. Key features to observe include: - Base Peak: The most intense fragment ion - Ion Series: Patterns of b-ions and y-ions characteristic of peptide fragmentation - Neutral Losses: Common losses like water (-18 Da) or ammonia (-17 Da) - Precursor Peak: May be visible if not completely fragmented

Selection Criteria for Scan Numbers: When choosing scan numbers for analysis, consider: - Signal Quality: Select scans with good signal-to-noise ratio - Fragmentation Efficiency: Choose spectra with rich fragmentation patterns - Precursor Intensity: Higher intensity precursors often yield better fragmentation - Charge State: Different charge states provide complementary information

Processing Peptide Identification Results

Reading pepXML Files

pepXML is a standardized format for storing peptide identification results from database searches. This format is crucial for downstream analysis and provides comprehensive information about peptide-spectrum matches (PSMs).

# pepXML parsing requires the mzR package

demo_file <- system.file("extdata", "demo.pepXML", package = "Aerith")

# Parse pepXML file to extract peptide identification results
# This creates a structured data frame with all identification information
pepxml_results <- readPepXMLtable(demo_file)

# Display structure of the results
str(pepxml_results)
#> 'data.frame':    3 obs. of  28 variables:
#>  $ psmID                    : chr  "controllerType=0 controllerNumber=1 scan=13043_PEP_2" "controllerType=0 controllerNumber=1 scan=13044_PEP_3" "controllerType=0 controllerNumber=1 scan=531_PEP_1"
#>  $ spectrumID               : chr  "controllerType=0 controllerNumber=1 scan=13043" "controllerType=0 controllerNumber=1 scan=13044" "controllerType=0 controllerNumber=1 scan=531"
#>  $ number.of.matched.peaks  : num  10 7 9
#>  $ number.of.unmatched.peaks: num  10 21 13
#>  $ X.Tandem.hyperscore      : num  18.3 15.3 8
#>  $ X.Tandem.expect          : num  0.00909 4.141 2
#>  $ chargeState              : int  2 4 2
#>  $ rank                     : int  1 1 1
#>  $ passThreshold            : logi  TRUE TRUE TRUE
#>  $ experimentalMassToCharge : num  558 411 418
#>  $ calculatedMassToCharge   : num  558 411 418
#>  $ sequence                 : chr  "GYGFITGTDGK" "ISISVDMLGSDDFIK" "KGAVATCK"
#>  $ peptideRef               : chr  "PEP_2" "PEP_3" "PEP_1"
#>  $ modNum                   : int  0 0 0
#>  $ isDecoy                  : logi  FALSE FALSE FALSE
#>  $ post                     : chr  "D" "G" "V"
#>  $ pre                      : chr  "K" "K" "K"
#>  $ start                    : int  0 0 0
#>  $ end                      : int  0 0 0
#>  $ DBseqLength              : int  0 0 0
#>  $ DatabaseSeq              : chr  "" "" ""
#>  $ scan.start.time          : chr  "1849.854" "1849.932" "379.6314"
#>  $ acquisitionNum           : num  13043 13044 531
#>  $ DatabaseAccess           : chr  "MGBC134670_02107,MGBC165238_00126" "MGBC162268_01739" "MGBC115096_01980,MGBC128936_01434,MGBC132046_01637,MGBC118143_01865"
#>  $ DatabaseDescription      : chr  "unannotated protein,unannotated protein" "hypothetical protein" "Bacterial Ig-like domain (group 2),Bacterial Ig-like domain (group 2),Bacterial Ig-like domain (group 2),unannotated protein"
#>  $ name                     : chr  NA NA NA
#>  $ mass                     : num  NA NA NA
#>  $ location                 : int  NA NA NA

pepXML Data Structure: The parsed pepXML file contains essential columns including: - Peptide Sequence: Amino acid sequence of identified peptides - Protein Accession: Database identifiers for matched proteins - Scores: Search engine specific scoring metrics - Modifications: Post-translational modifications and their positions - Spectrum Information: Links to original MS2 spectra

Advantages of pepXML Format: - Standardization: Consistent format across different search engines - Rich Metadata: Contains detailed scoring and statistical information - Protein Grouping: Maintains relationships between peptides and proteins - Modification Support: Comprehensive handling of post-translational modifications

Working with Sipros Output Files

Sipros (Stable Isotope Probing proteomics) generates specialized output formats optimized for SIP analysis. Aerith provides native support for these formats, representing a significant advantage over generic proteomics tools.

Reading PSM Files from Sipros

Sipros generates several types of output files, each serving specific analytical purposes. Aerith’s unified reading functions provide consistent access to these diverse data types.

Standard PSM Results:

demo_file <- system.file("extdata", "demo.psm.txt", package = "Aerith")

# Read peptide-spectrum match results
# Contains identification scores, modifications, and SIP-specific metrics
psm_data <- readPSMtsv(demo_file)

# Display key columns and data structure
head(psm_data)
#>             Filename ScanNumber ParentCharge MeasuredParentMass
#> 1 Pan_052322_X14.FT2      27210            2           1225.764
#> 2 Pan_052322_X13.FT2      23676            2           1312.752
#> 3 Pan_052322_X15.FT2      40863            3           2393.352
#> 4 Pan_052322_X14.FT2       8192            2            959.526
#> 5 Pan_052322_X14.FT2      16534            3           2596.417
#> 6 Pan_052322_X15.FT2      27005            3           1237.785
#>   CalculatedParentMass MassErrorDa MassErrorPPM ScanType   SearchName
#> 1             1223.758    0.001050    0.8580128      Ms2 C13_49000Pct
#> 2             1311.748   -0.000245   -0.1867737      Ms2 C13_58000Pct
#> 3             2388.338    0.002905    1.2163270      Ms2 C13_54000Pct
#> 4              957.519   -0.000330   -0.3446407      Ms2 C13_57000Pct
#> 5             2602.440    0.002770    1.0643857      Ms2 C13_51000Pct
#> 6             1238.789    0.000985    0.7951314      Ms2 C13_52000Pct
#>   ScoringFunction Score DeltaZ DeltaP           IdentifiedPeptide
#> 1       WeightSum 43.64   1.71     NA              [NILMIGPTGVGK]
#> 2       WeightSum 50.29   2.46     NA              [DGSVVVLGYTDR]
#> 3       WeightSum 64.78   1.86     NA      [IETLFTNDLDHGPYISETLR]
#> 4       WeightSum 51.29   0.87     NA                  [VMVMVDDK]
#> 5       WeightSum 46.81   0.46     NA [VETHNHPTAISPWPGAATGSGGEIR]
#> 6       WeightSum 68.29   1.21     NA                [NRDELLALIR]
#>               OriginalPeptide           ProteinNames ProteinCount TargetMatch
#> 1              [NILMIGPTGVGK] {sp|P0A6H5|HSLU_ECOLI}            1        TRUE
#> 2              [DGSVVVLGYTDR] {sp|P0A910|OMPA_ECOLI}            1        TRUE
#> 3      [IETLFTNDLDHGPYISETLR] {sp|P0A8V2|RPOB_ECOLI}            1        TRUE
#> 4                  [VMVMVDDK] {sp|P00961|SYGB_ECOLI}            1        TRUE
#> 5 [VETHNHPTAISPWPGAATGSGGEIR] {sp|P15254|PUR4_ECOLI}            1        TRUE
#> 6                [NRDELLALIR] {sp|P0AE78|CORC_ECOLI}            1        TRUE

Protein Clustering Results:

demo_file <- system.file("extdata", "demo.pro.cluster.txt", package = "Aerith")

# Read protein clustering and grouping information
# Essential for protein-level quantification and SIP analysis
protein_clusters <- readPSMtsv(demo_file)

# Examine protein grouping structure
head(protein_clusters)
#>   X               ProteinID AverageEnrichmentLevel StandardDeviation
#> 1 +   {sp|P0A7W1|RS5_ECOLI}                  58000                 0
#> 2 + {sp|P0AGD7|SRP54_ECOLI}                  48000                 0
#> 3 +   {sp|P0AEU7|SKP_ECOLI}                  48000                 0
#> 4 +  {sp|P40874|MTOX_ECOLI}                  41000                 0
#> 5 +   {sp|P00393|NDH_ECOLI}                  54000                 0
#> 6 +   {sp|P0ACJ0|LRP_ECOLI}                  46000                 0
#>   UniqueSpectrumCounts TotalSpectrumCounts
#> 1                    1                   1
#> 2                    1                   1
#> 3                    3                   3
#> 4                    1                   1
#> 5                    3                   3
#> 6                    8                   8
#>                                    EnrichmentLevels
#> 1                                           {58000}
#> 2                                           {48000}
#> 3                               {48000,48000,48000}
#> 4                                           {41000}
#> 5                               {54000,54000,54000}
#> 6 {46000,46000,46000,46000,46000,46000,46000,46000}
#>                                                                                                       ProteinDescription
#> 1             {sp|P0A7W1|RS5_ECOLI 30S ribosomal protein S5 OS=Escherichia coli (strain K12) OX=83333 GN=rpsE PE=1 SV=2}
#> 2 {sp|P0AGD7|SRP54_ECOLI Signal recognition particle protein OS=Escherichia coli (strain K12) OX=83333 GN=ffh PE=1 SV=1}
#> 3                 {sp|P0AEU7|SKP_ECOLI Chaperone protein Skp OS=Escherichia coli (strain K12) OX=83333 GN=skp PE=1 SV=1}
#> 4       {sp|P40874|MTOX_ECOLI N-methyl-L-tryptophan oxidase OS=Escherichia coli (strain K12) OX=83333 GN=solA PE=1 SV=1}
#> 5   {sp|P00393|NDH_ECOLI Type II NADH:quinone oxidoreductase OS=Escherichia coli (strain K12) OX=83333 GN=ndh PE=1 SV=2}
#> 6 {sp|P0ACJ0|LRP_ECOLI Leucine-responsive regulatory protein OS=Escherichia coli (strain K12) OX=83333 GN=lrp PE=1 SV=2}

SIP-Specific Analysis Results:

demo_file <- system.file("extdata", "demo.sip", package = "Aerith")

# Read SIP analysis results containing isotope incorporation metrics
# This file type is unique to SIP proteomics workflows
sip_results <- readPSMtsv(demo_file)

# Display SIP-specific columns
head(sip_results)
#>             Filename ScanNumber ParentCharge MeasuredParentMass
#> 1 Pan_052322_X13.FT2      35171            2           936.6362
#> 2 Pan_052322_X13.FT2      19237            2           951.5866
#> 3 Pan_052322_X13.FT2      10433            2          1240.6850
#> 4 Pan_052322_X13.FT2      36604            2          1183.7354
#> 5 Pan_052322_X13.FT2      21229            3          2281.3253
#> 6 Pan_052322_X13.FT2      24594            2          1821.0902
#>   CalculatedParentMass ScanType   SearchName ScoringFunction Rank Score
#> 1             936.6366      Ms2 C13_50000Pct       WeightSum    3 12.37
#> 2             951.5837      Ms2 C13_50000Pct       WeightSum    4  7.23
#> 3            1243.6882      Ms2 C13_50000Pct       WeightSum    4  9.94
#> 4            1189.7524      Ms2 C13_50000Pct       WeightSum    4 16.22
#> 5            2292.3554      Ms2 C13_50000Pct       WeightSum    5  7.15
#> 6            1819.0757      Ms2 C13_50000Pct       WeightSum    2  5.00
#>          IdentifiedPeptide          OriginalPeptide               ProteinNames
#> 1            K[AIADAVVKK]D            K[AIADAVVKK]D {Rev_sp|P0AFG8|ODP1_ECOLI}
#> 2             R[GLEREISK]L             R[GLEREISK]L      {sp|P0A9M0|LON_ECOLI}
#> 3           R[QHTHDAPRTR]T           R[QHTHDAPRTR]T     {sp|P0AGK8|ISCR_ECOLI}
#> 4          R[GLLFSGIEVAR]I          R[GLLFSGIEVAR]I {Rev_sp|P0A853|TNAA_ECOLI}
#> 5 K[AMNNLKDGLAKLGQTLYYAR]D K[AMNNLKDGLAKLGQTLYYAR]D     {sp|P76104|RLHA_ECOLI}
#> 6       R[DLHFKERYDVVISR]L       R[DLHFKERYDVVISR]L     {sp|P60872|YIDE_ECOLI}

Spectrum-to-Peptide Mapping:

target_file <- system.file("extdata", "demo_target.Spe2Pep.txt", package = "Aerith")

# Read spectrum-to-peptide mapping files
# These files link MS2 spectra to peptide identifications
spe2pep_data <- readSpe2Pep(target_file)

# Extract PSM information from the parsed data
psm_from_spe2pep <- spe2pep_data$PSM

# Display mapping structure
head(psm_from_spe2pep)
#>   scanNumbers precursorScanNumbers parentCharges isolationWindowCenterMZs
#> 1       12725                12728             2                  832.932
#> 2       12725                12728             2                  832.932
#> 3       12725                12728             2                  832.932
#> 4       12725                12728             2                  832.932
#> 5       12725                12728             2                  832.932
#> 6       12725                12728             2                  832.932
#>   measuredParentMasses calculatedParentMasses retentionTimes WDPscores
#> 1             1663.849               1663.849       28.53885   16.4672
#> 2             1663.849               1668.875       28.53885    4.4907
#> 3             1666.859               1671.884       28.53885    3.4707
#> 4             1663.849               1668.870       28.53885    1.9872
#> 5             1663.849               1661.846       28.53885    1.9872
#> 6             1666.859               1669.869       28.53885    1.9872
#>   XcorrScores MVHscores ranks    identifiedPeptides      originalPeptides
#> 1      1.2468   80.9063     1     R[TYDDDPTKYQDLR]V     R[TYDDDPTKYQDLR]V
#> 2      0.3837    0.0000     2    K[EDTLSDHIESYGIR]S    K[EDTLSDHIESYGIR]S
#> 3      0.2425   19.9707     3    R[IVGGYQNNDSWLDR]Y    R[IVGGYQNNDSWLDR]Y
#> 4      0.1082    0.0000     4 R[SGYSSPGSPGTPGSRSR]T R[SGYSSPGSPGTPGSRSR]T
#> 5      0.1502    0.0000     5      R[FYHDYYGENLFR]T      R[FYHDYYGENLFR]T
#> 6      0.0734    0.0000     6    R[CDILEPGTLQGYDR]D    R[CDILEPGTLQGYDR]D
#>        nakePeptides            proteinNames
#> 1     TYDDDPTKYQDLR  {sp|P0AEM9|TCYJ_ECOLI}
#> 2    EDTLSDHIESYGIR  {sp|P75978|YMFN_ECOLI}
#> 3    IVGGYQNNDSWLDR  {sp|P16869|FHUE_ECOLI}
#> 4 SGYSSPGSPGTPGSRSR         {Con_TAU_HUMAN}
#> 5      FYHDYYGENLFR  {sp|P28629|ADIA_ECOLI}
#> 6    CDILEPGTLQGYDR {sp|P0A9C5|GLN1B_ECOLI}

File Type Interpretation Guide: - PSM files: Core identification results with scoring metrics - Protein cluster files: Protein grouping and quantification data - SIP files: Isotope incorporation analysis and labeling efficiency - Spe2Pep files: Direct mapping between spectra and peptide identifications

Advantages of Sipros Integration: Aerith’s native support for Sipros formats offers several benefits: - Seamless Workflow: Direct reading without format conversion - SIP-Optimized: Preserves SIP-specific metadata and metrics - Performance: Optimized parsing for large Sipros datasets - Consistency: Uniform data structures across different file types

Quality Control and Data Visualization

Aerith provides powerful visualization tools for quality assessment and data exploration. These functions are essential for identifying potential issues and understanding dataset characteristics.

Total Ion Current (TIC) Analysis

Total Ion Current (TIC) analysis provides a comprehensive overview of instrument performance and sample complexity throughout the chromatographic separation. This analysis is crucial for identifying systematic issues and understanding data quality.

# Analyze TIC from MS1 data
rds <- system.file("extdata", "demo.FT1.rds", package = "Aerith")
demo_file <- tempfile(fileext = ".FT1")
writeLines(readRDS(rds), demo_file)
ms1_scans <- readAllScanMS1(demo_file)
ms1_tic <- getTIC(ms1_scans)

# Create TIC plot with specified retention time breaks
# The breaks parameter allows customization of the x-axis for better visualization
plotTIC(ms1_tic, seq(9, 10, by = 0.2))
#> Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
#> ℹ Please use `linewidth` instead.
#> ℹ The deprecated feature was likely used in the Aerith package.
#>   Please report the issue at <https://github.com/xyz1396/Aerith/issues>.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
#> ℹ Please use the `linewidth` argument instead.
#> ℹ The deprecated feature was likely used in the Aerith package.
#>   Please report the issue at <https://github.com/xyz1396/Aerith/issues>.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.


# Analyze TIC from MS2 data
demo_file <- system.file("extdata", "demo.FT2", package = "Aerith")
ms2_scans <- readAllScanMS2(demo_file)
ms2_tic <- getTIC(ms2_scans)

# Plot MS2 TIC with the same retention time range for comparison
plotTIC(ms2_tic, seq(9, 10, by = 0.2))

TIC Plot Interpretation: - Peak Shape: Should show smooth chromatographic peaks indicating proper separation - Baseline Stability: Consistent baseline suggests stable instrument performance - Peak Intensity: Reflects sample concentration and ionization efficiency - Peak Width: Indicates chromatographic resolution and gradient steepness

Parameter Selection Guidelines: - Retention Time Breaks: Choose intervals that highlight important features - Time Range: Focus on the active elution window to avoid empty baseline regions - Resolution: Balance between detail and overview based on analysis goals

Quality Assessment Criteria: - Consistent peak shapes across the chromatographic run - Minimal baseline drift or sudden intensity changes - Appropriate peak capacity for the experimental design - Comparable TIC profiles between technical replicates

Instrument Performance Analysis

Understanding scan frequency and timing is essential for evaluating instrument performance and data acquisition strategies. This analysis reveals the temporal distribution of MS1 and MS2 events.

# Process MS2 data to extract retention time and precursor information
demo_file <- system.file("extdata", "demo.FT2", package = "Aerith")
ms2_scan_data <- readAllScanMS2(demo_file)
ms2_retention_info <- getRetentionTimeAndPrecursorInfo(ms2_scan_data)

# Process MS1 data for comparison
rds <- system.file("extdata", "demo.FT1.rds", package = "Aerith")
demo_file <- tempfile(fileext = ".FT1")
writeLines(readRDS(rds), demo_file)
ms1_scan_data <- readAllScanMS1(demo_file)
ms1_retention_info <- getRetentionTimeAndPrecursorInfo(ms1_scan_data)

# Create combined scan frequency visualization
# This plot shows the temporal distribution of MS1 and MS2 scans
combined_plot <- plotScanFrequency(ms2_retention_info,
    binwidth = 0.1,
    breaks = seq(9, 10, by = 0.2)
) +
    plotScanFrequencyMS2(ms1_retention_info, binwidth = 0.1)

print(combined_plot)

Scan Frequency Analysis Interpretation: - MS1 Frequency: Indicates survey scan rate and data collection efficiency - MS2 Frequency: Reflects precursor selection and fragmentation efficiency - Temporal Distribution: Shows how scan events are distributed across retention time - Balance: Optimal ratio between MS1 and MS2 scans for comprehensive coverage

Parameter Optimization Guidelines: - Binwidth: Smaller values (0.05-0.1) provide higher temporal resolution - Breaks: Set to match your chromatographic gradient and peak capacity - Time Range: Focus on active elution window for meaningful analysis

Performance Indicators: - Consistent scan rates indicate stable instrument operation - Appropriate MS1/MS2 ratio ensures both survey and fragmentation coverage - Even temporal distribution suggests optimal data-dependent acquisition settings

Advanced Precursor Analysis

Precursor m/z distribution analysis provides insights into sample complexity, mass range coverage, and data acquisition effectiveness. This analysis is particularly valuable for SIP experiments where isotope patterns affect precursor masses.

demo_file <- system.file("extdata", "demo.FT2", package = "Aerith")
ms2_data <- readAllScanMS2(demo_file)
precursor_info <- getRetentionTimeAndPrecursorInfo(ms2_data)

# Create precursor m/z frequency plot
# This visualization shows how precursors are distributed across m/z and retention time
plotPrecursorMzFrequency(precursor_info,
    timeBinWidth = 0.1,
    x_breaks = seq(8, 11, by = 0.2)
)

Precursor Distribution Analysis: This 2D visualization reveals several critical aspects of the data:

Mass Range Coverage: Shows which m/z regions are well-sampled
Temporal Distribution: Reveals how precursor selection changes over time
Sample Complexity: Dense regions indicate high complexity or co-elution
Acquisition Bias: Identifies potential biases in precursor selection

Parameter Selection for Optimal Visualization: - timeBinWidth (0.1): Provides good temporal resolution for most LC gradients - x_breaks: Customized to match your retention time window of interest - Mass Range: Automatically scaled to your data range

Interpretation Guidelines: - Hot Spots: High-density regions may indicate co-eluting compounds or abundant species - Coverage Gaps: Empty regions may suggest missed opportunities or instrumental limitations - Temporal Patterns: Changes in m/z distribution over time reflect chromatographic separation - Dynamic Range: Spread of intensities indicates sample complexity

SIP-Specific Considerations: In stable isotope probing experiments, this analysis is particularly valuable for: - Identifying isotope-labeled peptides (mass shifts) - Assessing labeling efficiency across different m/z ranges - Detecting systematic biases in heavy vs. light peptide selection - Optimizing acquisition parameters for SIP workflows

Summary

This vignette demonstrates Aerith’s comprehensive data handling capabilities, showcasing its advantages over traditional proteomics tools:

Key Strengths of Aerith:

Format Versatility: Native support for multiple file formats without external dependencies
Performance Optimization: Efficient memory management and fast data processing
SIP Specialization: Purpose-built tools for stable isotope probing analysis
Integrated Visualization: Built-in quality control and exploration functions
Workflow Integration: Seamless compatibility with established proteomics pipelines

Best Practices Summary:

Always verify data quality using TIC and scan frequency analysis
Use appropriate file formats for your specific workflow requirements
Leverage subset files for method development and testing
Apply consistent parameter selection across related samples
Utilize visualization tools for both quality control and biological interpretation

The combination of robust data handling, specialized SIP features, and comprehensive visualization makes Aerith an invaluable tool for modern proteomics research, particularly in the rapidly growing field of stable isotope probing.

sessionInfo()
#> R Under development (unstable) (2025-10-20 r88955)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
#> 
#> locale:
#> [1] C
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dplyr_1.1.4    Aerith_0.99.11
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.2.1            farver_2.1.2               
#>  [3] S7_0.2.1                    fastmap_1.2.0              
#>  [5] lazyeval_0.2.2              XML_3.99-0.20              
#>  [7] digest_0.6.39               lifecycle_1.0.4            
#>  [9] cluster_2.1.8.1             ProtGenerics_1.43.0        
#> [11] statmod_1.5.1               magrittr_2.0.4             
#> [13] compiler_4.6.0              rlang_1.1.6                
#> [15] sass_0.4.10                 tools_4.6.0                
#> [17] igraph_2.2.1                yaml_2.3.11                
#> [19] data.table_1.17.8           knitr_1.50                 
#> [21] labeling_0.4.3              S4Arrays_1.11.1            
#> [23] DelayedArray_0.37.0         plyr_1.8.9                 
#> [25] RColorBrewer_1.1-3          abind_1.4-8                
#> [27] BiocParallel_1.45.0         withr_3.0.2                
#> [29] purrr_1.2.0                 BiocGenerics_0.57.0        
#> [31] grid_4.6.0                  stats4_4.6.0               
#> [33] preprocessCore_1.73.0       ggplot2_4.0.1              
#> [35] iterators_1.0.14            scales_1.4.0               
#> [37] MASS_7.3-65                 MultiAssayExperiment_1.37.2
#> [39] dichromat_2.0-0.1           SummarizedExperiment_1.41.0
#> [41] cli_3.6.5                   crayon_1.5.3               
#> [43] mzR_2.45.0                  rmarkdown_2.30             
#> [45] generics_0.1.4              reshape2_1.4.5             
#> [47] ncdf4_1.24                  affy_1.89.0                
#> [49] cachem_1.1.0                stringr_1.6.0              
#> [51] parallel_4.6.0              impute_1.85.0              
#> [53] BiocManager_1.30.27         vsn_3.79.0                 
#> [55] AnnotationFilter_1.35.0     XVector_0.51.0             
#> [57] matrixStats_1.5.0           vctrs_0.6.5                
#> [59] Matrix_1.7-4                jsonlite_2.0.0             
#> [61] IRanges_2.45.0              S4Vectors_0.49.0           
#> [63] MALDIquant_1.22.3           ggrepel_0.9.6              
#> [65] clue_0.3-66                 foreach_1.5.2              
#> [67] limma_3.67.0                jquerylib_0.1.4            
#> [69] tidyr_1.3.1                 affyio_1.81.0              
#> [71] glue_1.8.0                  MSnbase_2.37.0             
#> [73] codetools_0.2-20            QFeatures_1.21.0           
#> [75] Spectra_1.21.0              stringi_1.8.7              
#> [77] gtable_0.3.6                GenomicRanges_1.63.0       
#> [79] mzID_1.49.0                 tibble_3.3.0               
#> [81] pillar_1.11.1               pcaMethods_2.3.0           
#> [83] htmltools_0.5.8.1           Seqinfo_1.1.0              
#> [85] R6_2.6.1                    doParallel_1.0.17          
#> [87] evaluate_1.0.5              lattice_0.22-7             
#> [89] Biobase_2.71.0              bslib_0.9.0                
#> [91] MetaboCoreUtils_1.19.1      Rcpp_1.1.0                 
#> [93] PSMatch_1.15.0              SparseArray_1.11.7         
#> [95] xfun_0.54                   MsCoreUtils_1.23.1         
#> [97] MatrixGenerics_1.23.0       fs_1.6.6                   
#> [99] pkgconfig_2.0.3