---
title: "PSM Annotation and Visualization"
output:
  rmarkdown::html_document:
    toc: true
    toc_float: true
    theme: united
vignette: >
  %\VignetteIndexEntry{PSM-annotation-and-visualization}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    fig.width = 12
)
```

```{r setup}
library(Aerith)
library(dplyr)
library(stringr)
library(ggplot2)
```

```{r, include = FALSE, eval=FALSE}
devtools::load_all()
rmarkdown::render("PSM-annotation-and-visualization.Rmd", output_dir = "../doc/")
```

## Introduction

Peptide-spectrum match (PSM) annotation and visualization are fundamental components of proteomics data analysis, particularly in stable isotope probing (SIP) experiments. This vignette demonstrates how to use the Aerith package to annotate and visualize PSMs from both unlabeled and stable isotope labeled samples.

### Overview of PSM Annotation

PSM annotation involves matching theoretical peptide fragmentation patterns with observed mass spectrometry data. In SIP experiments, this process becomes more complex due to isotopic labeling, which shifts the mass-to-charge (m/z) ratios of peptide fragments. Aerith addresses this challenge by:

1. **Accurate isotopic modeling**: Incorporating probabilistic models for isotope incorporation
2. **Comprehensive fragment annotation**: Supporting B and Y ion series with multiple charge states
3. **Visual validation**: Providing publication-ready plots for manual inspection and validation
4. **Batch processing**: Enabling high-throughput analysis of multiple PSMs

### Advantages of Aerith

The Aerith package offers several advantages over existing tools:

- **SIP-specific algorithms**: Purpose-built for stable isotope probing experiments
- **Probabilistic isotope modeling**: Accounts for partial isotope incorporation
- **Integrated visualization**: Seamless transition from annotation to publication-ready plots
- **Multi-level analysis**: Supports both MS1 (precursor) and MS2 (fragment) level analysis
- **Flexible parameterization**: Allows fine-tuning for different experimental conditions

This vignette demonstrates these capabilities using two representative examples: an unlabeled PSM at natural ^13^C abundance (1.07%) and a heavily labeled PSM at roughly 50% ^13^C incorporation (modeled below with probabilities of 0.50 to 0.52).

## Analysis of Unlabeled PSM at Natural ^13^C Abundance (1.07%)

This section demonstrates PSM annotation for a peptide at natural ^13^C abundance, representing a control or unlabeled condition in SIP experiments. The unlabeled result provides a baseline for comparison with isotope-enriched samples.

### Fragment Ion Annotation at MS2 Level

The first step in PSM validation involves annotating observed MS2 peaks with theoretical B and Y ion fragments. This process identifies which theoretical fragments are actually observed in the spectrum and assesses the quality of the peptide identification.

```{r}
demo_file <- system.file("extdata", "107728.FT2", package = "Aerith")
scan1 <- readOneScanMS2(ftFile = demo_file, 107728)
anno <- annotatePSM(
    scan1$peaks$mz, scan1$peaks$intensity,
    scan1$peaks$charge,
    "HSQVFSTAEDNQSAVTIHVLQGER", 1:2, "C13",
    0.0107, scan1$isolationWindowCenterMZ, 4.0
)
head(anno$ExpectedBYions[anno$ExpectedBYions$matchedIndices != -1, ])
residuePos <- anno$ExpectedBYions$residuePositions[anno$ExpectedBYions$matchedIndices != -1]
table(residuePos)
```

**Parameter Explanation:**

- `scan1$peaks$mz, scan1$peaks$intensity, scan1$peaks$charge`: Observed peak data from the MS2 spectrum
- `"HSQVFSTAEDNQSAVTIHVLQGER"`: The peptide sequence to be annotated
- `1:2`: Charge states to consider for fragment ions (singly and doubly charged)
- `"C13"`: The isotope being tracked (^13^C in this case)
- `0.0107`: The isotope incorporation probability (1.07% for natural ^13^C abundance)
- `scan1$isolationWindowCenterMZ`: The m/z center of the isolation window used for MS2
- `4.0`: The isolation window width in Da for precursor selection

**Result Interpretation:**

The annotation results show which theoretical B and Y ions match observed peaks. In the `matchedIndices` column, `-1` marks a theoretical ion with no matching peak, and any other value identifies the matched observed peak. The `residuePositions` column gives the residue position in the peptide associated with each fragment. The table of residue positions therefore shows how the matched fragments are distributed along the sequence, which helps assess PSM confidence.
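
As a minimal sketch of how these two columns can be summarized, the chunk below builds a small hypothetical data frame that mimics only the `matchedIndices` and `residuePositions` columns. The values are invented for illustration, and the `toy_` names are not part of Aerith. It computes the fraction of theoretical ions that were matched and the fraction of residue positions supported by at least one match.

```{r}
# Hypothetical stand-in for anno$ExpectedBYions; all values are made up.
toy_ions <- data.frame(
    residuePositions = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
    matchedIndices = c(-1, 12, 30, -1, 57, 8, -1, 41, -1, 66)
)
# Unmatched ions are flagged with -1, as in the filter used above.
toy_matched <- toy_ions[toy_ions$matchedIndices != -1, ]
# Share of theoretical ions that found an observed peak.
toy_match_rate <- nrow(toy_matched) / nrow(toy_ions)
# Share of residue positions supported by at least one matched ion.
toy_coverage <- length(unique(toy_matched$residuePositions)) /
    length(unique(toy_ions$residuePositions))
c(matchRate = toy_match_rate, positionCoverage = toy_coverage)
```

The same two ratios could be computed on the real `anno$ExpectedBYions` table to compare PSMs, although suitable thresholds depend on the instrument and peptide.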

### PSM Annotation Visualization

The visualization of annotated PSMs provides immediate visual validation of the peptide identification. This plot overlays theoretical fragment ions onto the observed spectrum, allowing researchers to assess the quality of the match and identify potential issues.

```{r}
set.seed(9527)
p <- plotPSMannotation(
    observedSpect = getRealScanFromList(scan1),
    pep = "HSQVFSTAEDNQSAVTIHVLQGER", Atom = "C13", Prob = 0.01,
    charges = 1:2, isoCenter = 886.65, isoWidth = 4.0,
    ifRemoveNotFoundIon = TRUE
)
p
```

**Plot Features and Significance:**

This annotation plot displays several key features:

- **Observed spectrum**: The raw MS2 spectrum with peak intensities
- **Annotated fragments**: Theoretical B and Y ions that match observed peaks, labeled with ion type and position
- **Color coding**: Different colors distinguish between B ions, Y ions, and different charge states
- **Isotopic patterns**: The plot accounts for isotopic distributions based on the specified ^13^C probability

**Parameter Guidance:**

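- `Prob = 0.01`: Natural ^13^C abundance rounded to 1% (the annotation step above uses the more precise value 0.0107)
- `isoCenter = 886.65`: The isolation window center m/z, hard-coded here; it can also be taken from `scan1$isolationWindowCenterMZ`, as in the annotation step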
- `isoWidth = 4.0`: Isolation window width should match instrumental settings
- `ifRemoveNotFoundIon = TRUE`: Removes theoretical ions with no corresponding observed peaks, reducing visual clutter

The resulting visualization enables manual validation of the automated PSM annotation, which is particularly important for SIP experiments where isotopic shifts can affect fragment matching accuracy.
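
To illustrate what `isoCenter` and `isoWidth` describe, the sketch below assumes the isolation window is symmetric around its center with the given total width. How Aerith applies the window internally is not shown in this vignette, so this is an assumption, and the peak list is hypothetical.

```{r}
# Assumption: the window spans center - width / 2 to center + width / 2.
toy_center <- 886.65
toy_width <- 4.0
toy_lower <- toy_center - toy_width / 2
toy_upper <- toy_center + toy_width / 2
# Hypothetical precursor m/z values, used only to illustrate the filter.
toy_mz <- c(884.10, 884.98, 886.32, 886.65, 888.64, 889.02)
toy_inside <- toy_mz >= toy_lower & toy_mz <= toy_upper
data.frame(mz = toy_mz, inWindow = toy_inside)
```

Under this assumption, peaks outside 884.65 to 888.65 m/z would not have been co-isolated with the precursor.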

### Comprehensive Fragment Analysis with Theoretical Overlays

This analysis combines theoretical fragment ion predictions with observed spectrum data to provide a complete picture of peptide fragmentation. The approach is particularly valuable for understanding how isotopic labeling affects fragment ion patterns.

```{r}
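# Theoretical B and Y ions at roughly natural 13C abundance, charges 1+ and 2+
a <- getSipBYionSpectra("HSQVFSTAEDNQSAVTIHVLQGER", "C13", 0.01, 1:2)
# Keep only fragments below m/z 2000
slot(a, "spectra") <- slot(a, "spectra")[slot(a, "spectra")$MZ < 2000, ]
p <- plot(a)
p <- p + plotSipBYionLabel(a)
# Overlay the observed MS2 scan 107728
demo_file <- system.file("extdata", "107728.FT2", package = "Aerith")
ft2 <- readAllScanMS2(demo_file)
realScan <- getRealScan(107728, ft2)
p <- p + plotRealScan(realScan)
p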
```

**Analysis Workflow:**

1. **Theoretical spectrum generation**: `getSipBYionSpectra()` calculates expected B and Y ion m/z values considering isotopic labeling
2. **Mass range filtering**: Limiting to m/z < 2000 focuses on the most informative fragment region
3. **Label addition**: `plotSipBYionLabel()` adds ion annotations for easy identification
4. **Observed data overlay**: The real scan data is superimposed to show agreement between theory and observation

**Advantages of this Approach:**

- **Isotope-aware predictions**: Unlike standard fragmentation tools, Aerith accounts for ^13^C incorporation in fragment calculations
- **Visual validation**: Overlaying theoretical and observed spectra on the same axes enables immediate quality assessment
- **Comprehensive coverage**: Both B and Y ion series are considered, maximizing peptide sequence coverage
- **Charge state flexibility**: Multiple charge states (1+ and 2+) are included to capture the full fragmentation landscape

This visualization is particularly powerful for SIP experiments because it demonstrates how well the isotopic labeling model predicts the observed fragment pattern, which is crucial for accurate peptide quantification in labeled samples.
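
To make the theoretical side of the overlay concrete, here is a minimal sketch of how monoisotopic B and Y ion m/z values are conventionally computed from residue masses. It is not Aerith's implementation: it ignores isotopic labeling and modifications, and the helper `toy_by_mz()` and the mass table are defined here purely for illustration.

```{r}
# Standard monoisotopic residue masses (Da) for the 20 common amino acids.
toy_residue_mass <- c(
    G = 57.02146, A = 71.03711, S = 87.03203, P = 97.05276, V = 99.06841,
    T = 101.04768, C = 103.00919, L = 113.08406, I = 113.08406, N = 114.04293,
    D = 115.02694, Q = 128.05858, K = 128.09496, E = 129.04259, M = 131.04049,
    H = 137.05891, F = 147.06841, R = 156.10111, Y = 163.06333, W = 186.07931
)
toy_proton <- 1.007276
toy_water <- 18.010565

# Hypothetical helper: monoisotopic B/Y m/z for an unmodified peptide.
toy_by_mz <- function(peptide, charge = 1) {
    aa <- strsplit(peptide, "")[[1]]
    mass <- unname(toy_residue_mass[aa])
    n <- length(mass)
    # B ion i: first i residues. Y ion i: last i residues plus one water.
    b_neutral <- cumsum(mass)[-n]
    y_neutral <- cumsum(rev(mass))[-n] + toy_water
    data.frame(
        ion = c(paste0("b", seq_len(n - 1)), paste0("y", seq_len(n - 1))),
        charge = charge,
        mz = (c(b_neutral, y_neutral) + charge * toy_proton) / charge
    )
}
head(toy_by_mz("HSQVFSTAEDNQSAVTIHVLQGER", charge = 1), 3)
```

With labeling, each of these positions becomes the starting point of an isotopic envelope whose width depends on the fragment's carbon count.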

### Precursor Ion Analysis at MS1 Level

Precursor ion analysis is essential for validating peptide identification and understanding isotopic incorporation. This analysis compares the observed precursor isotopic pattern with theoretical predictions, providing confidence in both peptide identity and labeling quantification.

```{r}
demo_file <- system.file("extdata", "107695.FT1", package = "Aerith")
ft1 <- readOneScanMS1(demo_file, 107695)
precursorScan1 <- getRealScanFromList(ft1)

pep <- "HSQVFSTAEDNQSAVTIHVLQGER"
precursorSP <- getSipPrecursorSpectra(pep, Prob = 0.0107, charges = 3)
slot(precursorSP, "spectra")$Kind <- "Expected"
xlimit <- slot(precursorScan1, "spectra")$MZ > 880 & slot(precursorScan1, "spectra")$MZ < 890
slot(precursorScan1, "spectra") <- slot(precursorScan1, "spectra")[xlimit, ]
slot(precursorScan1, "spectra")$Kind <- "Observed"
maxInt <- max(slot(precursorScan1, "spectra")$Prob)
slot(precursorScan1, "spectra")$Prob <- slot(precursorScan1, "spectra")$Prob / maxInt * 100
p <- plot(precursorSP, linewidth = 0.3) + plotRealScan(precursorScan1, linewidth = 0.3) +
    scale_x_continuous(breaks = seq(880, 890, by = 1)) +
    theme(legend.title = element_blank()) +
    scale_color_manual(values = c("#E7872B", "#F3082F"))
p
```

**Key Analysis Components:**

1. **MS1 data extraction**: Reading the precursor scan corresponding to the MS2 spectrum
2. **Theoretical isotopic pattern**: `getSipPrecursorSpectra()` calculates expected isotopic distribution
3. **Data normalization**: Observed intensities are rescaled so that the most intense peak in the plotted range equals 100, allowing direct comparison with the theoretical pattern
4. **Focused m/z range**: Analysis is restricted to the precursor region (880-890 m/z)

**Interpretation of Results:**

The precursor analysis reveals several important features:

- **Isotopic pattern match**: Agreement between observed and theoretical patterns validates the peptide identification
- **Natural abundance verification**: At 1.07% ^13^C, the envelope should sit close to the monoisotopic m/z and span only a few isotopologue peaks
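
As a rough worked example of the natural abundance case, the sketch below counts the carbons in the unmodified peptide and applies a carbon-only binomial model. This is a simplification under stated assumptions: it ignores the isotopes of H, N, O and S as well as any modifications, so it only approximates the pattern returned by `getSipPrecursorSpectra()`. All `toy_` names are illustrative.

```{r}
# Carbon atoms per unmodified amino acid residue.
toy_carbons <- c(
    G = 2, A = 3, S = 3, P = 5, V = 5, T = 4, C = 3, L = 6, I = 6, N = 4,
    D = 4, Q = 5, K = 6, E = 5, M = 5, H = 6, F = 9, R = 6, Y = 9, W = 11
)
toy_pep <- "HSQVFSTAEDNQSAVTIHVLQGER"
toy_nC <- sum(toy_carbons[strsplit(toy_pep, "")[[1]]])
# Probability of 0 to 4 carbon-13 atoms under the carbon-only binomial model.
toy_envelope <- dbinom(0:4, size = toy_nC, prob = 0.0107)
names(toy_envelope) <- paste0("M+", 0:4)
toy_nC
round(toy_envelope, 3)
```

With 112 carbons, the expected number of ^13^C atoms is about 1.2, so in this simplified model the isotopologue with one ^13^C is slightly more probable than the monoisotopic one.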

**Significance for SIP Experiments:**

This type of precursor analysis is particularly valuable in SIP experiments because:

- It provides independent validation of fragment-based identifications
- It enables direct quantification of isotopic incorporation at the precursor level
- It helps distinguish between labeled and unlabeled peptides
- It supports quality control by revealing potential interferences or contamination

The close agreement between theoretical and observed patterns in this unlabeled sample establishes a baseline for comparison with isotope-enriched samples.

## Analysis of Heavily Labeled PSM at 50% ^13^C Incorporation

This section applies the same workflow to a heavily labeled sample with roughly 50% ^13^C incorporation (the code uses 0.52 for fragment annotation and 0.5 for the precursor model). Such high incorporation produces broad, shifted isotopic envelopes that cannot be modeled under a natural abundance assumption.

### Fragment Ion Annotation in Labeled Samples

Analyzing fragment ions from heavily labeled peptides presents unique challenges due to substantial mass shifts and complex isotopic envelopes. Aerith's probabilistic approach accounts for these complexities, enabling accurate annotation even at high labeling levels.

```{r}
demo_file <- system.file("extdata", "X13_4068_2596_8182.FT2", package = "Aerith")
scan1 <- readAllScanMS2(ftFile = demo_file)[["2596"]]
anno <- annotatePSM(
    scan1$peaks$mz, scan1$peaks$intensity,
    scan1$peaks$charge,
    "HYAHVDCPGHADYVK", 1:2, "C13",
    0.52, scan1$isolationWindowCenterMZ, 5.0
)
head(anno$ExpectedBYions[anno$ExpectedBYions$matchedIndices != -1, ])
residuePos <- anno$ExpectedBYions$residuePositions[anno$ExpectedBYions$matchedIndices != -1]
table(residuePos)
```

**Critical Parameters for Heavy Labeling:**

- `0.52`: The 52% ^13^C incorporation probability reflects substantial metabolic labeling
- `5.0`: Isolation window width in Da, wider than the 4.0 Da used in the unlabeled example, to accommodate the broadened isotopic envelope
- `"HYAHVDCPGHADYVK"`: The peptide sequence assigned to this scan

**Challenges and Solutions in Heavy Labeling:**

1. **Mass shift complexity**: At 52% incorporation, fragment ions exhibit significant mass shifts that vary by carbon content
2. **Isotopic envelope broadening**: Higher labeling creates wider isotopic distributions requiring careful peak matching
3. **Increased computational complexity**: The probabilistic calculations become more intensive but remain tractable
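
The first two points can be illustrated with a short carbon-only sketch. It assumes an unmodified peptide and a binomial distribution of ^13^C atoms at probability 0.52, and it reports the expected shift for each B ion. The `frag_` names are hypothetical and this is not the calculation Aerith performs internally.

```{r}
# Carbon atoms per unmodified amino acid residue.
frag_carbons <- c(
    G = 2, A = 3, S = 3, P = 5, V = 5, T = 4, C = 3, L = 6, I = 6, N = 4,
    D = 4, Q = 5, K = 6, E = 5, M = 5, H = 6, F = 9, R = 6, Y = 9, W = 11
)
frag_aa <- strsplit("HYAHVDCPGHADYVK", "")[[1]]
frag_p <- 0.52
# Carbon count of each B ion (a prefix of the peptide).
frag_nC <- unname(cumsum(frag_carbons[frag_aa]))[-length(frag_aa)]
frag_shift <- data.frame(
    ion = paste0("b", seq_along(frag_nC)),
    carbons = frag_nC,
    # Mean and standard deviation of the number of carbon-13 atoms.
    meanC13 = frag_nC * frag_p,
    sdC13 = sqrt(frag_nC * frag_p * (1 - frag_p))
)
# Expected mass shift relative to the all-carbon-12 ion (Da).
frag_shift$meanShiftDa <- frag_shift$meanC13 * 1.003355
head(frag_shift, 4)
```

Because later ions in the series contain more carbons, both the expected shift and the envelope width grow along the series.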

**Biological Significance:**

The successful annotation of this heavily labeled PSM is consistent with active metabolism and protein synthesis in the experimental system. The 52% incorporation level indicates:

- Substantial metabolic activity during the labeling period
- Effective isotope delivery to the cellular system
- Successful incorporation into newly synthesized proteins

This level of labeling is typically achieved in controlled laboratory experiments and represents an ideal scenario for quantitative SIP analysis.

### Visualization of Heavy Labeling Effects

The visualization of heavily labeled PSMs reveals the dramatic effects of isotopic incorporation on mass spectra. This plot demonstrates Aerith's capability to accurately predict and annotate complex isotopic patterns that would be challenging for conventional proteomics tools.

```{r}
set.seed(9527)
p <- plotPSMannotation(
    observedSpect = getRealScanFromList(scan1),
    pep = "HYAHVDCPGHADYVK", Atom = "C13", Prob = 0.52,
    charges = 1:2, isoCenter = scan1$isolationWindowCenterMZ, isoWidth = 5.0,
    ifRemoveNotFoundIon = TRUE
)
p
```

**Visual Features of Heavy Labeling:**

This plot illustrates several key differences from the unlabeled example:

- **Broader isotopic envelopes**: Fragment ions show wider m/z distributions due to variable ^13^C incorporation
- **Mass shifts**: All annotated fragments are shifted to higher m/z values compared to unlabeled equivalents
- **Complex peak patterns**: Individual fragment ions may appear as multiplets rather than single peaks
- **Maintained fragmentation efficiency**: Despite the labeling, B and Y ion series remain well-represented

**Comparison with Natural Abundance:**

Contrasting this heavily labeled spectrum with the natural abundance example reveals:

- **Quantitative isotope effects**: The 52% vs 1% labeling creates dramatically different spectral patterns
- **Predictive accuracy**: Aerith's theoretical calculations accurately match the observed complex patterns
- **Analytical robustness**: The software maintains annotation accuracy across a wide range of labeling levels

**State-of-the-Art Advantages:**

This analysis demonstrates Aerith's advantages over existing proteomics software:

- **Native SIP support**: Unlike standard tools that assume natural isotope abundance, Aerith is designed for variable labeling
- **Probabilistic modeling**: The software calculates isotopic distributions based on biochemically realistic incorporation models
- **Visualization integration**: Seamless transition from calculation to publication-ready plots accelerates analysis workflows

### Theoretical vs Observed Fragment Comparison in Heavy Labeling

This comparison overlays Aerith's theoretical fragment spectrum on the observed spectrum under heavy labeling. The overlay provides a visual check of how well the isotopic model predicts peak positions and envelope shapes.

```{r}
demo_file <- system.file("extdata", "X13_4068_2596_8182.FT2", package = "Aerith")
ft2 <- readAllScanMS2(demo_file)
# Theoretical B and Y ions at 52% 13C, charges 1+ and 2+
a <- getSipBYionSpectra("HYAHVDCPGHADYVK", "C13", 0.52, 1:2)
p <- plot(a)
p <- p + plotSipBYionLabel(a)
# Overlay the observed MS2 scan 2596
realScan <- getRealScan(2596, ft2)
p <- p + plotRealScan(realScan)
p
```

**Advanced Modeling Features:**

The theoretical spectrum generation for heavily labeled peptides involves several sophisticated calculations:

1. **Carbon content assessment**: Each fragment's carbon count determines its potential for isotopic incorporation
2. **Binomial probability modeling**: The 52% incorporation probability is applied using binomial distributions
3. **Isotopic envelope calculation**: Complex convolution calculations generate realistic peak shapes
4. **Charge state considerations**: Both singly and doubly charged fragments are modeled with appropriate intensity distributions

**Validation of Computational Accuracy:**

The close agreement between theoretical predictions and observed data validates several aspects of Aerith's approach:

- **Isotope incorporation model**: The 52% probability accurately reflects the biological labeling process
- **Mass calculation precision**: Theoretical m/z values closely match observed peak positions
- **Intensity modeling**: Relative peak intensities are well-predicted across the spectrum
- **Fragmentation coverage**: The model successfully predicts which fragments will be observable

**Implications for Quantitative SIP:**

This level of predictive accuracy has important implications for quantitative stable isotope probing:

- **Reliable quantification**: Accurate theoretical models enable precise measurement of isotopic incorporation
- **Automated analysis**: High-confidence predictions support automated, high-throughput workflows
- **Quality control**: Deviations from predicted patterns can indicate analytical problems or interesting biology
- **Method optimization**: Theoretical modeling can guide experimental design and instrument parameter selection

### Precursor Analysis Under Heavy Labeling Conditions

Precursor ion analysis becomes particularly informative under heavy labeling conditions, where the isotopic envelope provides direct evidence of metabolic incorporation. This analysis demonstrates how Aerith handles complex precursor isotopic patterns that span multiple mass units.

```{r}
demo_file <- system.file("extdata", "X13_2559.FT1", package = "Aerith")
ft1 <- readOneScanMS1(demo_file, 2559)
precursorScan1 <- getRealScanFromList(ft1)

pep <- "HYAHVDCPGHADYVK"
precursorSP <- getSipPrecursorSpectra(pep, Prob = 0.5, charges = 3)
slot(precursorSP, "spectra")$Kind <- "Expected"
xlimit <- slot(precursorScan1, "spectra")$MZ > 590 & slot(precursorScan1, "spectra")$MZ < 620
slot(precursorScan1, "spectra") <- slot(precursorScan1, "spectra")[xlimit, ]
slot(precursorScan1, "spectra")$Kind <- "Observed"
maxInt <- max(slot(precursorScan1, "spectra")$Prob)
slot(precursorScan1, "spectra")$Prob <- slot(precursorScan1, "spectra")$Prob / maxInt * 100
p <- plot(precursorSP, linewidth = 0.3) + plotRealScan(precursorScan1, linewidth = 0.3) +
    scale_x_continuous(breaks = seq(590, 620, by = 5)) +
    theme(legend.title = element_blank()) +
    scale_color_manual(values = c("#E7872B", "#F3082F"))
p
```

**Heavy Labeling Precursor Features:**

The precursor analysis of the heavily labeled peptide reveals several distinctive characteristics:

- **Extended isotopic envelope**: The 50% ^13^C incorporation creates a broad isotopic distribution spanning many isotopologue peaks
- **Mass centroid shift**: The envelope center is shifted significantly toward higher m/z values
- **Reduced monoisotopic peak**: The unlabeled (monoisotopic) peak is expected to be effectively absent at this labeling level
- **Complex peak structure**: Individual isotopic peaks may be resolved, creating a characteristic pattern
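
These features can be approximated with simple arithmetic. The sketch below assumes a carbon-only binomial model for the unmodified peptide (76 carbons) at charge 3+. It ignores other elements and any modifications, and the `prec_` names are illustrative only.

```{r}
# Carbon-only sketch for HYAHVDCPGHADYVK (76 carbons when unmodified).
prec_nC <- 76
prec_p <- 0.5
prec_z <- 3
# Probability that a peptide copy contains no carbon-13 at all.
prec_mono_prob <- dbinom(0, size = prec_nC, prob = prec_p)
# Mean envelope shift on the m/z axis relative to the all-carbon-12 peak.
prec_shift_mz <- prec_nC * prec_p * 1.003355 / prec_z
# m/z span covering about +/- 2 standard deviations of the carbon-13 count.
prec_sd <- sqrt(prec_nC * prec_p * (1 - prec_p))
prec_span_mz <- 4 * prec_sd * 1.003355 / prec_z
c(monoProb = prec_mono_prob, shiftMz = prec_shift_mz, spanMz = prec_span_mz)
```

Under these assumptions the envelope center sits about 12.7 m/z above the all-carbon-12 position at charge 3+, and the all-carbon-12 peak has negligible probability.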

**Quantitative Information Content:**

This precursor analysis provides multiple layers of quantitative information:

1. **Incorporation level estimation**: The envelope shape directly reflects the isotopic incorporation percentage
2. **Metabolic activity assessment**: The degree of labeling indicates the extent of biosynthetic activity
3. **Temporal information**: Heavy labeling patterns can reveal the timing of protein synthesis
4. **Quality validation**: Agreement between observed and predicted patterns confirms accurate identification
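
As an illustration of the first point, the sketch below recovers the incorporation level from the intensity-weighted centroid of an envelope. It uses a simulated carbon-only binomial envelope and a hypothetical all-carbon-12 m/z of 600. Real spectra contain noise, overlapping species and contributions from other elements, so this is a simplified assumption rather than Aerith's estimation method.

```{r}
# Simulated carbon-only envelope; in practice the intensities would come
# from the observed MS1 peaks of the precursor.
est_nC <- 76          # carbons in unmodified HYAHVDCPGHADYVK
est_z <- 3
est_mono_mz <- 600    # hypothetical all-carbon-12 m/z, illustration only
est_k <- 0:est_nC
est_mz <- est_mono_mz + est_k * 1.003355 / est_z
est_int <- dbinom(est_k, size = est_nC, prob = 0.52)
# Intensity-weighted centroid, converted back to a mean carbon-13 fraction.
est_centroid <- sum(est_mz * est_int) / sum(est_int)
est_p <- (est_centroid - est_mono_mz) * est_z / 1.003355 / est_nC
est_p
```

The estimate returns the simulated value of 0.52 because the mean of a binomial distribution is the number of carbons times the incorporation probability.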

**Comparison with Natural Abundance:**

Contrasting this heavily labeled precursor with the natural abundance example illustrates:

- **Dramatic isotopic shifts**: The plotted window is 30 m/z units wide (590-620), equivalent to about 90 Da at charge 3+, compared with the 10 m/z window (880-890) used for the unlabeled precursor
- **Computational complexity**: Accurate prediction requires sophisticated isotopic modeling
- **Analytical challenges**: Traditional proteomics tools would struggle with such complex patterns
- **Information richness**: Heavy labeling provides far more quantitative information than natural abundance

**Methodological Advantages:**

This analysis showcases several key advantages of the Aerith approach:

- **Seamless scalability**: The same computational framework handles both natural abundance and heavy labeling
- **Predictive accuracy**: Theoretical calculations accurately predict complex isotopic envelopes
- **Integrated workflow**: Precursor and fragment analysis use consistent isotopic modeling
- **Publication-ready visualization**: High-quality plots facilitate data interpretation and presentation

## High-Throughput Batch Processing of PSMs

One of Aerith's key strengths is its ability to process large numbers of PSMs automatically while maintaining the same level of annotation accuracy demonstrated in individual examples. This batch processing capability is essential for proteome-wide SIP experiments.

### Automated Batch Analysis Workflow

The batch processing functionality demonstrates Aerith's scalability and practical utility in real-world proteomics workflows. This example processes multiple PSMs from a single experiment, generating publication-ready plots for each identification.

```{r}
element <- "C13"
demo_file <- system.file("extdata", "demo.psm.txt", package = "Aerith")
psm <- readPSMtsv(demo_file)
psm <- psm[psm$Filename == "Pan_052322_X13.FT2", ]
psm <- psm[psm$ScanNumber %in% c("4068", "2596", "8182"), ]
demo_file <- system.file("extdata", "X13_4068_2596_8182.FT2", package = "Aerith")
ft2 <- readAllScanMS2(demo_file)
ftFileNames <- psm$Filename
scanNumbers <- psm$ScanNumber
proNames <- psm$ProteinNames
charges <- psm$ParentCharge
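# Strip the first and last character flanking each peptide sequence
pep <- psm$OriginalPeptide
pep <- stringr::str_sub(pep, 2, -2)
# Convert the number in the second "_"-separated field of SearchName
# into an incorporation probability
pct <- psm$SearchName
pct <- as.numeric(stringr::str_sub(
    stringr::str_split(pct, "_", simplify = TRUE)[, 2], 1, -4
)) / 100 / 1000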
realScans <- getRealScans(ft2, scanNumbers)
tmp <- tempdir()
plotPSMs(
    realScans,
    charges,
    element,
    pct,
    BYcharge = 1:2,
    ftFileNames,
    scanNumbers,
    pep,
    proNames,
    path = tmp
)
list.files(tmp, pattern = "\\.pdf$", full.names = TRUE)
```

**Batch Processing Components:**

1. **PSM data import**: Reading peptide identification results from standard formats
2. **Parameter extraction**: Automatically parsing isotopic incorporation levels from search parameters
3. **Spectral data loading**: Efficient loading of corresponding MS2 spectra
4. **Automated annotation**: Applying the same annotation algorithms used in individual analysis
5. **Plot generation**: Creating standardized plots for each PSM with appropriate labeling
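
The parameter extraction step can be sketched in base R on hypothetical inputs. The example strings below are assumptions chosen to be consistent with the parsing code above: a search name whose second underscore-separated field is a number followed by a three-character suffix, and peptides wrapped in one flanking character on each side. The actual contents of `demo.psm.txt` may differ.

```{r}
# Hypothetical values; the real columns come from the PSM table.
toy_search <- c("C13_1070Pct", "C13_52000Pct")
toy_peptide <- c("[HSQVFSTAEDNQSAVTIHVLQGER]", "[HYAHVDCPGHADYVK]")
# Second "_"-separated field, with its last three characters removed.
toy_field <- vapply(strsplit(toy_search, "_"), `[`, character(1), 2)
toy_digits <- substr(toy_field, 1, nchar(toy_field) - 3)
# The number is read as thousandths of a percent: 52000 -> 52% -> 0.52.
toy_pct <- as.numeric(toy_digits) / 1000 / 100
# Drop the first and last character flanking the sequence.
toy_seq <- substr(toy_peptide, 2, nchar(toy_peptide) - 1)
data.frame(search = toy_search, prob = toy_pct, peptide = toy_seq)
```

Checking a few parsed values by eye before calling `plotPSMs()` is a cheap safeguard, since a mis-scaled probability would shift every theoretical envelope.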

**Workflow Advantages:**

This automated approach provides several key benefits for large-scale SIP experiments:

- **Consistency**: All PSMs are processed using identical parameters and algorithms
- **Efficiency**: Batch processing eliminates manual intervention for large datasets
- **Standardization**: Uniform plot formatting facilitates comparison across PSMs
- **Documentation**: Automatic generation of analysis records for each identification
- **Quality control**: Systematic processing enables identification of outliers or problems

**Scalability and Performance:**

This batch processing approach scales effectively to proteome-wide datasets:

- **Memory efficiency**: Spectra are loaded and processed incrementally
- **Computational optimization**: Vectorized calculations maximize processing speed
- **Output management**: Organized file naming and directory structure
- **Error handling**: Robust processing continues even if individual PSMs fail

**Integration with Proteomics Workflows:**

The batch processing functionality integrates seamlessly with standard proteomics pipelines:

- **Standard input formats**: Compatible with common search engine outputs
- **Flexible output options**: Multiple plot formats and annotation levels supported
- **Downstream compatibility**: Results integrate with quantitative analysis tools
- **Quality metrics**: Automated calculation of annotation statistics and confidence measures

This automated capability transforms Aerith from a specialized analysis tool into a practical solution for production proteomics workflows, enabling routine application of SIP analysis to large-scale experiments.

## Summary and Best Practices

This vignette has demonstrated the comprehensive PSM annotation and visualization capabilities of the Aerith package. The examples span from natural abundance (1.07% ^13^C) to heavy labeling (52% ^13^C), illustrating the software's versatility and accuracy across diverse experimental conditions.

### Key Takeaways

1. **Isotope-aware analysis**: Aerith provides native support for stable isotope probing experiments, accurately modeling complex isotopic patterns that challenge conventional proteomics tools.

2. **Multi-level validation**: The combination of precursor (MS1) and fragment (MS2) analysis provides comprehensive validation of peptide identifications and quantification of isotopic incorporation.

3. **Visual integration**: Seamless integration of computational analysis with publication-ready visualization accelerates data interpretation and presentation.

4. **Scalable workflows**: Batch processing capabilities enable application to proteome-wide datasets while maintaining annotation accuracy.

### Parameter Selection Guidelines

- **Isotopic incorporation probability**: Should reflect the experimental conditions; it can be estimated from precursor isotopic patterns or taken from the PSM search results
- **Isolation window parameters**: Must match instrumental settings used for data acquisition
- **Charge state ranges**: Should cover the charge states expected for the peptide length and experimental conditions; the precursor charge can be taken from the PSM table

```{r session-info}
sessionInfo()
```