Score function test of PSM

library(Aerith)
library(dplyr)
library(stringr)
library(ggplot2)
library(tidyr)

The Aerith package delivers isotope-aware peptide-spectrum matching (PSM) scoring that keeps pace with modern stable isotope probing (SIP) workflows. This vignette walks through two representative scenarios—an unlabeled peptide and a peptide labeled at roughly 50% \(^{13}\)C—to illustrate how Aerith quantifies evidence for isotopic incorporation. Each section begins with the experimental context, introduces the relevant Aerith functions, and explains how to interpret the resulting scores for method development or routine quality control.

Overview of scoring functions

Aerith exposes multiple scoring strategies that capture complementary information:

scorePSM() evaluates observed spectra against theoretical expectations across a user-specified isotopic abundance grid by (WDP) function in Sipros.
annotatePSM() returns cross-correlation (XcorrScore) and multivariate hypergeometric (MVHscore) metrics after annotating fragment ions.

These scores are designed to support threshold selection and benchmarking against state-of-the-art SIP-aware search engines. The following sections demonstrate how to configure their key parameters and interpret the resulting trends.

Unlabeled PSM at 1.07% ¹³C

We begin with an Aerith example file representing an effectively unlabeled spectrum. The goal is to confirm that Aerith reports maximal scores near natural-abundance \(^{13}\)C levels.

demo_file <- system.file("extdata", "107728.FT2", package = "Aerith")
scan2 <- readOneScanMS2(ftFile = demo_file, 107728)
scan1 <- getRealScanFromList(scan2)
spectra <- slot(scan1, "spectra")
charges <- slot(scan1, "charges")
pep <- "HSQVFSTAEDNQSAVTIHVLQGER"
pep2 <- "[HSQVFSTAEDNQSAVTIHVLQGER]"

The readOneScanMS2() and getRealScanFromList() helpers retrieve the centroided fragment spectrum for scan 107728. Throughout this vignette, peptide sequences are provided both as plain strings (pep) for annotation and with bracketed notation (pep2) for scoring functions that expect capped termini.

isoCenter <- scan2$isolationWindowCenterMZ
anno <- annotatePSM(
    spectra$MZ, spectra$Prob,
    spectra$Charge,
    pep, 1:2, "C13",
    0.0107, isoCenter, 5.0, TRUE
)
WDPscores <- sapply((0:100) / 100, simplify = TRUE, function(prob) {
    return(scorePSM(spectra$MZ,
        realIntensity = spectra$Prob,
        realCharge = spectra$Charge, parentCharge = charges[1],
        pepSeq = pep2, Atom = "C13", Prob = prob
    ))
})
scores <- sapply((0:100) / 100, simplify = TRUE, function(prob) {
    anno <- annotatePSM(
        spectra$MZ, spectra$Prob,
        spectra$Charge,
        pep, 1:2, "C13",
        prob, isoCenter, 4.0, TRUE
    )
    return(c(XcorrScore = anno$XcorrScore, MVHscore = anno$MVHscore))
})
scores <- t(scores)
scores <- cbind(WDPscores, scores)

Here we scan the full 0–100% \(^{13}\)C grid to understand score dynamics. Key arguments are:

isoCenter: the isolation window center in \(m/z\), sourced from instrument metadata or PSMs file to anchor precursor annotation.
prob: the hypothesized \(^{13}\)C proportion. The fixed value (0.0107) approximates natural abundance for annotatePSM(), while the loop explores alternative hypotheses.
parentCharge: drawn from scan1@charges to ensure theoretical spectra match the observed charge state.

annotatePSM() produces both cross-correlation and MVH scores, while scorePSM() returns the weighted dot product (WDP) score. Evaluating all three allows users to pick thresholds aligned with their downstream filtering strategy.

scores_df <- as.data.frame(scores)
scores_df$abundance <- 0:100

# Reshape the data to long format for faceting
scores_long <- scores_df %>%
    pivot_longer(
        cols = c(WDPscores, XcorrScore, MVHscore),
        names_to = "score_type",
        values_to = "score_value"
    )

# Create a publication-quality plot
ggplot(scores_long, aes(x = abundance, y = score_value)) +
    geom_line() +
    facet_wrap(~score_type, scales = "free_y") +
    labs(
        x = expression(paste(~ {}^{
            13
        }, "C %")),
        y = "Score",
        title = expression(paste("Scores Across", ~ {}^{
            13
        }, "C %"))
    ) +
    theme_minimal() +
    theme(
        plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12),
        strip.text = element_text(size = 12, face = "bold"),
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "white", colour = "black")
    )

The resulting faceted plot summarizes how each score reacts to varying \(^{13}\)C abundance assumptions. For this near-unlabeled spectrum, all metrics peak close to natural abundance, confirming the absence of substantial label incorporation. The free \(y\)-scales highlight the relative sensitivity of each metric without forcing a shared dynamic range.

Labeled PSM at 50% ¹³C

Annotate B and Y ion fragments

We now investigate a spectrum from the Aerith labeled dataset, expecting scores to favor higher \(^{13}\)C proportions consistent with experimental labeling.

demo_file <- system.file("extdata", "X13_4068_2596_8182.FT2", package = "Aerith")
ft2 <- readAllScanMS2(demo_file)
scan2 <- ft2[["2596"]]
scan1 <- getRealScanFromList(scan2)
spectra <- slot(scan1, "spectra")
charges <- slot(scan1, "charges")
pep <- "HYAHVDCPGHADYVK"
pep2 <- "[HYAHVDCPGHADYVK]"
pct <- 0.52

The full scan list is loaded with readAllScanMS2() so we can reference scan 2596 explicitly.

isoCenter <- scan2$isolationWindowCenterMZ
WDPscores <- sapply((0:100) / 100, simplify = TRUE, function(prob) {
    return(scorePSM(spectra$MZ,
        realIntensity = spectra$Prob,
        realCharge = spectra$Charge, parentCharge = charges[1],
        pepSeq = pep2, Atom = "C13", Prob = prob
    ))
})
scores <- sapply((0:100) / 100, simplify = TRUE, function(prob) {
    anno <- annotatePSM(
        spectra$MZ, spectra$Prob,
        spectra$Charge,
        pep, 1:2, "C13",
        prob, isoCenter, 4.0, TRUE
    )
    return(c(XcorrScore = anno$XcorrScore, MVHscore = anno$MVHscore))
})
scores <- t(scores)
scores <- cbind(WDPscores, scores)

As before, we iterate across the full probability grid to visualize the likelihood landscape. Practitioners may narrow the grid around anticipated labeling levels (for example, 30–70%) to accelerate large-scale studies, but the full sweep is informative when benchmarking instrument performance or validating new sample types.

scores_df <- as.data.frame(scores)
scores_df$abundance <- 0:100

# Reshape the data to long format for faceting
scores_long <- scores_df %>%
    pivot_longer(
        cols = c(WDPscores, XcorrScore, MVHscore),
        names_to = "score_type",
        values_to = "score_value"
    )

# Create a publication-quality plot
ggplot(scores_long, aes(x = abundance, y = score_value)) +
    geom_line() +
    facet_wrap(~score_type, scales = "free_y") +
    labs(
        x = expression(paste(~ {}^{
            13
        }, "C %")),
        y = "Score",
        title = expression(paste("Scores Across", ~ {}^{
            13
        }, "C %"))
    ) +
    theme_minimal() +
    theme(
        plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12),
        strip.text = element_text(size = 12, face = "bold"),
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "white", colour = "black")
    )

In this labeled scenario, each score exhibits a clear maximum near 50% abundance, mirroring the known experimental condition. Divergence among the score shapes highlights how cross-correlation and hypergeometric statistics can provide complementary evidence; Aerith lets analysts combine or prioritize these metrics depending on downstream classification needs.

Guidance on parameter selection and interpretation

Isolation window (isoCenter): always source from spectrum metadata to avoid systematic shifts in the theoretical isotope envelope.
Probability grid: dense grids (e.g., 0–1% steps) aid exploratory analyses and method validation, while coarser grids reduce compute cost during high-throughput processing. Align step size with the expected precision of labeling efficiency estimates.
Score evaluation: identify the probability that maximizes each metric. Consistent peaks across WDP, Xcorr, and MVH suggest robust assignment. Discrepancies can indicate spectrum quality issues or mixed precursor populations.
Thresholding: use empirical distributions of these scores across known target and decoy sets to set decision thresholds, leveraging Aerith’s compatibility with Percolator and other state-of-the-art post-processors.

Why Aerith stands out

Aerith integrates isotope-aware scoring, fragment annotation, and downstream export utilities within a single R-native workflow. Compared with generic proteomics search engines, Aerith explicitly models SIP isotope envelopes, offers fine-grained control over labeling hypotheses, and interoperates with modern statistical rescoring pipelines. This alignment with current SIP best practices ensures that researchers can bridge method development and routine analysis without leaving the R ecosystem.

sessionInfo()
#> R Under development (unstable) (2025-10-20 r88955)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
#> 
#> locale:
#> [1] C
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] tidyr_1.3.1    ggplot2_4.0.1  stringr_1.6.0  dplyr_1.1.4    Aerith_0.99.11
#> 
#> loaded via a namespace (and not attached):
#>  [1] rlang_1.1.6                 magrittr_2.0.4             
#>  [3] clue_0.3-66                 matrixStats_1.5.0          
#>  [5] compiler_4.6.0              systemfonts_1.3.1          
#>  [7] vctrs_0.6.5                 reshape2_1.4.5             
#>  [9] ProtGenerics_1.43.0         pkgconfig_2.0.3            
#> [11] MetaboCoreUtils_1.19.1      crayon_1.5.3               
#> [13] fastmap_1.2.0               XVector_0.51.0             
#> [15] labeling_0.4.3              rmarkdown_2.30             
#> [17] preprocessCore_1.73.0       ragg_1.5.0                 
#> [19] purrr_1.2.0                 xfun_0.54                  
#> [21] MultiAssayExperiment_1.37.2 cachem_1.1.0               
#> [23] jsonlite_2.0.0              DelayedArray_0.37.0        
#> [25] BiocParallel_1.45.0         parallel_4.6.0             
#> [27] cluster_2.1.8.1             R6_2.6.1                   
#> [29] bslib_0.9.0                 stringi_1.8.7              
#> [31] RColorBrewer_1.1-3          limma_3.67.0               
#> [33] GenomicRanges_1.63.0        jquerylib_0.1.4            
#> [35] Rcpp_1.1.0                  Seqinfo_1.1.0              
#> [37] SummarizedExperiment_1.41.0 iterators_1.0.14           
#> [39] knitr_1.50                  IRanges_2.45.0             
#> [41] Matrix_1.7-4                igraph_2.2.1               
#> [43] tidyselect_1.2.1            dichromat_2.0-0.1          
#> [45] abind_1.4-8                 yaml_2.3.11                
#> [47] doParallel_1.0.17           codetools_0.2-20           
#> [49] affy_1.89.0                 lattice_0.22-7             
#> [51] tibble_3.3.0                plyr_1.8.9                 
#> [53] Biobase_2.71.0              withr_3.0.2                
#> [55] S7_0.2.1                    evaluate_1.0.5             
#> [57] Spectra_1.21.0              pillar_1.11.1              
#> [59] affyio_1.81.0               BiocManager_1.30.27        
#> [61] MatrixGenerics_1.23.0       foreach_1.5.2              
#> [63] stats4_4.6.0                MSnbase_2.37.0             
#> [65] MALDIquant_1.22.3           ncdf4_1.24                 
#> [67] generics_0.1.4              S4Vectors_0.49.0           
#> [69] scales_1.4.0                glue_1.8.0                 
#> [71] lazyeval_0.2.2              tools_4.6.0                
#> [73] mzID_1.49.0                 data.table_1.17.8          
#> [75] QFeatures_1.21.0            vsn_3.79.0                 
#> [77] mzR_2.45.0                  fs_1.6.6                   
#> [79] XML_3.99-0.20               grid_4.6.0                 
#> [81] impute_1.85.0               MsCoreUtils_1.23.1         
#> [83] PSMatch_1.15.0              cli_3.6.5                  
#> [85] textshaping_1.0.4           S4Arrays_1.11.1            
#> [87] AnnotationFilter_1.35.0     pcaMethods_2.3.0           
#> [89] gtable_0.3.6                sass_0.4.10                
#> [91] digest_0.6.39               BiocGenerics_0.57.0        
#> [93] SparseArray_1.11.7          ggrepel_0.9.6              
#> [95] farver_2.1.2                htmltools_0.5.8.1          
#> [97] lifecycle_1.0.4             statmod_1.5.1              
#> [99] MASS_7.3-65