Phosphoproteomics provides rich information for dissecting pathway activities
and therefore is becoming vital in basic and translational biomedical research.
To facilitate and streamline phosphoproteomics data analysis, we developed
SmartPhos
, an R package for the pre-processing, quality control, and
exploratory analysis of phosphoproteomics data generated by mass-spectrometry.
SmartPhos can process outputs from MaxQuant
and Spectronaut
either using
the R command line or in an interactive ShinyApp called SmartPhos Explorer
.
Besides commonly used preprocessing steps, such as normalization,
transformation, imputation, and batch effect correction, our package features a
novel method for correcting normalization artifacts observed in
phosphoproteomics, especially when large global phosphorylation changes are
expected, by taking both phospho-enriched and unenriched samples into account.
In addition, the SmartPhos Explorer
ShinyApp
included in our R package
provides a user-friendly and interactive one-stop solution for performing
exploratory data analysis (PCA, hierarchical clustering, etc.), differential
expression, time-series clustering, gene set enrichment analysis, and kinase
activity analysis easily without the knowledge of coding or the underlying
statistical model.
This vignette focuses on using SmartPhos
with the R command line.
To install SmartPhos
enter the following to the R
console
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("SmartPhos")
Load the package
library(MultiAssayExperiment)
library(SmartPhos)
Load the R object
data("dia_example")
dia_example
## A MultiAssayExperiment object of 2 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 2:
## [1] Phosphoproteome: SummarizedExperiment with 500 rows and 47 columns
## [2] Proteome: SummarizedExperiment with 500 rows and 47 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
Extract the SummarizedExperiment object from the multiAssayExperiment object
se <- dia_example[["Phosphoproteome"]]
colData(se) <- colData(dia_example)
se
## class: SummarizedExperiment
## dim: 500 47
## metadata(0):
## assays(1): Intensity
## rownames(500): s1 s2 ... s499 s500
## rowData names(7): UniprotID Gene ... Sequence site
## colnames(47): FullProteome_1stCrtl_0min_rep2
## FullProteome_1stCrtl_0min_rep3 ... Phospho_HGF_24h_rep1
## Phospho_HGF_100min_rep1
## colData names(6): sample treatment ... sampleType sampleName
SmartPhos package performs some pre-filtering and preprocessing based on various threshold values during the generation of the multiAssayExperiment object. On top of that, this panel has different options for preprocessing of the selected assay. The options provided by the shiny app are as follows:
The preprocessing (transformation, normalization, imputation, etc.) on the intensity assay can be performed by running the following command:
newSE <- preprocessPhos(seData = se, transform = "log2", normalize = TRUE,
impute = "QRILC")
Plot the boxplot of assay intensities:
plotIntensity(newSE, colorByCol = "replicate")
Plot the completeness (percentage of non-missing) for each sample:
plotMissing(newSE)
This is performed on the original assay and not on the imputed assay.
Perform principal component analysis (PCA) by running the following command:
# perform PCA
pca <- stats::prcomp(t(assays(newSE)[["imputed"]]), center = TRUE,
scale. = TRUE)
# call the plotting function
p <- plotPCA(pca = pca, se = newSE, color = "replicate")
p
Heatmap can be plotted using the plotHeatmap() function of SmartPhos. There are three different heatmaps possible based on the type argument:
plotHeatmap(type = "Top variant", newSE, top = 10,
annotationCol = c("replicate", "treatment"))
The goal of performing differential expression analysis is to quantify the expression levels of genes between different experimental conditions using statistical tests.
If the users want paired t-test on the patient IDs, cell lines, etc, then the users must have subjectID as one of the column in the fileTable.txt file with the relevant information. The subjectID column should also be selected in the additional column annotations before the generation of multiAssayExperiment object.
The performDifferentialExp() has two methods for performing differential expression analysis:
Limma method needs an assay with non-missing data and is faster than the proDA method. ProDA method can work on assay with missing values also.
dea <- performDifferentialExp(se = newSE, assay = "imputed", method = "limma",
condition = "treatment", reference = "EGF",
target = "1stCrtl")
‘dea’ is a list object which contains two values: ‘resDE’ with tabular differential expression analysis results and ‘seSub’ which is the subsetted SummarizedExperiment object based on the selected condition.
dea$seSub
## class: SummarizedExperiment
## dim: 172 13
## metadata(0):
## assays(2): Intensity imputed
## rownames(172): s4 s6 ... s499 s500
## rowData names(7): UniprotID Gene ... Sequence site
## colnames(13): Phospho_1stCrtl_0min_rep2 Phospho_1stCrtl_0min_rep3 ...
## Phospho_EGF_24h_rep1 Phospho_EGF_100min_rep1
## colData names(7): sample treatment ... sampleName comparison
dea$resDE
## # A tibble: 172 × 12
## ID log2FC stat pvalue padj UniprotID Gene Multiplicity Position
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <int> <chr>
## 1 s447 2.18 4.81 0.000386 0.0663 O00401 WASL 1 256
## 2 s318 2.94 3.99 0.00167 0.144 O00139 KIF2A 1 484
## 3 s28 -1.61 -3.56 0.00373 0.149 A0MZ66 SHTN1 1 101
## 4 s156 2.36 3.51 0.00409 0.149 A0A590UJ23;A0… PALM3 2 436
## 5 s184 -1.62 -3.48 0.00434 0.149 A6NKD9 CCDC… 2 251
## 6 s4 1.01 2.82 0.0151 0.434 A0AV96 RBM47 1 519
## 7 s381 2.60 2.58 0.0236 0.559 O00231 PSMD… 1 23
## 8 s159 2.96 2.46 0.0293 0.559 A0A590UJ23;A0… PALM3 2 548
## 9 s173 0.927 2.43 0.0309 0.559 A6NFI3 ZNF3… 3 10
## 10 s157 1.65 2.39 0.0337 0.559 A0A590UJ23;A0… PALM3 2 439
## # ℹ 162 more rows
## # ℹ 3 more variables: Residue <chr>, Sequence <chr>, site <chr>
Plot the volcano plot for the obtained differential expression analysis results by running the following command:
plotVolcano(dea$resDE)
pFilter and fcFilter arguments can be used to change the thresholds for the upregulated and downregulated genes being represented in the volcano plot.
Use intensityBoxPlot() to plot the intensity data of a specified gene, with optional subject-specific lines:
intensityBoxPlot(se = dea$seSub, id = 's447', symbol = "WASL")
Time series clustering is performed to group proteins/phosphopeptides based on how their level changes over time. The clusterTS() function employs fuzzy c-means clustering algorithm which considers the time-resolved trend, but not the expression levels. Thus, members of the same cluster would have a similar trend over time (e.g., all increasing or all decreasing), though their expression level can be different.
To use this function, users must have logitudinal data with timepoint annotation in their data. The time points are either unit-less numbers (e.g., 1, 2, 3) or are in hour and/or minute, which must be typed as “h” and “min”, respectively (e.g., 1min, 2min, 3h). Please notice that mixing the two said options (e.g., 1, 2min, 3) or using other unit for time (e.g., 1hour, 2minute, 3day) will likely lead to wrong results.
# call addZeroTime function to add zero timepoint to EGF treatment
newSEzero <- addZeroTime(newSE, condition = "treatment", treat = "EGF",
zeroTreat = "1stCrtl",
timeRange = c("10min","100min", "24h"))
# extract the assay
exprMat <- SummarizedExperiment::assay(newSEzero)
# call the clustering function
tsc <- clusterTS(x = exprMat, k = 5)
‘tsc’ is a list object which contains two values: ‘cluster’ with tabular time series clustering results and ‘plot’ with the clusters plots.
tsc$cluster
## # A tibble: 580 × 7
## feature time value cluster prob cNum clusterNum
## <fct> <fct> <dbl> <chr> <dbl> <int> <chr>
## 1 s499 Phospho_EGF_10min_rep2 14.1 cluster4 0.431 12 cluster4 (12)
## 2 s499 Phospho_EGF_10min_rep3 14.1 cluster4 0.431 12 cluster4 (12)
## 3 s499 Phospho_EGF_100min_rep2 14.3 cluster4 0.431 12 cluster4 (12)
## 4 s499 Phospho_EGF_100min_rep3 12.5 cluster4 0.431 12 cluster4 (12)
## 5 s499 Phospho_EGF_10min_rep1 8.96 cluster4 0.431 12 cluster4 (12)
## 6 s499 Phospho_EGF_24h_rep1 14.5 cluster4 0.431 12 cluster4 (12)
## 7 s499 Phospho_EGF_100min_rep1 14.6 cluster4 0.431 12 cluster4 (12)
## 8 s499 Phospho_EGF_0min_rep2 14.2 cluster4 0.431 12 cluster4 (12)
## 9 s499 Phospho_EGF_0min_rep3 14.2 cluster4 0.431 12 cluster4 (12)
## 10 s499 Phospho_EGF_0min_rep1 13.6 cluster4 0.431 12 cluster4 (12)
## # ℹ 570 more rows
tsc$plot
A single gene or phospho-site time series data can also be plotted:
timerange <- unique(newSEzero$timepoint)
plotTimeSeries(newSEzero, type = "expression", geneID = "s40",
symbol = "RBM47_T519", condition = "treatment",
treatment = "EGF", timerange = timerange)
Enrichment analysis can be performed on either gene sets (gene-centric) or post-translational modification signature sets (site-centric) that are differentially expressed or in a time-series cluster.
In the latter, each set contains PTM site names with direction of regulation (up- or down-regulated) instead of gene names. Only phosphorylation sites will be considered since this pipeline supports proteomics and phosphoproteomics data.
The gene sets and PTM signature sets are derived from the Molecular Signatures Database (Subramanian, Tamayo et al.,2005; Liberzon et al., 2011, Liberzon et al., 2015) and the PTM Signature Database version 2.0.0 (Krug et al., 2019), respectively. Users are encouraged to consult the PTMsigDB website (https://proteomics.broadapps.org/ptmsigdb/) and the paper of Krug et al. for details on how PTM signature sets were curated and their annotation.
In gene-centric pathway enrichment, users can perform either Parametric Analysis of Gene Set Enrichment (PAGE) (Kim & Volsky, 2005) or Gene Set Enrichment Analysis (GSEA) (Subramanian, Tamayo et al.,2005) with Differential Expression analysis result. With Time series clustering result, we offer the Fisher’s exact test. For phospho-signature enrichment on Differential Expression analysis result, we offer PTM-Signature Enrichment Analysis (PTM-SEA), a method adapted from GSEA by Krug et al. (2019) to be applicable with the PTM Signature Database. On time series clustering result, we offer the Fisher’s exact test, in which each signature set is split into two, one containing upregulated and the other downregulated phosphosites.
# Load the gene set
genesetPath <- system.file("shiny-app/geneset", package = "SmartPhos")
inGMT <- piano::loadGSC(paste0(genesetPath,"/Cancer_Hallmark.gmt"), type="gmt")
# Call the function
resTab <- enrichDifferential(dea = dea$resDE, type = "Pathway enrichment",
gsaMethod = "PAGE", geneSet = inGMT,
statType = "stat", nPerm = 200, sigLevel = 0.05,
ifFDR = FALSE)
## Checking arguments...done!
## Calculating gene set statistics...done!
## Calculating gene set significance...done!
## Adjusting for multiple testing...done!
resTab
## Name Gene Number Stat p.up p.up.adj p.down
## 1 HALLMARK_APICAL_JUNCTION 1 2.5684 0.005108 0.11238 0.99489
## 2 HALLMARK_COMPLEMENT 1 2.0317 0.021090 0.14260 0.97891
## 3 HALLMARK_GLYCOLYSIS 1 2.0317 0.021090 0.14260 0.97891
## 4 HALLMARK_MITOTIC_SPINDLE 4 1.9444 0.025926 0.14260 0.97407
## p.down.adj Number up Number down
## 1 0.99489 1 0
## 2 0.99489 1 0
## 3 0.99489 1 0
## 4 0.99489 4 0
# Load the gene set
genesetPath <- system.file("shiny-app/geneset", package = "SmartPhos")
inGMT <- piano::loadGSC(paste0(genesetPath,
"/Chemical_and_Genetic_Perturbations.gmt"),
type="gmt")
# Call the function
clustEnr <- clusterEnrich(clusterTab = tsc$cluster,
se = newSE, inputSet = inGMT,
filterP = 0.05,
ifFDR = FALSE)
clustEnr
## $table
## # A tibble: 27 × 9
## Name Gene.number Set.size pval Genes padj cluster ifSig atLeast1
## <chr> <int> <int> <dbl> <lis> <dbl> <chr> <lgl> <lgl>
## 1 JOHNSTONE_PA… 3 6 0.00819 <chr> 1 cluste… TRUE TRUE
## 2 BROWN_MYELOI… 2 2 0.0101 <chr> 0.927 cluste… TRUE TRUE
## 3 WANG_CISPLAT… 2 2 0.0101 <chr> 0.927 cluste… TRUE TRUE
## 4 CREIGHTON_EN… 2 2 0.0101 <chr> 0.927 cluste… TRUE TRUE
## 5 YOSHIMURA_MA… 2 2 0.0101 <chr> 0.927 cluste… TRUE TRUE
## 6 IVANOVA_HEMA… 3 7 0.0229 <chr> 0.927 cluste… TRUE TRUE
## 7 WANG_LMO4_TA… 2 2 0.0264 <chr> 1 cluste… TRUE TRUE
## 8 GAUSSMANN_ML… 2 2 0.0264 <chr> 1 cluste… TRUE TRUE
## 9 MCBRYAN_PUBE… 2 2 0.0264 <chr> 1 cluste… TRUE TRUE
## 10 FARMER_BREAS… 2 3 0.0287 <chr> 0.927 cluste… TRUE TRUE
## # ℹ 17 more rows
##
## $plot
# Load the ptm set
ptmsetPath <- system.file("shiny-app/ptmset", package = "SmartPhos")
load(paste0(ptmsetPath, "/human_PTM.rda"))
# Call the function
clustEnr <- clusterEnrich(clusterTab = tsc$cluster, se = newSE,
inputSet = ptmSetDb, ptm = TRUE, filterP = 0.05,
ifFDR = FALSE)
clustEnr
## $table
## # A tibble: 4 × 9
## Name Gene.number Set.size pval Genes padj cluster ifSig atLeast1
## <chr> <int> <int> <dbl> <lis> <dbl> <chr> <lgl> <lgl>
## 1 DISEASE-PSP_P… 1 1 0.0465 <chr> 0.0814 cluste… TRUE TRUE
## 2 DISEASE-PSP_p… 1 1 0.0465 <chr> 0.0814 cluste… TRUE TRUE
## 3 DISEASE-PSP_n… 1 1 0.0465 <chr> 0.0814 cluste… TRUE TRUE
## 4 DISEASE-PSP_c… 1 1 0.0465 <chr> 0.0814 cluste… TRUE TRUE
##
## $plot
SmartPhos can perform kinase activity inference based on phosphopeptides that are differentially expressed or in a cluster. Similar to enrichment analysis, users would first need to perform either a differential expression analysis or time-series clustering and select a cluster of interest.
A network of kinase-phosphosite interactions is constructed using the package OmnipathR (Türei et al., 2021). Users can choose to construct this network with prior knowledge from either Homo sapiens (taxonomy ID = 9606) or Mus musculus (taxonomy ID = 10090).
By combining prior knowledge about known kinase-phosphosite interactions and the data, SmartPhos can infer the activity of the kinases responsible for the phosphopeptides being considered. The activity is estimated by an activity score computed with the package decoupleR (Badia-I-Mompel et al., 2022) following the authors’ tutorial on Kinase and Transcription Factor activity estimation. For time-series clustering, users can also estimate how likely the kinases are associated with phosphopeptides in the selected cluster.
netw <- getDecouplerNetwork("Homo sapiens")
scoreTab <- calcKinaseScore(dea$resDE, netw, statType = "stat", nPerm = 500)
sessionInfo()
## R version 4.5.1 Patched (2025-08-23 r88802)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] SmartPhos_0.99.30 MultiAssayExperiment_1.35.9
## [3] SummarizedExperiment_1.39.2 Biobase_2.69.1
## [5] GenomicRanges_1.61.5 Seqinfo_0.99.2
## [7] IRanges_2.43.5 S4Vectors_0.47.4
## [9] BiocGenerics_0.55.3 generics_0.1.4
## [11] MatrixGenerics_1.21.0 matrixStats_1.5.0
## [13] BiocStyle_2.37.1
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 jsonlite_2.0.0 magrittr_2.0.4
## [4] magick_2.9.0 farver_2.1.2 rmarkdown_2.30
## [7] vctrs_0.6.5 tinytex_0.57 htmltools_0.5.8.1
## [10] S4Arrays_1.9.1 BiocBaseUtils_1.11.2 itertools_0.1-3
## [13] missForest_1.5 decoupleR_2.15.0 SparseArray_1.9.1
## [16] sass_0.4.10 parallelly_1.45.1 KernSmooth_2.23-26
## [19] bslib_0.9.0 htmlwidgets_1.6.4 sandwich_3.1-1
## [22] impute_1.83.0 zoo_1.8-14 cachem_1.1.0
## [25] igraph_2.2.0 mime_0.13 lifecycle_1.0.4
## [28] piano_2.25.0 iterators_1.0.14 pkgconfig_2.0.3
## [31] Matrix_1.7-4 R6_2.6.1 fastmap_1.2.0
## [34] shiny_1.11.1 clue_0.3-66 digest_0.6.37
## [37] pcaMethods_2.1.0 labeling_0.4.3 randomForest_4.7-1.2
## [40] abind_1.4-8 compiler_4.5.1 rngtools_1.5.2
## [43] proxy_0.4-27 withr_3.0.2 doParallel_1.0.17
## [46] marray_1.87.0 S7_0.2.0 BiocParallel_1.43.4
## [49] gplots_3.2.0 MASS_7.3-65 DelayedArray_0.35.3
## [52] gtools_3.9.5 caTools_1.18.3 tools_4.5.1
## [55] httpuv_1.6.16 relations_0.6-15 glue_1.8.0
## [58] promises_1.3.3 grid_4.5.1 cluster_2.1.8.1
## [61] fgsea_1.35.8 gtable_0.3.6 class_7.3-23
## [64] preprocessCore_1.71.2 tidyr_1.3.1 data.table_1.17.8
## [67] utf8_1.2.6 XVector_0.49.1 foreach_1.5.2
## [70] pillar_1.11.1 stringr_1.5.2 limma_3.65.5
## [73] later_1.4.4 splines_4.5.1 gmm_1.9-1
## [76] dplyr_1.1.4 lattice_0.22-7 tidyselect_1.2.1
## [79] knitr_1.50 proDA_1.23.0 bookdown_0.45
## [82] imputeLCMD_2.1 xfun_0.53 shinydashboard_0.7.3
## [85] statmod_1.5.1 pheatmap_1.0.13 DT_0.34.0
## [88] visNetwork_2.1.4 stringi_1.8.7 yaml_2.3.10
## [91] evaluate_1.0.5 codetools_0.2-20 MsCoreUtils_1.21.0
## [94] tibble_3.3.0 BiocManager_1.30.26 cli_3.6.5
## [97] affyio_1.79.0 xtable_1.8-4 jquerylib_0.1.4
## [100] dichromat_2.0-0.1 Rcpp_1.1.0 norm_1.0-11.1
## [103] parallel_4.5.1 sets_1.0-25 ggplot2_4.0.0
## [106] doRNG_1.8.6.2 bitops_1.0-9 mvtnorm_1.3-3
## [109] slam_0.1-55 scales_1.4.0 tmvtnorm_1.7
## [112] affy_1.87.0 e1071_1.7-16 purrr_1.1.0
## [115] crayon_1.5.3 rlang_1.1.6 cowplot_1.2.0
## [118] fastmatch_1.1-6 vsn_3.77.0 shinyjs_2.1.0