1 Introduction

Johnson et al. (2023) published substrate affinities for 303 human serine/threonine-specific kinases in the form of position-specific weight matrices (PWMs). The JohnsonKinaseData package provides access to these PWMs, together with basic functionality for matching user-provided phosphosites against all kinase PWMs. The aim is to give the user a simple way of predicting kinase-substrate relationships based on PWM-phosphosite matching. These predictions can serve to infer kinase activity from differential phospho-proteomic data.

2 Installation

The JohnsonKinaseData package can be installed using the following code:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("ExperimentHub")
BiocManager::install("JohnsonKinaseData")

3 Loading kinase PWMs

The kinase PWMs can be accessed with the getKinasePWM() function. It returns a list of 303 human serine/threonine-specific PWMs.

library(JohnsonKinaseData)
pwms <- getKinasePWM()
#> see ?JohnsonKinaseData and browseVignettes('JohnsonKinaseData') for documentation
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache

head(names(pwms))
#> [1] "AAK1"   "ACVR2A" "ACVR2B" "AKT1"   "AKT2"   "AKT3"

Each PWM is a numeric matrix with amino acids as rows and positions as columns. Matrix elements are log2-odds scores measuring differential affinity relative to a random frequency of amino acids (Johnson et al. 2023).

pwms[["PLK2"]]
#>             -5           -4          -3         -2           -1           0
#> A -0.036821844 -0.277009455 -0.83856373 -0.4463446 -0.186229068          NA
#> C  0.009633819 -0.034899138 -0.24690897  0.4799548 -0.467333943          NA
#> D  0.549718451  0.795766948  0.82130204  1.6459783  1.329410671          NA
#> E  0.614756952  1.127897364  2.86862751  1.2354207  0.689388627          NA
#> F  0.449006639  0.078199920 -0.41273103 -0.9773836 -0.602963759          NA
#> G  0.326652391 -0.151522275 -0.77793738 -0.6106535 -0.767584829          NA
#> H  0.148478616 -0.172018427 -0.67807191 -0.3219281  0.214995135          NA
#> I -0.311864412 -0.172018427 -1.65154094 -0.8406292 -0.519941731          NA
#> K -0.469329925 -0.647467443 -1.77349147 -1.7345631 -0.656307931          NA
#> L -0.245197993  0.144568518 -0.71785677  0.3032255 -0.511690664          NA
#> M -0.248793390 -0.206894852 -0.38948891  0.3123167 -0.194955239          NA
#> N -0.065823218  0.002018361 -0.54077824  0.9076598  0.307545102          NA
#> P -0.066578437 -0.108114249 -1.05139915 -0.4418303  0.542703792          NA
#> Q -0.530739153 -0.241782116 -0.48096139 -0.1800049 -0.264477823          NA
#> R -0.528032212 -0.715485867 -1.58640592 -1.1059389 -0.339345148          NA
#> S -0.065823218 -0.172018427 -0.77793738 -0.4463446 -0.194955239  0.00000000
#> T -0.065823218 -0.172018427 -0.77793738 -0.4463446 -0.194955239 -0.09585422
#> V -0.401253684 -0.367545642 -1.89324968 -1.3562361 -0.152804813          NA
#> W -0.034160317 -0.140189435 -1.05799229 -1.1256358 -1.093879047          NA
#> Y  0.083383588 -0.242293983 -1.12217724 -0.5640514 -0.004045212          NA
#> s  0.059632160  0.750692249  0.06873959  0.1075540  0.101650076          NA
#> t  0.059632160  0.750692249  0.06873959  0.1075540  0.101650076          NA
#> y  0.707878133  0.679784089  0.26351522 -0.1321035  2.184534212          NA
#>              1            2           3           4
#> A -0.812485602 -0.109981413 -0.53574997 -0.33515312
#> C -0.310253562  0.145612247  0.00000000  0.04362448
#> D -0.942307133  1.124791311  1.17957474  0.98389654
#> E -0.201410261  1.154194325  1.37389873  1.13638828
#> F  1.906390375 -0.122334266 -0.21541226 -0.12610808
#> G -0.918660373 -0.888701547 -0.30329392 -0.24827921
#> H -0.671163536 -0.002165667 -0.13020754 -0.01785518
#> I  0.374065718 -0.042308229 -0.25963366 -0.03785821
#> K -1.145924538 -2.141143704 -1.48196851 -1.17755536
#> L  0.032665112 -0.500013836 -0.19379970 -0.02664588
#> M  0.833902077  0.008200014 -0.23463499 -0.20273795
#> N -0.818579360 -0.015082595  0.07710624 -0.20706138
#> P -2.650181828 -0.911044318 -0.71667083  0.10218779
#> Q  0.266756562 -0.411003598 -0.01873185 -0.18852897
#> R -0.532824877 -1.190338611 -1.33715648 -1.18082233
#> S -0.532824877 -0.109981413 -0.21541226 -0.12610808
#> T -0.532824877 -0.109981413 -0.21541226 -0.12610808
#> V -0.008682243 -0.249993850 -0.38571419 -0.85152138
#> W -0.550465037  0.385154897  0.11769504  0.30836088
#> Y  0.360757558  0.526569660  0.07546417 -0.04751733
#> s  0.412402175  1.196984664  1.25574242  1.70655265
#> t  0.412402175  1.196984664  1.25574242  1.70655265
#> y  0.490467444  3.461305904  1.53012070  1.85199884

In addition to the 20 standard amino acids, the matrices include phosphorylated serine, threonine and tyrosine residues (rows s, t and y). These phosphorylated residues are distinct from the central phospho-acceptor (serine/threonine at position 0) and can have a strong impact on the affinity of a given kinase-substrate pair (phospho-priming).

The central phospho-acceptor site is located at position 0, where the matrix only measures the favorability of serine over threonine. The user can exclude this favorability measure by setting the parameter includeSTfavorability to FALSE, in which case the central position does not contribute to the PWM score.

pwms2 <- getKinasePWM(includeSTfavorability=FALSE)
#> see ?JohnsonKinaseData and browseVignettes('JohnsonKinaseData') for documentation
#> loading from cache
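
To make the meaning of a PWM score concrete, the following is a minimal sketch of how a log2-odds match score could be computed by hand. It assumes the score is the sum of the matrix entries selected by the residue at each position, and that NA entries and residues absent from the matrix (such as a padding character) contribute nothing. The matrix toy_pwm and the helper score_site() are hypothetical and not part of JohnsonKinaseData; the package's own scoring may differ in its details.

```r
# Toy PWM with 3 positions (-1, 0, 1) and a reduced alphabet; values are made up.
toy_pwm <- matrix(
  c( 0.5,   NA, -1.0,   # A
    -0.2,  0.0,  0.3,   # S
     1.0, -0.1,  0.2),  # T
  nrow = 3, byrow = TRUE,
  dimnames = list(c("A", "S", "T"), c("-1", "0", "1"))
)

# Hypothetical helper: sum the log2-odds entries picked out by each residue.
# Assumption: residues missing from the matrix (e.g. "_") and NA entries
# are skipped rather than penalized.
score_site <- function(pwm, site) {
  residues <- strsplit(site, "", fixed = TRUE)[[1]]
  stopifnot(length(residues) == ncol(pwm))
  rows <- match(residues, rownames(pwm))          # NA for unknown residues
  vals <- pwm[cbind(rows, seq_along(residues))]   # one entry per position
  sum(vals, na.rm = TRUE)
}

score_site(toy_pwm, "ASA")  # 0.5 + 0.0 + (-1.0)
score_site(toy_pwm, "_TT")  # padding skipped: -0.1 + 0.2
```

Under these assumptions, replacing the position-0 column with NA has the same effect as described for includeSTfavorability=FALSE: the central residue no longer changes the sum.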

4 Processing user-provided phosphosites

Phosphorylated peptides are often represented in one of two formats: (1) phosphorylated residues are marked by a trailing asterisk, as in SAGLLS*DEDC, or (2) phosphorylated residues are written as lower-case letters, as in SAGLLsDEDC. To unify the phosphosite representation for PWM matching, JohnsonKinaseData provides the function processPhosphopeptides(). It takes a character vector of phosphopeptides, aligns each peptide to its central phospho-acceptor position, and pads and/or truncates the surrounding residues so that the processed site consists of 5 upstream residues, the central acceptor and 4 downstream residues. The central phospho-acceptor position is defined as the left-closest position to the midpoint of the peptide, given by floor(nchar(sites)/2)+1.

ppeps <- c("SAGLLS*DEDC", "GDS*ND", "EKGDSN__", "___LySDEDC", "EKGtS*N")

sites <- processPhosphopeptides(ppeps)
#> Warning in processPhosphopeptides(ppeps): No S/T at central phospho-acceptor
#> position.

sites
#> # A tibble: 5 × 3
#>   sites       processed  acceptor
#>   <chr>       <chr>      <chr>   
#> 1 SAGLLS*DEDC SAGLLSDEDC S       
#> 2 GDS*ND      ___GDSND__ S       
#> 3 EKGDSN__    _EKGDSN___ S       
#> 4 ___LySDEDC  ____LYSDED Y       
#> 5 EKGtS*N     __EKGTsN__ T
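
The alignment step can be illustrated with a short base-R sketch. The helper align_site() below is hypothetical and not part of the package. The rule it uses for peptides with marked phosphorylated residues (pick the marked residue closest to the midpoint, preferring the left one on ties, otherwise fall back to the midpoint itself) is an assumption inferred from the example outputs shown here, not taken from the package source.

```r
# Hypothetical re-implementation of the alignment, for illustration only.
align_site <- function(peptide, upstream = 5L, downstream = 4L) {
  # Convert the asterisk notation ("S*") to the lower-case notation ("s").
  pep <- gsub("([A-Za-z])\\*", "\\L\\1", peptide, perl = TRUE)
  chars <- strsplit(pep, "", fixed = TRUE)[[1]]

  # Midpoint as described in the text.
  mid <- floor(length(chars) / 2) + 1

  # Assumption: if phosphorylated residues are marked, the acceptor is the
  # marked residue closest to the midpoint (leftmost on ties).
  phos <- which(chars %in% c("s", "t", "y"))
  center <- if (length(phos) > 0) phos[which.min(abs(phos - mid))] else mid

  # Pad both ends, then cut a window of upstream + 1 + downstream residues.
  padded <- c(rep("_", upstream), chars, rep("_", downstream))
  window <- padded[center:(center + upstream + downstream)]

  # The central acceptor is written in upper case in the processed site.
  window[upstream + 1] <- toupper(window[upstream + 1])
  paste(window, collapse = "")
}

ppeps <- c("SAGLLS*DEDC", "GDS*ND", "EKGDSN__", "___LySDEDC", "EKGtS*N")
vapply(ppeps, align_site, character(1), USE.NAMES = FALSE)
```

For these five peptides the sketch yields the same strings as the processed column above. It does not issue the warning for non-S/T acceptors and performs no input validation.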

If a peptide contains several phosphorylated residues, the option onlyCentralAcceptor controls how the acceptor position is selected. Setting onlyCentralAcceptor=FALSE returns all possible aligned phosphosites for a given input peptide. Note that in this case the output is no longer parallel to the input.

processPhosphopeptides("KART*LLS*DEC")
#> # A tibble: 1 × 3
#>   sites        processed  acceptor
#>   <chr>        <chr>      <chr>   
#> 1 KART*LLS*DEC ARtLLSDEC_ S
processPhosphopeptides("KART*LLS*DEC", onlyCentralAcceptor=FALSE)
#> # A tibble: 2 × 3
#>   sites        processed  acceptor
#>   <chr>        <chr>      <chr>   
#> 1 KART*LLS*DEC __KARTLLsD T       
#> 2 KART*LLS*DEC ARtLLSDEC_ S
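
Because the output is not parallel to the input in this mode, it can help to map each output row back to its input peptide through the sites column. The sketch below uses a hand-built data frame in place of the real function output: the first two rows are copied from the example above, and the third row for GDS*ND is an assumption about what the function would return for that peptide.

```r
# Stand-in for the result of processPhosphopeptides(input,
# onlyCentralAcceptor = FALSE); the GDS*ND row is assumed, not computed.
input <- c("KART*LLS*DEC", "GDS*ND")
res <- data.frame(
  sites     = c("KART*LLS*DEC", "KART*LLS*DEC", "GDS*ND"),
  processed = c("__KARTLLsD", "ARtLLSDEC_", "___GDSND__"),
  acceptor  = c("T", "S", "S")
)

# Index of the originating input peptide for every output row.
res$input_index <- match(res$sites, input)

# Processed sites grouped by input peptide, kept in input order.
by_input <- split(res$processed, factor(res$sites, levels = input))
lengths(by_input)
```

Note that match() returns the first matching position, so duplicated input peptides would all map to their first occurrence.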

5 Scoring of user-provided phosphosites

Once peptides are processed to sites, the function scorePhosphosites() can be used to create a matrix of kinase-substrate match scores.

selected <- sites |> 
  dplyr::filter(acceptor %in% c('S','T')) |> 
  dplyr::pull(processed)

scores <- scorePhosphosites(pwms, selected)

dim(scores)
#> [1]   4 303

head(scores[,1:5])
#>                 AAK1     ACVR2A     ACVR2B       AKT1       AKT2
#> SAGLLSDEDC -6.794078 -0.1666423  0.3039018 -5.8821117 -4.7783302
#> ___GDSND__ -8.107231 -1.0652463 -0.6211398 -2.2011502 -1.7940957
#> _EKGDSN___ -8.274386 -1.5402977 -0.9296051 -0.6188352 -0.8554523
#> __EKGTsN__ -2.159839  0.7307256  0.8912120 -2.7357203 -1.2022251
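
As a small illustration of how such a score matrix might be used downstream, the sketch below picks the highest-scoring kinase for each site. The matrix scores_toy is built by hand from the rounded values printed above, restricted to the five kinases shown, so the result only ranks among those five. Whether raw log2-odds scores are comparable across kinases is not addressed here; this is a sketch of the matrix handling only.

```r
# Rounded scores copied from the five columns printed above.
scores_toy <- matrix(
  c(-6.79, -0.17,  0.30, -5.88, -4.78,
    -8.11, -1.07, -0.62, -2.20, -1.79,
    -8.27, -1.54, -0.93, -0.62, -0.86,
    -2.16,  0.73,  0.89, -2.74, -1.20),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("SAGLLSDEDC", "___GDSND__", "_EKGDSN___", "__EKGTsN__"),
    c("AAK1", "ACVR2A", "ACVR2B", "AKT1", "AKT2")
  )
)

# For each site (row), report the kinase (column) with the highest score.
best <- colnames(scores_toy)[max.col(scores_toy, ties.method = "first")]
names(best) <- rownames(scores_toy)
best
```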

The PWM scoring can be parallelized by supplying a BiocParallelParam object to BPPARAM=.

scores <- scorePhosphosites(pwms, selected, BPPARAM=BiocParallel::SerialParam())

By default, the resulting score is the log2-odds score of the PWM. Alternatively, setting scoreType="percentile" returns the percentile rank of the log2-odds score. For each PWM, the rank is calculated against a background score distribution derived by matching that PWM to the 85,603 unique phosphosites published in Johnson et al. (2023).

scores <- scorePhosphosites(pwms, selected, scoreType="percentile")
#> see ?JohnsonKinaseData and browseVignettes('JohnsonKinaseData') for documentation
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache

head(scores[,1:5])
#>                 AAK1   ACVR2A   ACVR2B     AKT1     AKT2
#> SAGLLSDEDC 22.375586 79.73910 83.79933 14.73447 14.59609
#> ___GDSND__  9.030272 67.08203 74.19169 64.63912 64.38251
#> _EKGDSN___  7.927565 57.36739 69.80942 79.14942 74.56646
#> __EKGTsN__ 83.891454 87.55389 88.32535 57.76235 71.31353

Quantifying PWM matches by percentile rank was first described in Yaffe et al. 2001. It is also the matching score underlying the kinase activity predictions published in Johnson et al. (2023).

Note that these percentile ranks cannot account for phospho-priming, because non-central phosphorylated residues were missing from the background sites published in Johnson et al. (2023). In other words, the score distributions derived from the background sites do not reflect the impact of phospho-priming.
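
The idea of a percentile rank can be sketched in a few lines of base R. The background scores below are made up, and the helper percentile_rank() is hypothetical. It assumes the rank is the percentage of background scores less than or equal to the query score; the package may handle ties or interpolation differently.

```r
# Made-up background log2-odds scores for a single PWM.
background <- c(-8.1, -6.5, -4.2, -3.0, -1.7, -0.9, 0.2, 1.1, 2.4, 3.8)

# Hypothetical helper: percentage of background scores <= the query score,
# computed from the empirical cumulative distribution function.
percentile_rank <- function(score, background) {
  100 * stats::ecdf(background)(score)
}

percentile_rank(c(-5.0, 0.2, 4.0), background)
```

Because each PWM gets its own background distribution, the resulting ranks lie on a common 0 to 100 scale regardless of the range of the underlying log2-odds scores.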

6 Session info

sessionInfo()
#> R Under development (unstable) (2024-03-18 r86148)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] JohnsonKinaseData_0.99.0 BiocStyle_2.31.0        
#> 
#> loaded via a namespace (and not attached):
#>  [1] KEGGREST_1.43.0         xfun_0.42               bslib_0.6.1            
#>  [4] Biobase_2.63.0          vctrs_0.6.5             tools_4.4.0            
#>  [7] generics_0.1.3          stats4_4.4.0            curl_5.2.1             
#> [10] parallel_4.4.0          tibble_3.2.1            fansi_1.0.6            
#> [13] AnnotationDbi_1.65.2    RSQLite_2.3.5           blob_1.2.4             
#> [16] pkgconfig_2.0.3         checkmate_2.3.1         dbplyr_2.5.0           
#> [19] S4Vectors_0.41.5        lifecycle_1.0.4         GenomeInfoDbData_1.2.11
#> [22] stringr_1.5.1           compiler_4.4.0          Biostrings_2.71.4      
#> [25] codetools_0.2-19        GenomeInfoDb_1.39.9     htmltools_0.5.7        
#> [28] sass_0.4.9              yaml_2.3.8              tidyr_1.3.1            
#> [31] pillar_1.9.0            crayon_1.5.2            jquerylib_0.1.4        
#> [34] BiocParallel_1.37.1     cachem_1.0.8            mime_0.12              
#> [37] ExperimentHub_2.11.1    AnnotationHub_3.11.3    tidyselect_1.2.1       
#> [40] digest_0.6.35           stringi_1.8.3           purrr_1.0.2            
#> [43] dplyr_1.1.4             bookdown_0.38           BiocVersion_3.19.1     
#> [46] fastmap_1.1.1           cli_3.6.2               magrittr_2.0.3         
#> [49] utf8_1.2.4              withr_3.0.0             backports_1.4.1        
#> [52] filelock_1.0.3          rappdirs_0.3.3          bit64_4.0.5            
#> [55] rmarkdown_2.26          XVector_0.43.1          httr_1.4.7             
#> [58] bit_4.0.5               png_0.1-8               memoise_2.0.1          
#> [61] evaluate_0.23           knitr_1.45              IRanges_2.37.1         
#> [64] BiocFileCache_2.11.1    rlang_1.1.3             glue_1.7.0             
#> [67] DBI_1.2.2               BiocManager_1.30.22     BiocGenerics_0.49.1    
#> [70] jsonlite_1.8.8          R6_2.5.1                zlibbioc_1.49.3

Johnson, Jared L., Tomer M. Yaron, Emily M. Huntsman, Alexander Kerelsky, Junho Song, Amit Regev, Ting-Yu Lin, et al. 2023. “An Atlas of Substrate Specificities for the Human Serine/Threonine Kinome.” Nature 613 (7945): 759–66. https://doi.org/10.1038/s41586-022-05575-3.