\documentclass{article}

%\VignetteIndexEntry{iPAC: identification of Protein Amino acid Mutations}
%\VignetteDepends{gdata, scatterplot3d, Biostrings, multtest}
%\VignetteKeywords{Clusters, Amino Acids, Alignment, CIF,Somatic Mutations, NMC}
%\VignettePackage{iPac}

%% packages
\usepackage{graphicx}
\usepackage{natbib}
\usepackage{subfigure}
\usepackage{float}
\usepackage{caption}
\usepackage{Sweave}


\def \iPAC{\textbf{iPAC}}

\begin{document}

  \title{\iPAC{}: identifcation of Protein Amino acid Mutations }
  \author{Gregory Ryslik  \\ Yale University  \\ gregory.ryslik@yale.edu
          \and
            Hongyu Zhao \\ Yale University \\ hongyu.zhao@yale.edu}

\maketitle

\begin{abstract}
  
  The \iPAC{} package provides a novel tool to identify somatic mutation clustering of amino acids while taking into account their three dimensional structure. Currently, \iPAC{} maps the protein's amino acids into a one dimensional space while preserving, as best as possible, the three dimensional local neighbor relationships. Mutation clusters are then found by considering if pairwise mutations are closer together than expected by chance alone via the the \emph{Nonrandom Mutation Clustering} (NMC) algorithm \citep{ye_2010}. Finally, the clustering results are mapped back onto the original protein and reported back to the user. A paper detailing this methodology and results is currently in preparation. \textit{Additional methodologies based on different algorithms will be added in the future.}
\end{abstract}

\section{Introduction}

Recently, there have been significant pharmacological advances in treating oncongenic driver mutations \citep{croce_oncogenes_2008}. Several methods that rely on amino acid mutational clusters have been developed in order to identify these mutations. One of the most recent methods was presented by \citet{ye_2010}. Their algorithm identifies mutation clusters by calculating whether pairwise mutations are closer on the the line than expected by chance alone when assuming that each amino acid has an equal probability of mutation. As their algorithm relies on considering the protein in linear form, it can potentially exclude clusters that are close together in 3D space but far apart in 1D space. This package is specifically designed to overcome this limitation.

Currently, this package has two methods that deal with the 3D structure of the protein: 1) linear and 2) MDS \citep{borg_modern_1997}. The user should primarily use MDS as it is more statistically rigorous. We include the linear method as an example that the general package is itself flexible. Should the user want to map the protein to 1D space using their own algorithm, they can thus do so. 

If users want to contribute to the code base, please contact the author.

\section{The NMC Algorithm}

The NMC algorithm, proposed by \citet{ye_2010}, finds mutational clusters when the protein is considered to be a straight line. While the full alogrithm is presented in their paper, we provide a brief overview here for completeness.

Suppose that the protein was $N$ amino acids long and that each amino acid had a $\frac{1}{N}$ probability of mutation. We can then construct order statistics over many samples as follows:

\begin{figure}[h!]
  \centering
  \includegraphics{FakeProtein2.jpg}
  \caption{Three samples of the same protein. An asterisk above a number indicates a non-synonomous mutation in that sample for that amino acid.}
  \label{fig1}
\end{figure}

Letting $R_{k,i} = X_{(k)}-X_{(i)}$, one can calculate if the $Pr(R_{k,i}\leq r) \leq \alpha$ using well known results about order statistics on the uniform distribution. While discrete formulas exist for $Pr(R_{k,i}\leq r)$, they are often too costly to calculate when $R_{k,i}>1$. In these cases, we scale the protein onto the interval (0,1) by calculating $Pr(\frac{X_{(k)}-X_{(i)}}{N} \leq r)$ which turns out to equal $ Pr(Beta(k-i, i+n-k+1) \leq r)$.  Finally, since this calculation is done for every pair of mutations in the protein, a multiple comparisons adjustment is performed.

The original NMC algorithm is included in this package via the \emph{nmc} command. We provide an example of its use below.

First, we load \iPAC{} and then the mutation matrix. The mutation matrix is a matrix of 0's and 1's where each column represents an amino acid in the protein and each row represents a sample (or a mutation). Thus, the entry for row i column j, represents the ith sample (or mutation) and the jth amino acid.

\begin{verbatim}
Code Example 1: Running the NMC algorithm
\end{verbatim}
<<label = Example1, echo= true, eval=FALSE>>=
library(iPAC)
#For more information on the mutations matrix, 
#type ?KRAS.Mutations after executing the line below.
data(KRAS.Mutations)
nmc(KRAS.Mutations, alpha = 0.05, multtest = "Bonferroni")
@

\begin{Schunk}
\begin{Soutput}
     cluster_size start end number       p_value
V12             2    12  13    131 1.979447e-235
V12             1    12  12    100 6.486735e-188
V12            11    12  22    132 3.220145e-145
V12            12    12  23    133 6.524053e-142
V12            50    12  61    138  4.338908e-65
V13             1    13  13     31  2.732914e-39
V12           106    12 117    139  2.341227e-23
V12           135    12 146    149  1.356584e-20
V13            10    13  22     32  4.487362e-12
V13            11    13  23     33  1.279256e-11
V146            1   146 146     10  1.918440e-08
\end{Soutput}
\end{Schunk}


The results from \emph{Code Example 1} show all the statistically significant clusters found, including the size of the cluster, the start and end positions and the number of mutations in that cluster.

\section{Remapping Algorithm}
\subsection{Matching the Mutation and Position Information}
Before we can run the 3D clustering algorithm while, we first need to come up with an alignment between the mutational information provided by a source such as COSMIC \citep{forbes_catalogue_2008} and the positional information provided by a source such as the PDB \citep{pdb}. Such an alignment is necessary because mutational information is typically provided on the ``canonical" amino acid numbering which often differs from the numbering used in the PDB database. Thus amino acid {\#}i from the PDB database might not be amino acid {\#}i from the mutational database.

To solve this problem, we consider the mutational database to contain the ``canonical" ordering of the protein. We then attempt to map the structural information to the canonical ordering and create a new matrix of residues, their canonical counts, and their positions in 3D space. If successful, we then have a relational structure between the two databases allowing us to refer to amino acid {\#}i where i represents the same amino acid in both databases.

We have created two methods that allow one to construct such a matrix: 1)\emph{get.Positions} and 2)\emph{get.AlignedPositions}. 

The first method, \emph{get.Positions} attemps to create the position matrix directly from the CIF file in the PDB database. It returns a list of several items, the first of which is \$Positions, which must later be passed to the \emph{ClusterFind} method. Due to the complexity of CIF files, \emph{get.Positions} currently works on approximately 70\% of the structures in the PDB database.

The second method, \emph{get.AlignedPositions} performs a pairwise alignment algorithm to align the canonical protein ordering with the XYZ positions in the PDB. Since \emph{get.AlignedPositions} runs an alignment algorithm, the ordering might not be perfect and we recommend the user to verify the results. However, from our testing, the alignment procedure works quite well. Furthermore, since \emph{get.AlignedPositions} does not have to consider as many aspects of the CIF file, it is more robust and often works when \emph{get.Positions} fails. 

Let us first consider the \emph{get.Positions} function. We will consider three examples, one for KRAS protein and two for the PIK3CA protein. For each example, we need to input the location of the CIF file (this holds the structural information), the location of the FASTA file (this holds the canonical protein sequence) and the sidechain that we want to use from the CIF file.

As the entire position sequence is too long to print, we first save the result and then print the first 10 rows of the position matrix. The remaining elements of the result are printed in full.

\begin{verbatim}
Code Example 2: Extracting positions using the get.Positions function
\end{verbatim}
<<label=Example2,echo= true, eval = FALSE>>=
library(iPAC)
CIF<-"https://files.rcsb.org/view/3GFT.cif"
Fasta<-"https://www.uniprot.org/uniprot/P01116-2.fasta"
KRAS.Positions<-get.Positions(CIF, Fasta, "A")
names(KRAS.Positions)
@
\begin{Schunk}
\begin{Soutput}
[1] "Positions"         "External.Mismatch" "PDB.Mismatch"     
[4] "Result"           
\end{Soutput}
\end{Schunk}

<<label=Example2a,echo= true, eval = FALSE>>=
KRAS.Positions$Positions[1:10,]
@
\begin{Schunk}
\begin{Soutput}
   Residue Can.Count SideChain XCoord  YCoord ZCoord
1      MET         1         A 62.935  97.579 30.223
2      THR         2         A 63.155  95.525 27.079
3      GLU         3         A 65.289  96.895 24.308
4      TYR         4         A 64.899  96.220 20.615
5      LYS         5         A 67.593  96.715 18.023
6      LEU         6         A 65.898  97.863 14.816
7      VAL         7         A 67.664  98.557 11.533
8      VAL         8         A 66.263 100.550  8.617
9      VAL         9         A 67.484  99.500  5.194
10     GLY        10         A 66.575 100.328  1.605
\end{Soutput}
\end{Schunk}


<<label=Example2b,echo= true, eval = FALSE>>=
KRAS.Positions$External.Mismatch
@
\begin{Schunk}
\begin{Soutput}
  PDB.Residue Canonical.Residue Canonical.Num
1           H                 Q            61
\end{Soutput}
\end{Schunk}

<<label=Example2c,echo= true, eval = FALSE>>=
KRAS.Positions$PDB.Mismatch
@
\begin{Schunk}
\begin{Soutput}
   PDB.Residue Canonical.Residue Canonical.Num         Remark
19           H                 Q            61 SEE REMARK 999
\end{Soutput}
\end{Schunk}


<<label=Example2d,echo= true, eval = FALSE>>=
KRAS.Positions$Result
@
\begin{Schunk}
\begin{Soutput}
[1] "OK"
\end{Soutput}
\end{Schunk}

Observe that the final element in \emph{Code Example 2} is ``OK". That is because the only mismatched residue (at position 61), was documented in the CIF file as well.
Thus it is considered a ``reconciled" mismatch. It is up to the user to decide if they want to include it in the position sequence that is passed on to the \emph{ClusterFind} method or to remove it.


\begin{verbatim}
Code Example 3: Final example of the get.Positions function
\end{verbatim}
<<label = Example4, echo= true, eval=FALSE>>=
CIF <- "https://files.rcsb.org/view/2RD0.cif"
Fasta <- "https://www.uniprot.org/uniprot/P42336.fasta"
PIK3CAV2.Positions <- get.Positions(CIF, Fasta, "A")
names(PIK3CAV2.Positions)
@
\begin{Schunk}
\begin{Soutput}
[1] "Positions"         "External.Mismatch" "PDB.Mismatch"     
[4] "Result"           
\end{Soutput}
\end{Schunk}

<<label = Example4a, echo=true,eval=FALSE>>=
PIK3CAV2.Positions$Positions[1:10,]
@
\begin{Schunk}
\begin{Soutput}
   Residue Can.Count SideChain XCoord YCoord  ZCoord
1      GLY         8         A 88.344 61.306 112.918
2      GLU         9         A 90.119 58.543 111.029
3      LEU        10         A 92.954 56.400 109.709
4      TRP        11         A 93.105 53.251 107.542
5      GLY        12         A 91.616 50.221 109.372
6      ILE        13         A 90.825 52.285 112.474
7      HIS        14         A 87.540 54.192 112.953
8      LEU        15         A 88.806 56.633 115.544
9      MET        16         A 92.435 57.520 116.178
10     PRO        17         A 93.481 57.378 119.852
\end{Soutput}
\end{Schunk}
<<label = Example4b, echo=true,eval=FALSE>>=
PIK3CAV2.Positions$External.Mismatch
@
\begin{Schunk}
\begin{Soutput}
NULL
\end{Soutput}
\end{Schunk}

<<label = Example4c, echo=true,eval=FALSE>>=
PIK3CAV2.Positions$PDB.Mismatch
@
\begin{Schunk}
\begin{Soutput}
NULL
\end{Soutput}
\end{Schunk}
<<label = Example4d, echo=true,eval=FALSE>>=
PIK3CAV2.Positions$Result
@
\begin{Schunk}
\begin{Soutput}
[1] "OK"
\end{Soutput}
\end{Schunk}

Observe that the final result in \emph{Code Example 3} is ``OK". Here we use a different file location for the canonical sequence -- the UNIPROT database. Here, the canonical sequence is slightly different and matches up exactly to the extracted positions. As there is only 1 isoform listed on UNIPROT for PIK3CA we suggest using the same source for both the mutational and canonical position information. For example, if your mutation data was obtained from COSMIC, you should use COSMIC to get the canonical protein sequence. 

Let us now consider the \emph{get.AlignedPositions} function. This function automatically drops positions that do not match up.


\begin{verbatim}
Code Example 4: Extracting positions using the get.AlignedPositions function
\end{verbatim}
<<label =  Example5, echo= true,eval=FALSE>>=
CIF<- "https://files.rcsb.org/view/2RD0.cif"
Fasta <- "https://www.uniprot.org/uniprot/P42336.fasta"
PIK3CAV3.Positions<-get.AlignedPositions(CIF,Fasta , "A")
names(PIK3CAV3.Positions)
@
\begin{Schunk}
\begin{Soutput}
[1] "Positions"        "Diff.Count"       "Diff.Positions"   "Alignment.Result"
[5] "Result"          
\end{Soutput}
\end{Schunk}

<<label =  Example5a, echo= true,eval=FALSE>>=
PIK3CAV3.Positions$Positions[1:10,]
@
\begin{Schunk}
\begin{Soutput}
   Residue Can.Count SideChain XCoord YCoord  ZCoord
14     GLY         8         A 88.344 61.306 112.918
15     GLU         9         A 90.119 58.543 111.029
16     LEU        10         A 92.954 56.400 109.709
17     TRP        11         A 93.105 53.251 107.542
18     GLY        12         A 91.616 50.221 109.372
19     ILE        13         A 90.825 52.285 112.474
20     HIS        14         A 87.540 54.192 112.953
21     LEU        15         A 88.806 56.633 115.544
22     MET        16         A 92.435 57.520 116.178
23     PRO        17         A 93.481 57.378 119.852
\end{Soutput}
\end{Schunk}

<<label =  Example5b, echo= true,eval=FALSE>>=
PIK3CAV3.Positions$External.Mismatch
@
\begin{Schunk}
\begin{Soutput}
NULL
\end{Soutput}
\end{Schunk}
<<label =  Example5c, echo= true,eval=FALSE>>=
PIK3CAV3.Positions$PDB.Mismatch
@
\begin{Schunk}
\begin{Soutput}
NULL
\end{Soutput}
\end{Schunk}
<<label =  Example5d, echo= true,eval=FALSE>>=
PIK3CAV3.Positions$Result
@
\begin{Schunk}
\begin{Soutput}
[1] "OK"
\end{Soutput}
\end{Schunk}

Both \emph{get.AlignedPositions} and \emph{get.Positions} are still in beta and are provided to the user for convenience only. Changes by the PDB or COSMIC to their file structure might result in errors and it is up to the user to ensure the correct data is supplied to the \emph{ClusterFind} function. 

\subsection{Finding Clusters in 3D Space}

Now that we have the positional data, we can find the mutational clusters while taking into account the 3D protein structure. We begin by slecting a method to map the protein down to a 1D space. \newline \newline The first method, ``linear", fixes a specified point $(x_0,y_0,z_0)$ and then calculates the distance from each amino acid to that point. The amino acids are then rearranged in order from the shortest distance to the longest distance. The second method, ``MDS", uses Multidimensional Scaling \citep{borg_modern_1997} to map the protein to a 1D space. We strongly encourage the user to employ the MDS method as it is more statistically rigorous. The linear method is provided as an example to show how other mapping paradigms might be implemented. \newline

A diagram of either the MDS or linear mapping can be displayed when the \emph{ClusterFind} method is run. As the mapping algorithms are different, the mapping images created are different as well. The linear method will generate an image of the distances from $(x_0,y_0,z_0)$ to each amino acid. These distance will be drawn as dotted green lines from each amino acid to the fixed point. Conversely, the MDS methodology will create lines from each amino acid to the x-axis which will mark where on the line the amino acid is positioned. 

We begin by first running the algorithm on KRAS using the MDS method followed by the linear method. For a full list of all the possible parameters, simply type `?ClusterFind' after loading the \iPAC{} library. 


\begin{verbatim}
Code Example 5: Running the ClusterFind method with a MDS mapping
\end{verbatim}
<<label=Example6,echo=true, fig=TRUE,eval = FALSE>>=
#Extract the data from a CIF file and match it up with the canonical protein sequence.
#Here we use the 3GFT structure from the PDB, which corresponds to the KRAS protein.
CIF<-"https://files.rcsb.org/view/3GFT.cif"
Fasta<-"https://www.uniprot.org/uniprot/P01116-2.fasta"
KRAS.Positions<-get.Positions(CIF,Fasta, "A")

#Load the mutational data for KRAS. Here the mutational data was obtained from the
#COSMIC database (version 58). 
data(KRAS.Mutations)

#Identify and report the clusters. 
ClusterFind(mutation.data=KRAS.Mutations,
            position.data=KRAS.Positions$Positions,
            create.map = "Y", Show.Graph = "Y",
            Graph.Title = "MDS Mapping",
            method = "MDS")
@
\begin{Schunk}
\begin{Soutput}
[1] "Running Remapped"
[1] "Running Full"
[1] "Running Culled"
$Remapped
     cluster_size start end number       p_value
V136            1    12  12    100 8.932390e-183
V136            2    12  13    131 3.908116e-165
V121           49    13  61     38 3.097954e-124
V124          134    13 146     49 1.017093e-122
V136          106    12 117    139 1.362730e-119
V121           57    61 117      6 3.016259e-106
V124           30   117 146     11 1.666182e-102
V124          135    12 146    149  3.813215e-90
V121           50    12  61    138  2.682445e-87
V93            10    13  22     32  3.434148e-73
V93            96    22 117      8  1.140497e-65
V73            11    13  23     33  1.429105e-53
V73            95    23 117      7  4.204875e-48
V93            11    12  22    132  3.398568e-39
V142            1    13  13     31  8.571798e-38
V73            12    12  23    133  1.963564e-23
V142          105    13 117     39  4.563510e-12
V124            1   146 146     10  3.897757e-08
V121           86    61 146     16  6.115605e-07

$OriginalCulled
     cluster_size start end number       p_value
V12             2    12  13    131 9.453887e-229
V12             1    12  12    100 7.630495e-183
V12            11    12  22    132 1.554973e-138
V12            12    12  23    133 3.526333e-135
V12            50    12  61    138  2.824800e-58
V13             1    13  13     31  8.857871e-38
V12           106    12 117    139  4.538089e-17
V12           135    12 146    149  3.853241e-13
V13            10    13  22     32  8.603544e-11
V13            11    13  23     33  2.553752e-10
V146            1   146 146     10  5.331155e-08

$Original
     cluster_size start end number       p_value
V12             2    12  13    131 1.979447e-235
V12             1    12  12    100 6.486735e-188
V12            11    12  22    132 3.220145e-145
V12            12    12  23    133 6.524053e-142
V12            50    12  61    138  4.338908e-65
V13             1    13  13     31  2.732914e-39
V12           106    12 117    139  2.341227e-23
V12           135    12 146    149  1.356584e-20
V13            10    13  22     32  4.487362e-12
V13            11    13  23     33  1.279256e-11
V146            1   146 146     10  1.918440e-08

$MissingPositions
     LHS RHS
[1,] 168 188
\end{Soutput}
\end{Schunk}

\includegraphics{iPAC-Example6}

\begin{verbatim}
Code Example 6: Running the ClusterFind method with a Linear mapping
\end{verbatim}
<<label=Example7,echo=true, fig=TRUE,eval=FALSE>>=
#Extract the data from a CIF file and match it up with the canonical protein sequence.
#Here we use the 3GFT structure from the PDB, which corresponds to the KRAS protein.
CIF<-"https://files.rcsb.org/view/3GFT.cif"
Fasta<-"https://www.uniprot.org/uniprot/P01116-2.fasta"
KRAS.Positions<-get.Positions(CIF,Fasta, "A")

#Load the mutational data for KRAS. Here the mutational data was obtained from the
#COSMIC database (version 58). 
data(KRAS.Mutations)

#Identify and report the clusters. 
ClusterFind(mutation.data=KRAS.Mutations,
            position.data=KRAS.Positions$Positions,
            create.map = "Y", Show.Graph = "Y",
            Graph.Title = "Linear Mapping",
            method = "Linear")
@
\begin{Schunk}
\begin{Soutput}
[1] "Running Remapped"
[1] "Running Full"
[1] "Running Culled"
$Remapped
     cluster_size start end number       p_value
V90             1    12  12    100 8.862875e-183
V90             2    12  13    131 2.234056e-175
V90           135    12 146    149 2.965106e-158
V90            50    12  61    138 1.092881e-149
V90            11    12  22    132  9.127863e-90
V66            57    61 117      6  3.393497e-89
V90            12    12  23    133  3.393497e-89
V66            30   117 146     11  2.804316e-86
V66           105    13 117     39  5.838579e-79
V66            96    22 117      8  8.125462e-61
V66            95    23 117      7  2.427091e-60
V66           106    12 117    139  2.047918e-48
V95             1    13  13     31  8.857871e-38
V95           134    13 146     49  1.438417e-26
V95            49    13  61     38  9.252899e-22
V100            1   146 146     10  3.897757e-08
V100           86    61 146     16  1.607373e-05
V95            11    13  23     33  8.412719e-04
V95            10    13  22     32  8.663956e-04

$OriginalCulled
     cluster_size start end number       p_value
V12             2    12  13    131 9.453887e-229
V12             1    12  12    100 7.630495e-183
V12            11    12  22    132 1.554973e-138
V12            12    12  23    133 3.526333e-135
V12            50    12  61    138  2.824800e-58
V13             1    13  13     31  8.857871e-38
V12           106    12 117    139  4.538089e-17
V12           135    12 146    149  3.853241e-13
V13            10    13  22     32  8.603544e-11
V13            11    13  23     33  2.553752e-10
V146            1   146 146     10  5.331155e-08

$Original
     cluster_size start end number       p_value
V12             2    12  13    131 1.979447e-235
V12             1    12  12    100 6.486735e-188
V12            11    12  22    132 3.220145e-145
V12            12    12  23    133 6.524053e-142
V12            50    12  61    138  4.338908e-65
V13             1    13  13     31  2.732914e-39
V12           106    12 117    139  2.341227e-23
V12           135    12 146    149  1.356584e-20
V13            10    13  22     32  4.487362e-12
V13            11    13  23     33  1.279256e-11
V146            1   146 146     10  1.918440e-08

$MissingPositions
     LHS RHS
[1,] 168 188
\end{Soutput}
\end{Schunk}
\includegraphics{iPAC-Example7}

As can be seen from \emph{Code Examples 5} and \emph{6} above, the \emph{ClusterFind} method returns a list of four elements. The first element, \$Remapped, displays all the clusters found while taking into account the 3D structure of the protein by utilizing either the linear or MDS methodology. The next element, \$OriginalCulled, displays all the clusters found using the original NMC algorithm after removing all the amino acids for which we do not have $(x,y,z)$ positions. The \$Original element displays all the clusters found using the NMC algorithm without removing the amino acids for which we do not have 3D positional information. 

If the user wants to compare the results generated when taking protein structure into account versus those when protein structure is ignored, it is recommended that the user compare the matrices in \$Remapped versus \$OriginalCulled. In this way, the user is considering the differences that arise strictly from the protein structure as the amino acids with missing 3D positions have been removed prior to the analysis.

Finally, the \$MissingPositions element displays a matrix of all the amino acids for which we had mutational data but for which we did not have positional data. For instance, in \emph{Code Example 5}, the mutation data matrix had 188 columns while we were able to extract positional information only for amino acids 1-167. Furthermore, amino acid 61 was excluded from the final position matrix when the \emph{get.AlignedPositions} function was run as the amino acid listed in the CIF file did not match the canonical sequence in the FASTA file. As such, the \$MissingPositions element has a matrix of 2 rows as follows:

\begin{verbatim}
$MissingPositions
     LHS RHS
[1,]  61  61
[2,] 168 188
\end{verbatim}


\bibliography{refs}{}
\bibliographystyle{plainnat}

\end{document}