\documentclass{article}
%\VignetteIndexEntry{iPAC: identification of Protein Amino acid Mutations}
%\VignetteDepends{gdata, scatterplot3d, Biostrings, multtest}
%\VignetteKeywords{Clusters, Amino Acids, Alignment, CIF,Somatic Mutations, NMC}
%\VignettePackage{iPac}

%% packages
\usepackage{graphicx}
\usepackage{natbib}
\usepackage{subfigure}
\usepackage{float}
\usepackage{caption}
\usepackage{Sweave}

\def \iPAC{\textbf{iPAC}}

\begin{document}

\title{\iPAC{}: identification of Protein Amino acid Mutations }
\author{Gregory Ryslik \\ Yale University \\ gregory.ryslik@yale.edu
\and Hongyu Zhao \\ Yale University \\ hongyu.zhao@yale.edu}
\maketitle

\begin{abstract}
The \iPAC{} package provides a novel tool to identify somatic mutation clustering of amino acids while taking into account the protein's three-dimensional structure. Currently, \iPAC{} maps the protein's amino acids into a one-dimensional space while preserving, as far as possible, the local neighbor relationships of the three-dimensional structure. Mutation clusters are then found by testing whether pairwise mutations are closer together than expected by chance alone, using the \emph{Nonrandom Mutation Clustering} (NMC) algorithm \citep{ye_2010}. Finally, the clustering results are mapped back onto the original protein and reported to the user. A paper detailing this methodology and results is currently in preparation. \textit{Additional methodologies based on different algorithms will be added in the future.}
\end{abstract}

\section{Introduction}

Recently, there have been significant pharmacological advances in treating oncogenic driver mutations \citep{croce_oncogenes_2008}. Several methods that rely on amino acid mutational clusters have been developed to identify these mutations. One of the most recent methods was presented by \citet{ye_2010}.
Their algorithm identifies mutation clusters by calculating whether pairwise mutations are closer on the line than expected by chance alone, assuming that each amino acid has an equal probability of mutation. Because their algorithm treats the protein in linear form, it can miss clusters that are close together in 3D space but far apart in 1D space. This package is specifically designed to overcome this limitation.

Currently, this package has two methods that deal with the 3D structure of the protein: 1) linear and 2) MDS \citep{borg_modern_1997}. The user should primarily use MDS as it is more statistically rigorous. We include the linear method to illustrate that the package is flexible: users who want to map the protein to 1D space with their own algorithm can do so. Users who want to contribute to the code base should contact the author.

\section{The NMC Algorithm}

The NMC algorithm, proposed by \citet{ye_2010}, finds mutational clusters when the protein is considered to be a straight line. While the full algorithm is presented in their paper, we provide a brief overview here for completeness.

Suppose that the protein is $N$ amino acids long and that each amino acid has probability $\frac{1}{N}$ of mutation. We can then construct order statistics over many samples as follows:

\begin{figure}[h!]
\centering
\includegraphics{FakeProtein2.jpg}
\caption{Three samples of the same protein. An asterisk above a number indicates a nonsynonymous mutation in that sample for that amino acid.}
\label{fig1}
\end{figure}

Let $n$ be the total number of mutations observed across all samples, and let $X_{(1)} \leq \dots \leq X_{(n)}$ denote their amino acid positions in sorted order. Letting $R_{k,i} = X_{(k)}-X_{(i)}$, one can test whether $\Pr(R_{k,i}\leq r) \leq \alpha$ using well-known results about order statistics of the uniform distribution. While discrete formulas exist for $\Pr(R_{k,i}\leq r)$, they are often too costly to calculate when $R_{k,i}>1$.
In these cases, we scale the protein onto the interval $(0,1)$ by calculating $\Pr\left(\frac{X_{(k)}-X_{(i)}}{N} \leq r\right)$ which, treating the scaled positions as uniform on $(0,1)$, equals $\Pr(\mathrm{Beta}(k-i, i+n-k+1) \leq r)$. Finally, since this calculation is done for every pair of mutations in the protein, a multiple comparisons adjustment is performed.

The original NMC algorithm is included in this package via the \emph{nmc} command. We provide an example of its use below.

First, we load \iPAC{} and then the mutation matrix. The mutation matrix is a matrix of 0's and 1's in which each column represents an amino acid in the protein and each row represents a sample (or a mutation). Thus, the entry in row $i$, column $j$ corresponds to the $i$th sample (or mutation) and the $j$th amino acid.

\begin{verbatim}
Code Example 1: Running the NMC algorithm
\end{verbatim}
<