\documentclass{article}
%\VignetteIndexEntry{SpacePAC: Identifying mutational clusters in 3D protein space using simulation}
%\VignetteDepends{iPAC}
%\VignetteKeywords{Clusters, Amino Acids, Alignment, CIF,Somatic Mutations, NMC}
%\VignettePackage{SpacePac}

%% packages
\usepackage{graphicx}
\usepackage{natbib}
\usepackage{subfigure}
\usepackage{float}
\usepackage{caption}
\usepackage[nogin]{Sweave}
\usepackage{booktabs}
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{amsmath}
\usepackage{url}
\usepackage{placeins}

\def \GraphPAC{\textbf{GraphPAC}}
\def \iPAC{\textbf{iPAC}}
\def \SpacePAC{\textbf{SpacePAC}}
\def \TSP{\textbf{TSP}}
\def \igraph{\textbf{igraph}}
\def \QuartPAC{\textbf{QuartPAC}}

\begin{document}
\SweaveOpts{concordance=TRUE}

\title{\QuartPAC{}: Identifying mutational clusters while utilizing protein quaternary structural data}
\author{Gregory Ryslik \\ Genentech \\ gregory.ryslik@yale.edu
\and Yuwei Cheng \\ Yale University \\ yuwei.cheng@yale.edu
\and Hongyu Zhao \\ Yale University \\ hongyu.zhao@yale.edu}
\maketitle

\begin{abstract}
\end{abstract}

The \QuartPAC{} package is designed to identify mutated amino acid hotspots while accounting for protein quaternary structure. It is meant to work in conjunction with the \iPAC{} \citep{iPAC}, \GraphPAC{} \citep{GraphPAC} and \SpacePAC{} \citep{SpacePAC} packages already available through Bioconductor. Specifically, the package takes as input the quaternary protein structure as well as the mutational data for each subunit of the assembly. It then maps the mutational data onto the protein and runs the algorithms described in \iPAC{}, \GraphPAC{} and \SpacePAC{} to report the statistically significant clusters. By integrating the quaternary structure, \QuartPAC{} may identify additional clusters that only become apparent when the different protein subunits are considered together.

% As described by \citep{wagner_rapid_2007, zhou_detecting_2008,ye_2010,ipac_paper_2013,graphpac_paper_2013,spacepac_paper_2014}, mutational clustering may be a sign of positive selection of protein function and/or activating driver mutations.

\section{Introduction}
\label{intro}

Recent advances in oncogenic pharmacology \citep{croce_oncogenes_2008} have led to the creation of a variety of methods that attempt to identify mutational hotspots, as these hotspots are often indicative of driver mutations \citep{wagner_rapid_2007, zhou_detecting_2008,ye_2010}. Three recent methods, \iPAC{}, \GraphPAC{} and \SpacePAC{}, provide approaches to identify such hotspots while accounting for protein tertiary structure. While it has been shown that these methods provide an improvement over linear clustering methods \citep{ipac_paper_2013,graphpac_paper_2013,spacepac_paper_2014}, they nevertheless consider only tertiary structure. \QuartPAC{} preprocesses the entire assembly structure so that these approaches can be run accurately on the quaternary protein unit. This allows for the identification of additional mutational clusters that may otherwise be missed if only one protein subunit is considered at a time.

%Continue here
In order to run \QuartPAC, four sources of data are required:
\begin{itemize}
\item The amino acid sequence of the protein, which is obtained from the UniProt database (\url{uniprot.org}) in FASTA format.
\item The protein tertiary subunit information, which is obtained from the .pdb file from \url{PDB.org}.
\item The quaternary structural information for the entire assembly, which is obtained from the .pdb1 file from \url{PDB.org}.
\item The somatic mutation data, which is obtained from the Catalogue of Somatic Mutations in Cancer (\url{http://cancer.sanger.ac.uk/cancergenome/projects/cosmic/}).
\end{itemize}

In order to map the mutations onto the protein quaternary structure, an alignment must be performed.
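
As an illustration of the inputs listed above, the following sketch (not part of the package) builds the two structure file locations for a given PDB identifier and a toy mutation matrix for one subunit. The URL patterns are the ones used in Code Example 1; the matrix values are made up for illustration, and the object names \texttt{pdb.id} and \texttt{toy.mutations} are hypothetical.

\begin{verbatim}
# Hypothetical sketch: locations of the tertiary (.pdb) and
# quaternary (.pdb1) structure files for one PDB identifier.
pdb.id <- "1A6Z"
pdb.location <- sprintf("https://files.rcsb.org/view/%s.pdb", pdb.id)
assembly.location <- sprintf("https://files.rcsb.org/download/%s.pdb1", pdb.id)

# Toy mutation data (made-up values): 4 individuals x 6 residues,
# where 1 marks a missense mutation and 0 marks no mutation.
toy.mutations <- matrix(0L, nrow = 4, ncol = 6)
toy.mutations[cbind(c(1, 3, 4), c(2, 2, 5))] <- 1L

# Converting an unnamed matrix to a data frame yields the
# column names V1, V2, ..., V6.
toy.mutations <- as.data.frame(toy.mutations)
colnames(toy.mutations)

# Number of mutated individuals at each residue.
colSums(toy.mutations)
\end{verbatim}

Here rows correspond to individuals and columns to residues, so the column sums give the number of observed mutations at each residue position.
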
For each UniProt protein within the assembly, mutational data must be provided. The data are supplied as one $m \times n$ matrix per subunit. A ``1'' in the $(i,j)$ element indicates that residue $j$ for individual $i$ has a mutation, while a ``0'' indicates no mutation. To be compatible with this software, please ensure that your mutation matrices have R column headings of $V1, V2, \ldots, Vn$. Only missense mutations are currently supported; indels in the amino acid sequence are not. Sample mutational data are included in this package as text files in the \emph{extdata} folder.

It is worth noting that there is no single source from which to obtain mutational data. One common resource is the COSMIC database \url{http://cancer.sanger.ac.uk/cancergenome/projects/cosmic/}. The easiest way to obtain mutational data for many proteins is to load the COSMIC database on a local SQL server and then query the database for the protein of interest. It is important to restrict your query to whole gene screens or whole genome studies to prevent specific mutations from being selectively chosen (and thus violating the uniformity assumption that \iPAC, \GraphPAC, and \SpacePAC{} rely upon).

Should you find a bug, or wish to contribute to the code base, please contact the author.

\section{Identifying Clusters and Viewing the Remapping}
\label{Spheres}

The general principle of \QuartPAC{} is that we preprocess the data into a format that can be recognized by \iPAC, \GraphPAC{} and \SpacePAC. Most of this is automated, and all that is needed is to point the algorithm to the mutational and structural data. \QuartPAC{} will then reorganize the data, execute the cluster finding algorithms and report the results. The clusters are reported by serial number. As each serial number is unique in the assembly, the user can then map each serial number to the exact atom of interest in the structure.

Below we run the algorithm with no multiple comparison adjustment.
We do this to ensure that some clusters are found for each method. We also note that for \iPAC{} and \GraphPAC{}, if a multiple comparison adjustment is used and no clusters are found to be significant, the methods return a null value. For \SpacePAC{}, as no multiple comparison adjustment is needed, the most significant clusters are always shown, regardless of the p-value. This behavior follows the functionality of the previous three packages, so users familiar with the tertiary algorithms will find the results directly comparable.

For more information on the output, please see the \iPAC, \GraphPAC, and \SpacePAC{} packages, as the output is similar. The main difference is that the amino acid numbers now refer to the serial numbers within the *.pdb1 file.

\begin{verbatim}
Code Example 1: Running QuartPAC.
\end{verbatim}
\begin{small}
<<>>=
library(QuartPAC)

#read the mutational data
mutation_files <- list(
  system.file("extdata","HFE_Q30201_MutationOutput.txt", package = "QuartPAC"),
  system.file("extdata","B2M_P61769_MutationOutput.txt", package = "QuartPAC")
)
uniprots <- list("Q30201","P61769")
mutation.data <- getMutations(mutation_files = mutation_files, uniprots = uniprots)

#read the pdb file
pdb.location <- "https://files.rcsb.org/view/1A6Z.pdb"
assembly.location <- "https://files.rcsb.org/download/1A6Z.pdb1"
structural.data <- makeAlignedSuperStructure(pdb.location, assembly.location)

#Perform Analysis
#We use a very high alpha level here with no multiple comparison adjustment
#to make sure that each method shows a result.
#Lower alpha cut offs are typically used.
quart_results <- quartCluster(mutation.data, structural.data, perform.ipac = "Y",
                              perform.graphpac = "Y", perform.spacepac = "Y",
                              create.map = "Y", alpha = .3, MultComp = "None",
                              Graph.Title = "MDS Mapping to 1D Space",
                              radii.vector = c(1:3))
@
\end{small}

The MDS remapping plot is produced automatically by \QuartPAC{} when the \emph{create.map} parameter is set to ``Y''.
The plot is shown in Figure \ref{MDSRemap} below. For the \GraphPAC{} approach, the linear ``Jump Plot'' (see the \GraphPAC{} package for more details and interpretation) has been implemented and is shown in Figure \ref{GraphPACRemap} below. Feel free to contact the author if you want to assist in porting other graphing functionality. With regard to \SpacePAC{}, as there is no remapping from 3D to 1D space, a plotting option that shows the protein in its folded state is presented in Section \ref{visualizing}.

\begin{figure}[!h]
\begin{center}
\includegraphics[scale = .9]{QuartPAC-Example1.pdf}
\end{center}
\caption{Remapping performed by \iPAC.}
\label{MDSRemap}
\end{figure}

\begin{verbatim}
Code Example 2: Plotting the GraphPAC candidate path.
\end{verbatim}
\begin{center}
\begin{small}
<<>>=
Plot.Protein.Linear(quart_results$graphpac$candidate.path, colCount = 10,
                    title = "Protein Reordering to 1D Space via GraphPAC")
@
\end{small}
\end{center}

\begin{figure}[!h]
\begin{center}
\includegraphics[scale = 1]{QuartPAC-Example2.pdf}
\end{center}
\caption{Remapping performed by \GraphPAC.}
\label{GraphPACRemap}
\end{figure}
\FloatBarrier

\section{Using the Output}

Now that we have the results, suppose that we want to locate the clusters within the structure. For example, we see that the first cluster under the \SpacePAC{} method for the optimal combination has two spheres. One sphere is centered at the atom with serial number 1265 and the other is centered at the atom with serial number 367. To see which residues these correspond to, we can query the \emph{structural.data} list.

\begin{verbatim}
Code Example 3: Finding the residue of interest using the SpacePAC method.
\end{verbatim}
\begin{small}
<<>>=
#look at the results for the optimal sphere combinations under the SpacePAC approach
#For clarity we only look at columns 3 - 8 which show the sphere centers.
quart_results$spacepac$optimal.sphere[,3:8]

#Find the atom with serial number 1265
required.row <- which(structural.data$aligned_structure$serial == 1265)

#show the information for that atom
structural.data$aligned_structure[required.row,]
@
\end{small}

Similarly, suppose you wanted to look at the \iPAC{} results. The first cluster spans serial numbers 2583 to 2846. To get all the residue information for that block, we can do the following:

\begin{verbatim}
Code Example 4: Finding the residue of interest using the iPAC method.
\end{verbatim}
\begin{small}
<<>>=
#look at the results for the first cluster shown by the ipac method
quart_results$ipac

#Find the atoms with serial numbers within the range of 2583 to 2846
required.rows <- which(structural.data$aligned_structure$serial %in% (2583:2846))

#show the information for those atoms
structural.data$aligned_structure[required.rows,]
@
\end{small}

As the \GraphPAC{} results are in the same format as the \iPAC{} results, the approach for identifying the atoms in those clusters is identical to the example above.

\section{Visualizing the Results}
\label{visualizing}

Once you have the serial numbers of interest, you can view the results in any pdb visualization application of your choice. One common option is the \emph{PyMOL} software package \citep{PyMOL}. While it is not the purpose of this vignette to teach the reader \emph{PyMOL} syntax, we present the following simple example and the resulting figure for reference. It colors the first cluster reported by the \iPAC{} method (atoms with serial numbers 2583--2846) in blue, using the chain and resSeq information obtained in Code Example 4.

\begin{verbatim}
Code Example 5: PyMOL sample code
-----------------------------------
hide all
show cartoon,
show spheres, ///b/46/ca
show spheres, ///b/77/ca
color blue, ///b/46-77
label c. B and n. CA and i. 46, "(%s, %s)" % (resn, resi)
label c. B and n. CA and i. 77, "(%s, %s)" % (resn, resi)
set label_position, (3,2,10)
\end{verbatim}

\begin{center}
\includegraphics[scale=0.5]{1A6Z.png}
\end{center}

\bibliography{refs}{}
\bibliographystyle{plainnat}
\end{document}