% \VignetteIndexEntry{HMMcopy}
% \VignetteDepends{HMMcopy, data.table}
% \VignetteKeywords{Readcount GC Mappability Correction Normalization HTS}
% \VignettePackage{HMMcopy}

\documentclass[letterpaper]{article}
\usepackage{fullpage}
\usepackage{url}

\title{HMMcopy: A package for bias-free copy number estimation and robust
CNA detection in tumour samples from WGS HTS data}
\author{Daniel Lai and Gavin Ha}
\date{\today}

\begin{document}
\setkeys{Gin}{width=\textwidth}
\maketitle
\tableofcontents

\section{Introduction}

High-throughput whole genome DNA sequencing results in millions of reads
which, after short-read mapping, can be used to infer the original sequence
of the sample. An additional piece of information we gain from mapping is
the coverage, or number of reads overlapping each position, which can be
used as a rough estimate of copy number. In practice, it is often easier to
work at a lower resolution, so we divide the genome into non-overlapping
windows of fixed width (\textit{e.g.} 1000 bases) and use the
\textit{readcount}, or number of reads starting in each of these windows
(henceforth termed \textit{bins}).

The main assumption in using readcount as a proxy for copy number is that
during the sequencing protocol, reads are sampled uniformly from all genetic
material present in the original input sample. This assumption has been
shown to be incorrect, with reads being sampled at unequal rates depending
on GC content, among other factors \cite{Benjamini}. This bias exists, to
differing extents, in all platforms, protocols, and samples. Additionally,
Benjamini and Speed \cite{Benjamini} show that two libraries prepared from
the same DNA source exhibit different biases, which highlights the fact that
matched-sample normalization (\textit{i.e.} normalizing tumour copy number
by a matched normal reference) is insufficient to eliminate this bias.

\section{Generating Copy Number Profiles}

For the remainder of this vignette, we will go through a short example using
a matched tumour-normal dataset, and show the ease and effects of
normalizing readcounts by GC content and mappability. The dataset described
here belongs to a female triple-negative breast cancer patient from the
published dataset \cite{Ha2012,Shah2012}. This genome library was sequenced
on the ABI/Life SOLiD platform, generating paired-end reads of hybrid
lengths (25bp and 50bp). The reads were aligned using BioScope \\
(\url{https://products.appliedbiosystems.com/}); reads mapped to multiple
sites were ignored.

\subsection{Obtaining HTS copy number data}

It should be noted that, due to memory and speed considerations, the raw
inputs fed into this package are required to be generated by programs
distributed as part of the \textit{HMMcopy Suite}.

In short, the suite has tools to:
\begin{itemize}
  \item Obtain high resolution bin counts for large ($\approx$ 250GB) BAM
    files within a few hours.
  \item Obtain GC content for bins from standard FASTA files within minutes
    for a human genome.
  \item Obtain average mappability for bins from BigWig files within
    minutes, or from FASTA files within a day, for a human genome.
\end{itemize}

\subsection{Loading HTS copy number data}

To show the effects of correction in a simple case, let us begin by
correcting data obtained from a normal genome sample. To start, we load in
WIG files containing three sets of data, namely readcounts, GC content, and
average mappability, pre-computed for each bin with external applications
provided by the \textit{HMMcopy Suite}.
Briefly, the readcounts data consists of the number of reads mapped within
the boundaries of each bin. The GC content file consists of the G and C base
proportions for each bin; -1 is used when the reference sequence for the bin
contains at least one unknown nucleotide base (\textit{i.e.} N, rather than
A, C, T or G). The mappability file, which was generated using the
\emph{ENCODE Duke Uniqueness of 35bp sequences} track from UCSC \\
(\url{http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg18&g=wgEncodeMapability}),
provides scores indicating how mappable the reference sequence at each bin
is to aligners; the higher the score, the more mappable the sequence.

<<>>=
options(stringsAsFactors = TRUE)
library(HMMcopy)
rfile <- system.file("extdata", "normal.wig", package = "HMMcopy")
gfile <- system.file("extdata", "gc.wig", package = "HMMcopy")
mfile <- system.file("extdata", "map.wig", package = "HMMcopy")
normal_reads <- wigsToRangedData(rfile, gfile, mfile)
normal_reads[1000:1010, ]
@

\subsection{Correcting HTS copy number data}

The resultant output is a data.table object, which can easily be used with
other Bioconductor functions and packages for analysis and visualization.
It is recommended, however, that the output be left untouched until it has
been corrected as follows:

<<>>=
normal_copy <- correctReadcount(normal_reads)
normal_copy[1000:1010, ]
@

The corrected reads are in the two columns \verb|cor.gc| and \verb|cor.map|,
which are corrected and normalized, effectively a copy number estimate for
each bin. For downstream analysis, \verb|copy| is used, which is simply the
base 2 logarithm of \verb|cor.map|, to normalize the differences between
values above and below 1.

Specifically, the columns output from this process are:
\begin{itemize}
  \item[valid] Bins with valid GC content and average mappability values,
    and a non-zero readcount
  \item[ideal] Valid bins of high mappability whose readcounts are not
    outliers
  \item[cor.gc] Readcounts after the first GC correction step
  \item[cor.map] \verb|cor.gc| readcounts after a further mappability
    correction
  \item[copy] \verb|cor.map| transformed into log2 space
\end{itemize}

\subsection{Visualizing the effects of correction}

To best visualize the biases in the original sample and the effects of
correction, convenience plotting functions are available as follows:

<<fig=TRUE>>=
par(cex.main = 0.7, cex.lab = 0.7, cex.axis = 0.7, mar = c(4, 4, 2, 0.5))
plotBias(normal_copy, pch = 20, cex = 0.5)
@

As observed, a unimodal relationship typically exists between readcount and
GC content (excuse the weak tail), which is corrected and eliminated in the
GC correction step. The GC-corrected readcounts exhibit a slight correlation
with mappability, which is corrected in the second step.

\subsection{Visualizing corrected copy number profiles}

To see the effects of correction on copy number estimation, we can plot the
copy number along a segment of a chromosome as follows:

<<fig=TRUE>>=
par(mar = c(4, 4, 2, 0))
plotCorrection(normal_copy, pch = ".")
@

In the top track, the original uncorrected reads are used to estimate copy
number (by simply dividing by the median value), followed by tracks with
additional GC and then mappability correction. The median absolute deviation
(MAD), measuring the variability of each track, can be seen to decrease
after each correction step.
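For readers who want MAD values for each track rather than reading them off
the plot, a minimal sketch of one way to compute them is shown below. It
assumes the column names \verb|reads|, \verb|cor.gc|, \verb|cor.map| and
\verb|ideal| produced above; the exact values displayed by
\verb|plotCorrection| may be computed slightly differently, and the chunk is
not evaluated here.

<<eval=FALSE>>=
# Rough sketch (not part of the standard workflow): median absolute
# deviation of each track, restricted to the "ideal" bins.  Column names
# are assumed to be those created by wigsToRangedData()/correctReadcount().
ideal <- which(normal_copy$ideal)
mad(normal_copy$reads[ideal] / median(normal_copy$reads[ideal]), na.rm = TRUE)
mad(normal_copy$cor.gc[ideal], na.rm = TRUE)
mad(normal_copy$cor.map[ideal], na.rm = TRUE)
@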
\subsection{Correcting and visualizing tumour copy number profiles}

The identical process can be applied to a tumour genome using the same GC
content and mappability files, as follows:

<<fig=TRUE>>=
tfile <- system.file("extdata", "tumour.wig", package = "HMMcopy")
tumour_copy <- correctReadcount(wigsToRangedData(tfile, gfile, mfile))
par(mar = c(4, 4, 2, 0))
plotCorrection(tumour_copy, pch = ".")
@

Note that at this stage, the correction of the normal and tumour samples was
done \emph{independently}, meaning that \textbf{HMMcopy is usable with both
single-sample and matched-sample libraries.}

\section{Segmentation and Classification of Copy Number Profiles}

While corrected copy number data points are pleasing to stare at, they are
of much more practical use if we can group and classify them into segments
corresponding to specific copy number variation events. Included in the
package is an implementation of a Hidden Markov Model \cite{shah} which does
two main things:

\begin{enumerate}
  \item Segments the copy number profile into regions predicted to be
    generated by the same copy number variation event
  \item Predicts the copy number variation event for each segment, based on
    the biological likelihood of the event occurring
\end{enumerate}

<<>>=
tumour_segments <- HMMsegment(tumour_copy)
@

\subsection{Visualizing segments and classified states}

We can easily visualize both the resultant segments and the states assigned
to each segment as follows:

<<fig=TRUE>>=
par(mfrow = c(1, 1))
par(cex.main = 0.5, cex.lab = 0.5, cex.axis = 0.5, mar = c(2, 1.5, 0, 0),
  mgp = c(1, 0.5, 0))
plotSegments(tumour_copy, tumour_segments, pch = ".",
  ylab = "Tumour Copy Number", xlab = "Chromosome Position")
cols <- stateCols() # 6 default state colours
legend("topleft", c("HOMD", "HETD", "NEUT", "GAIN", "AMPL", "HLAMP"),
  fill = cols, horiz = TRUE, bty = "n", cex = 0.5)
@

As observed, there are six coloured states, corresponding to six biological
CNA events:
\begin{itemize}
  \item[HOMD] Homozygous deletion, $\leq$ 0 copies
  \item[HETD] Heterozygous deletion, 1 copy
  \item[NEUT] Neutral change, 2 copies
  \item[GAIN] Gain of chromosome, 3 copies
  \item[AMPL] Amplification event, 4 copies
  \item[HLAMP] High level amplification, $\geq$ 5 copies
\end{itemize}

Segments are seen as neon green lines running through regions estimated to
belong to the same CNA event.

\subsection{Improving segmentation performance}

The segmentation program does its best to produce the most biologically
likely state assignment but, unfortunately and fortunately for us humans,
the automated program can still fail, as seen in this specific example. In
cases like these, there is a need to adjust the parameters given to the
algorithm. Instead of building parameters from scratch, however, one can
easily retrieve the default set of parameters as follows:

<<>>=
default_param <- HMMsegment(tumour_copy, getparam = TRUE)
default_param
@

By calling the same segmentation algorithm with the \verb|getparam| argument
set to \verb|TRUE|, the algorithm generates and returns the parameters
without actually running the segmentation. There are 10 parameters,
explained in detail in the documentation (\textit{i.e.} \verb|?HMMsegment|),
each taking a value for each of the six states (one row per state).

\subsection{Reducing the number of segments}

To reduce the number of segments, the two parameters we need to adjust are
\verb|e| and \verb|strength|, which are, respectively, the suggested
probability of staying in a segment and the strength of that suggestion to
the algorithm.
Once set, we run the segmentation algorithm again, this time explicitly
providing our new parameters.

<<>>=
longseg_param <- default_param
longseg_param$e <- 0.999999999999999
longseg_param$strength <- 1e30
longseg_segments <- HMMsegment(tumour_copy, longseg_param, verbose = FALSE)
@

And finally, we can visualize the results to confirm a decrease in the
number of segments, as intended.

<<fig=TRUE>>=
par(cex.main = 0.5, cex.lab = 0.5, cex.axis = 0.5, mar = c(2, 1.5, 0, 0),
  mgp = c(1, 0.5, 0))
plotSegments(tumour_copy, longseg_segments, pch = ".",
  ylab = "Tumour Copy Number", xlab = "Chromosome Position")
legend("topleft", c("HOMD", "HETD", "NEUT", "GAIN", "AMPL", "HLAMP"),
  fill = cols, horiz = TRUE, bty = "n", cex = 0.5)
@

As one can imagine, if for some odd reason you would like to increase the
number of segments, simply decrease \verb|e| and \verb|strength| to small
values (close to 0), although the algorithm will almost certainly take
several times longer to run.

\subsection{Adjusting copy number state ranges}

A second, more subtle and debilitating problem seen in this profile is the
incorrect median of each copy number state. Specifically, this is a problem
with the \verb|mu| parameter. Accessing one of the output values of the
segmentation process:

<<>>=
longseg_segments$mus
@

We see the medians of the six states (rows) after each iteration of the
optimization algorithm (columns). The first column is our initial suggested
\verb|mu|:

<<>>=
longseg_param$mu
@

And the last column contains the actual values used during the segmentation
process, which we can visualize as follows:

<<fig=TRUE>>=
par(cex.main = 0.5, cex.lab = 0.5, cex.axis = 0.5, mar = c(2, 1.5, 0, 0),
  mgp = c(1, 0.5, 0))
plotSegments(tumour_copy, longseg_segments, pch = ".",
  ylab = "Tumour Copy Number", xlab = "Chromosome Position")
for(i in 1:nrow(longseg_segments$mus)) {
  abline(h = longseg_segments$mus[i, ncol(longseg_segments$mus)],
    col = cols[i], lwd = 2, lty = 3)
}
abline(v = 7.68e7, lwd = 2, lty = 3)
abline(v = 8.02e7, lwd = 2, lty = 3)
@

From this, it becomes clear that for many of the states the medians are not
running through the middle of the visually obvious segments. For example,
the red medians for \verb|GAIN|, \verb|AMPL| and \verb|HLAMP| are much too
high, and the green medians for \verb|HOMD| and \verb|HETD| are much too
close together. Finally, there is a clear segment flanked by vertical black
dashed lines near the centre of the chromosome that is stuck between the
medians for green \verb|HETD| and blue \verb|NEUT|, when it should
definitively belong to one or the other.

Staring at such a diagram, it is easy to suggest new \verb|mu| parameters,
which can be put back into the segmentation algorithm and visualized once
again as follows:

<<fig=TRUE>>=
newmu_param <- longseg_param
newmu_param$mu <- c(-0.5, -0.4, -0.15, 0.1, 0.4, 0.7)
newmu_segments <- HMMsegment(tumour_copy, newmu_param, verbose = FALSE)
par(cex.main = 0.5, cex.lab = 0.5, cex.axis = 0.5, mar = c(2, 1.5, 0, 0),
  mgp = c(1, 0.5, 0))
plotSegments(tumour_copy, newmu_segments, pch = ".",
  ylab = "Tumour Copy Number", xlab = "Chromosome Position")
@

\subsubsection{Understanding parameter convergence}

The observant reader will notice that nothing has changed. This was done
purposely to highlight that the parameters you set are often just
\emph{suggestions}; given sufficient flexibility, the algorithm will
effectively ignore them.
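As a quick numeric check of this behaviour (a sketch added here for
illustration, using only the \verb|mus| matrices already produced above),
the converged state means of the new run can be compared directly against
those of the earlier long-segment run:

<<>>=
# Difference between the converged (final-iteration) state means of the
# run started from our suggested mu values and those of the earlier
# long-segment run; values near zero indicate the suggestion was overridden.
round(newmu_segments$mus[, ncol(newmu_segments$mus)] -
  longseg_segments$mus[, ncol(longseg_segments$mus)], 4)
@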
As seen below in \verb|newmu_segments|, after several iterations our newly
suggested values of \verb|mu| (first column) effectively converge (in the
last column) to the same default values of \verb|mu| found in
\verb|longseg_param|.

<<>>=
newmu_segments$mus
longseg_param$mu
@

\subsubsection{Overriding parameter convergence}

The solution is to simply disallow the algorithm from making large shifts to
\verb|mu|, which we can achieve by setting the prior mean of \verb|mu|
(\textit{i.e.} \verb|m|) to values identical to \verb|mu|.

<<fig=TRUE>>=
par(cex.main = 0.5, cex.lab = 0.5, cex.axis = 0.5, mar = c(2, 1.5, 0, 0),
  mgp = c(1, 0.5, 0))
newmu_param$m <- newmu_param$mu
realmu_segments <- HMMsegment(tumour_copy, newmu_param, verbose = FALSE)
plotSegments(tumour_copy, realmu_segments, pch = ".",
  ylab = "Tumour Copy Number", xlab = "Chromosome Position")
@

\section{Matched Tumour-Normal Sample Correction}

So far, our entire process has been applied to single samples, providing
respectable results and aesthetically pleasing profiles. However, when given
matched tumour and normal pairs, we can improve our copy number results in
two ways:

\begin{enumerate}
  \item We can divide the tumour copy number profile by the normal copy
    number profile to eliminate any germline copy number variation (CNV)
    events present in both profiles. This leaves only somatic copy number
    aberration (CNA) events in the tumour.
  \item A dramatic reduction in noise is also observed, further improving
    the copy number profile.
\end{enumerate}

\subsection{Normalizing tumour by normal copy number profiles}

The normalization is a simple bin-wise division of tumour copy number by
normal copy number or, equivalently, a subtraction of the values in log
space, as follows:

<<>>=
somatic_copy <- tumour_copy
# LOGARITHM IDENTITY: log(a) - log(b) == log(a / b)
somatic_copy$copy <- tumour_copy$copy - normal_copy$copy
@

We can then perform the same segmentation and visualization as follows:

<<fig=TRUE>>=
somatic_segments <- HMMsegment(somatic_copy, newmu_param, verbose = FALSE)
par(cex.main = 0.5, cex.lab = 0.5, cex.axis = 0.5, mar = c(2, 1.5, 0, 0),
  mgp = c(1, 0.5, 0))
plotSegments(somatic_copy, somatic_segments, pch = ".",
  ylab = "Tumour Copy Number", xlab = "Chromosome Position")
@

The resultant corrected reads can easily be analyzed and exported with other
R and Bioconductor packages.

\section{Session Information}

The version numbers of R and the packages loaded when generating this
vignette were:

<<results=tex>>=
toLatex(sessionInfo())
@

\bibliographystyle{plain}
\bibliography{HMMcopy}

\end{document}