\documentclass[a4paper]{article} \usepackage{a4wide} \setlength{\parskip}{0.7ex plus0.1ex minus0.1ex} \setlength{\parindent}{0em} \usepackage{times} \usepackage{hyperref} \usepackage{Sweave} \DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=2em} \DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=2em} \DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em} \fvset{listparameters={\setlength{\topsep}{0pt}}} \renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\software}[1]{\textsf{#1}} \newcommand{\R}{\software{R}} \usepackage{rotating} \title{MethylSeekR} \author{Lukas Burger, Dimos Gaidatzis, Dirk Sch\"ubeler and Michael Stadler} \date{Modified: July 25, 2013. Compiled: \today} \begin{document} \SweaveOpts{engine=R} %\VignetteIndexEntry{MethylSeekR} %\VignetteDepends{MethylSeekR, BSgenome.Hsapiens.UCSC.hg18, rtracklayer} %\VignetteKeywords{MethylSeekR, segmentPMDs, segmentUMRsLMRs} %\VignettePackage{MethylSeekR} \maketitle \tableofcontents \newpage \section{Introduction} This vignette describes how to use \textit{MethylSeekR} to find regulatory regions from whole-genome bisulfite-sequencing (Bis-seq) methylation data. The basic ideas of the method were introduced in \cite{Stadler:2011vn} and the validity of the approach demonstrated on Bis-seq data from mouse embryonic stem cells and neural progenitors. MethylSeekR represents a refined and extended version of the inital approach that makes it more robust and generally applicable and allows much simpler and interactive setting of segmentation parameters. For details regarding the methodology, we refer the reader to \cite{Burger2012}. If you use MethylSeekR in a publication, please cite \cite{Burger2012}. It is important to note that MethylSeekR has been designed for the segmentation of Bis-seq datasets with a minimal genome-wide coverage of 10X and we do NOT recommend its use for datasets with lower coverage. \section{Prerequisites} Several functions of \textit{MethylSeekR} require the genome sequence or derived information such as the chromsome lengths of the reference genome used. To retrieve this information, \textit{MethylSeekR} makes use of the \textit{BSgenome} package and the related genome data packages. \textit{BSgenome} can be installed from Bioconductor (\url{http://www.bioconductor.org}) using the following commands <>= if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install("BSgenome") @ After the installation, the available genomes can be displayed with <>= library(BSgenome) available.genomes() @ For the example data that comes along with the package, we need the hg18 assembly of the human genome \textit{BSgenome.Hsapiens.UCSC.hg18}. This genome, and analogously any of the other genomes, can be installed as follows <>= if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install("BSgenome.Hsapiens.UCSC.hg18") @ \section{Data import and SNP filtering} We start by loading the \textit{MethylSeekR} package <>= library(MethylSeekR) @ One step of the \textit{MethylSeekR} workflow depends on random permutations of the data (see below), which may lead to slightly different results each time the calculations are performed. It is thus recommended to set the random number generator seed at the beginning of the analysis to ensure exact reproducibility of the results. In R, this can be done e.g. as follows <>= set.seed(123) @ In a first step, the Bis-seq data needs to be loaded into R. The data should be in a tab-delimited file with four columns, \textit{chromosome}, \textit{position}, \textit{T} and \textit{M}, where \textit{position} is the coordinate of the cytosine in the CpG on the plus strand (we assume that the counts have been pooled from the two strands), \textit{T} is the total number of reads overlapping the cytosine and \textit{M} the total number of reads without cytosine to thymine conversion (reflecting methylated cytosines). An example file, containing the methylation information for $200,000$ CpGs from chromosome 22 of human IMR90 cells \cite{Lister:2009fk}, is part of the package and its path can be retrieved with the following command <>= system.file("extdata", "Lister2009_imr90_hg18_chr22.tab", package="MethylSeekR") @ The data can be loaded into R using the \texttt{readMethylome} function. This function takes two arguments, the name of the file containing the methylation data and a named vector containing the chromosome lengths of the reference genome. This vector can be easily created using the \textit{BSgenome} package of the reference genome. For the example data used here, we work with the human hg18 assembly. <>= library("BSgenome.Hsapiens.UCSC.hg18") sLengths=seqlengths(Hsapiens) head(sLengths) @ \texttt{readMethylome} will directly create a GRanges object containing the methylation data. GRanges objects (as defined in the \textit{GenomicRanges} package) represent general purpose containers for storing genomic intervals and allow for very efficient overlap/intersection operations. To use \textit{MethylSeekR}, it is not necessary to understand how GRanges objects work, and we refer the interested reader to \texttt{help(package="GenomicRanges")}. To read in the methylation data of our example data, we can use the following commands. <>= methFname <- system.file("extdata", "Lister2009_imr90_hg18_chr22.tab", package="MethylSeekR") meth.gr <- readMethylome(FileName=methFname, seqLengths=sLengths) head(meth.gr) @ Each element of the \texttt{meth.gr} object stands for one CpG and the corresponding metadata (columns on the right) contains its counts $T$ and $M$. In cases where the genomic background of the sample differs from the reference genome which the methylation data has been aligned against, CpGs overlapping SNPs should be removed from the methylation table \cite{Stadler:2011vn, Burger2012}. To this end, \texttt{MethylSeekR} provides functions to both load a SNP table and remove the SNP-overlapping CpGs from the methylation GRanges object (the following four code chunks can be skipped if the genomic background of the sample is identical to the reference genome). Similary to the methylation data, the SNP table can be loaded using the \texttt{readSNPTable} function. The SNP data should be in a tab-delimited text file with two columns corresponding to the chromosome and position of the SNPs. The path to an example SNP file with data downloaded from dbSNP \cite{Sherry:2001fk} for hg18 (and prefiltered to keep the file size small) can be retrieved as follows <>= system.file("extdata", "SNVs_hg18_chr22.tab", package="MethylSeekR") @ To read in the SNPs from our example file, we can thus use the following commands <>= snpFname <- system.file("extdata", "SNVs_hg18_chr22.tab", package="MethylSeekR") snps.gr <- readSNPTable(FileName=snpFname, seqLengths=sLengths) head(snps.gr) @ Here, \texttt{seqLengths} is the same named vector with chromsome lengths as needed for the \texttt{readMethylome} function. \texttt{readSNPTable} directly returns a GRanges object containing the coordinates of the SNPs. Using this object, the CpGs overlapping SNPs can now be removed from the methylation object using the \texttt{removeSNPs} function. <<>>= meth.gr <- removeSNPs(meth.gr, snps.gr) @ \section{Segmentation of partially methylated domains} Some methylomes contain large regions of seemingly disordered methylation which have been termed partially methylated domains (PMDs, \cite{Lister:2009fk}). These regions need to be identified and masked prior to the segmentation of regulatory regions. To decide whether a methylome contains PMDs, we can first calculate the distribution of $\alpha$-values for one selected chromosome. $\alpha$ characterizes the distribution of methylation levels in sliding windows containing 100 consecutive CpGs along the genome. $\alpha$-values smaller than 1 reflect a polarized distribution, favoring low and high methylation (as in regulatory regions and background methylation levels respectively), whereas $\alpha$-values equal to 1 or larger indicate distributions that are rather uniform or polarized towards intermediate methylation levels (as in PMDs). The distribution of $\alpha$-values for one selected chromosome can be determined as follows, <>= plotAlphaDistributionOneChr(m=meth.gr, chr.sel="chr22", num.cores=1) @ \texttt{chr.sel} denotes the chromosome for which the statistic is calculated. Generally, the $\alpha$ distribution of different chromosomes are very similar, but we recommend using a small autosome to minimize the computing time and to avoid sex-specific biases. \texttt{plotAlphaDistributionOneChr} generates a figure showing the inferred $\alpha$ distribution. This figure is by default printed to the screen but can also be saved in a pdf file if a filename is provided via the \texttt{pdfFilename} argument. \texttt{num.cores} is the number of cores that are used for the calculation. By default, all functions of \textit{MethylSeekR} perform their calculations on a single core. However, almost all functions can be run in parallel (by setting \texttt{num.cores} to a number larger than one, only availabe on unix and mac) and we recommend the user to use multiple cores in order to speed up the calculations. To determine the number of cores of the machine used, use the following commands <>= library(parallel) detectCores() @ Generally, there is evidence for PMDs if the distribution of $\alpha$-values is bimodal or long-tailed with a significant fraction of $\alpha$ values larger or equal $1$. In the example here, the distribution is clearly bimodal and we thus, in a next step, run a Hidden Markov Model (HMM) to identify PMDs genome-wide (in cases where the $\alpha$ distribution is unimodal (for examples see \cite{Burger2012}), there is no evidence for PMDs and the following three code chunks can be omitted). <>= PMDsegments.gr <- segmentPMDs(m=meth.gr, chr.sel="chr22", seqLengths=sLengths, num.cores=1) head(PMDsegments.gr) @ Here, as above, \texttt{chr.sel} is the chromosome used to train the HMM. \texttt{segmentPMDs} generates a GRanges object which partitions the genome into PMDs and regions outside of PMDs (notPMD) as indicated in the first column of the metadata (\texttt{type}). The second column (\texttt{nCG}) of the metadata indicates the number of covered (by at least one read) CpGs in each region. The function creates a figure (which is by default printed to the screen or saved in a pdf file if a filename is provided via the \texttt{pdfFilename} argument) which shows the same alpha distribution as in the previous figure, but including the fitted Gaussian emission distributions of the HMM in red and green. If the HMM fitting works properly, the two Gaussians should correspond to the two peaks of the bimodal distribution (or the main peak and the tail if the alpha distribution is long-tailed), as illustrated for the example dataset. As a control, we may want to recalculate the distribution of $\alpha$-values using only the methylation values outside of the predicted PMDs. <>= plotAlphaDistributionOneChr(m=subsetByOverlaps(meth.gr, PMDsegments.gr[values(PMDsegments.gr)$type=="notPMD"]), chr.sel="chr22", num.cores=1) @ Clearly, we have managed to remove the second mode of the distribution. Such a unimodal distribution with most $\alpha$-values below 1 is also what is typically seen in methylomes without PMDs. Note that the distribution of $\alpha$-values is not only influenced by the presence of PMDs, but also by the variability in background methylation levels outside of regulatory regions. In methylomes that do not contain PMDs, but show large variability, a substantial fraction of $\alpha$-values (10-20\%) may be larger than one. However, if there are no PMDs, the shape of the $\alpha$-distribution should still be unimodal. It is thus rather the shape of the distribution (bimodal versus unimodal) that indicates the presence or absence of PMDs and not the fraction of $\alpha$-values larger 1. In any case, the resulting PMD segmentation should be visually inspected to assess its validity. This can easily be done for randomly chosen regions of the genome using the \texttt{plotPMDSegmentation} function. <>= plotPMDSegmentation(m=meth.gr, segs=PMDsegments.gr) @ \centerline{\includegraphics[angle=90, scale=1.8]{MethylSeekR-myfig1}} The resulting figure is either printed to the screen (default) or saved in a pdf file if a filename is provided via the \texttt{pdfFilename} argument. It is also possible to plot multiple regions and save them in a multi-page pdf file if both the argument \texttt{pdfFilename} and \texttt{numRegions} are provided. The randomly chosen region that is displayed is broken up into 6 panels and in each panel, the raw (ie unsmoothed) methylation levels of all CpGs with a minimal coverage of 5 reads are shown. PMDs are indicated as green bars, extending over the entire PMD. The PMD segmentation can be saved both as a GRanges object and as a tab-delimited file. <>= savePMDSegments(PMDs=PMDsegments.gr, GRangesFilename="PMDs.gr.rds", TableFilename="PMDs.tab") @ \texttt{GRangesFilename} is the filename for the GRanges object and the \texttt{TableFilename} the filename for the tab-delimited file (only of the two filenames is required). \section{Identification of UMRs and LMRs} Once we have identified PMDs (if there are any), we can proceed to the segmentation of regulatory regions. These come in two distinct groups, unmethylated regions (UMRs) which correspond mostly to unmethylated CpG islands (including promoters) and low-methylated regions (LMRs) which correspond to distal regulatory regions \cite{Stadler:2011vn, Burger2012}. We identify these regions by a fairly straightforward procedure. We first smooth methylation levels over $3$ consecutive CpGs to reduce the sampling noise and then identify hypo-methylated regions as regions of consecutive CpGs with smoothed methylation levels below a fixed cut-off $m$, containing a minimal number of CpGs $n$. The identified regions are then further classified into UMRs and LMRs based on the number of CpGs they contain. The parameters $m$ and $n$ must be determined prior to segmentation, which can be done via false discovery rate (FDR) considerations. This approach will naturally take into account the variability of methylation levels in the methylome studied. Briefly, for each set of parameters ($m$, $n$), we determine the number of regions with methylation levels below $m$ and with a minimal number of CpGs $n$ in both the original methylome, $n_o$, and a randomized methylome, $n_r$. The FDR is then defined as the ratio $\frac{n_r}{n_o}$. Given a user-defined cut-off on the FDR, we can then determine reasonable sets of values for $m$ and $n$. To calculate FDRs with \texttt{MethylSeekR}, we can use the function \texttt{calculateFDRs}. Since the FDR calculation is based on the randomization of methylation levels outside of CpG islands \cite{Burger2012}, we first need to create a GRanges object containing the CpG islands, which is then passed on to \texttt{calculateFDRs}. CpG island annotation can for example be retrieved from UCSC using the package \texttt{rtracklayer} <>= library(rtracklayer) session <- browserSession() genome(session) <- "hg18" query <- ucscTableQuery(session, "cpgIslandExt") CpGislands.gr <- track(query) genome(CpGislands.gr) <- NA @ If other genome assemblies than hg18 are used, hg18 in the code above needs to be replaced by the corresponding assembly. Next, we extend the CpG islands to 5kb to make sure that all unmethylated CpGs lying in CpG islands are excluded from the FDR calculation. %results=hide <>= CpGislands.gr <- suppressWarnings(resize(CpGislands.gr, 5000, fix="center")) @ We can now calculate the FDRs <>= stats <- calculateFDRs(m=meth.gr, CGIs=CpGislands.gr, PMDs=PMDsegments.gr, num.cores=1) stats @ %\centerline{\includegraphics[scale=2]{MethylSeekR-myfig2}} Since the FDR calculation also masks PMDs, the PMDs need to be passed to the function as well. In the case of methylomes without PMDs, the \texttt{PMDs} argument can be omitted. The function creates a figure (which is, again, by default printed to the screen or saved in a pdf file if a filename is provided via the \texttt{pdfFilename} argument) which illustrates the relationship between FDR (cut at 20\% if larger than 20\%), $m$, $n$ and the number of identified segments. Additionally, the function returns a list containing a matrix with FDR values (in percent) and a matrix with the number of inferred segments in the original methylome for each $m$ (rows) and $n$ (columns). Since the parameters $m$ and $n$ are partially redundant, we need to set one parameter by hand. $m=0.5$ ($50\%$ methylation) appears to be a reasonable choice for most methylomes (leading to high accuracy at a good sensitivity when overlapping the identified regions with DNase hypersensitive sites \cite{Stadler:2011vn, Burger2012}) and we thus set $m=0.5$ and determine $n$ by choosing the smallest $n$ such that the FDR is below a given cut-off (here $5\%$). <>= FDR.cutoff <- 5 m.sel <- 0.5 n.sel=as.integer(names(stats$FDRs[as.character(m.sel), ] [stats$FDRs[as.character(m.sel), ]>= UMRLMRsegments.gr <- segmentUMRsLMRs(m=meth.gr, meth.cutoff=m.sel, nCpG.cutoff=n.sel, PMDs=PMDsegments.gr, num.cores=1, myGenomeSeq=Hsapiens, seqLengths=sLengths) head(UMRLMRsegments.gr) @ If the methylome under consideration contains no PMDs, the \texttt{PMDs} argument can be omitted. The argument \texttt{myGenomeSeq} takes as input a BSgenome object of the genome assembly used, here Hsapiens, as defined in the package \texttt{BSgenome.Hsapiens.UCSC.hg18} (the genome object is directly loaded with the package and corresponds to the species name in the package name). The genome sequence is needed to retrieve the number of CpGs in the genome per identified region, which is further used to classifiy the regions into UMRs and LMRs. \texttt{segmentUMRsLMRs} returns a GRanges object containing the identified regions. The metadata includes the number of covered CpGs (by at least $5$ reads) per region ($nCG.segmentation$) , the number of CpGs in the genome per region ($nCG$), the total number of reads overlapping CpGs ($T$), the total number of tags overlapping CpGs without conversion event ($M$), the mean methylation ($pmeth=\frac{M}{T}$), the median methylation ($median.meth$, median over the methylation levels of each single CpG) and $type$, which classifies the region into UMR or LMR. For the calculation of $T$, $M$, $pmeth$ and $median.meth$, only CpG that are covered by at least $5$ reads are used. \texttt{segmentUMRsLMRs} also creates a figure (by default printed to the screen or saved in a pdf file if a filename is provided via the \texttt{pdfFilename} argument) which shows the number of CpGs per region versus its median methylation level. Typically, as in the example here, this figure should show a clear separation into unmethylated, CpG-rich regions (UMRs) versus low-methylated, CpG-poor regions (LMRs). The dashed line is at 30 CpGs and represents the cut-off used for the classification. Analogously to the PMD segmentation, the result of this final segmentation can be visualized for a randomly selected region. <>= plotFinalSegmentation(m=meth.gr, segs=UMRLMRsegments.gr, PMDs=PMDsegments.gr,meth.cutoff=m.sel) @ If the methylome under study does not contain any PMDs, the \texttt{PMDs} argument can be omitted. The resulting figure is either printed to the screen (default) or saved in a pdf file if a filename is provided via the \texttt{pdfFilename} argument. It is also possible to plot multiple regions and save them in a multi-page pdf file if both the argument \texttt{pdfFilename} and \texttt{numRegions} are provided. The randomly chosen region that is displayed is broken up into 3 pairs of panels, where in each pair the same region is shown twice, once with raw methylation levels (top) and once with methylation levels smoothed over 3 consecutive CpGs (bottom). In both cases only CpGs with a coverage of at least $5$ reads are shown. The raw data best illustrates the disordered nature of methylation levels in PMDs, whereas the smoothed methylation levels more clearly show UMRs and LMRs. In all figures, UMRs are shown as blue squares (placed at the middle of the identified segment), LMRs as red triangles (placed at the middle of the identified segment) and PMDs as green bars (extending over the entire PMD). The cut-off on methylation to determine UMRs and LMRs is shown as a grey dashed line. Note that UMRs and LMRs can partially overlap with PMDs as only UMRs and LMRs that are fully contained in PMDs are filtered out by the algorithm. \centerline{\includegraphics[angle=90, scale=1.8]{MethylSeekR-myfig3}} Finally, the segmentation of UMRs and LMRs can be saved both as a GRanges object and a tab-delimited file as follows, <>= saveUMRLMRSegments(segs=UMRLMRsegments.gr, GRangesFilename="UMRsLMRs.gr.rds", TableFilename="UMRsLMRs.tab") @ Here, \texttt{GRangesFilename} is the filename for the GRanges object and the \texttt{TableFilename} the filename for the tab-delimited file (only one of the two filenames is required). \bibliography{refs} \bibliographystyle{unsrt} %\bibliographystyle{plain} \end{document}