%\VignetteIndexEntry{Scale4C: an R/Bioconductor package for scale-space transformation of 4C-seq data} %\VignettePackage{Scale4C} \documentclass[10pt]{article} \usepackage{times} \usepackage{hyperref} \usepackage[utf8]{inputenc} \textwidth=6.2in \textheight=8.5in \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \SweaveOpts{prefix.string=Scale4C} \begin{document} \SweaveOpts{concordance=TRUE} \title{Scale4C: an R/Bioconductor package for scale-space transformation of 4C-seq data} \author{Carolin Walter} \date{\today} \maketitle \tableofcontents <>= options(width=90) options(continue=" ") @ \section{Introduction} \Rpackage{Scale4C} uses Witkin's scale-space filtering to describe a 4C-seq signal (chromosome conformation capture combined with sequencing) qualitatively \cite{Witkin01}. To this extend, the original data signal is smoothed with Gauss kernels of increasing $\sigma$ ("sigma"). Inflection points (i.e. the borders between different features) can be tracked through the resulting 2D matrix, or 'fingerprint map', which shows the position of a fragment on the x-axis and a range of smoothing $\sigma$ on the y-axis. Each entry in the matrix encodes whether there is an inflection point at a certain position for a given $\sigma$ (and if so, whether it's a switch from convex to concave or vice versa), or not. In this context, a 'singularity' is defined as a singular point in the fingerprint map, which means that both borders of the corresponding feature are identical for the singular point's $\sigma$. In case of Gauss kernel smoothing, this is only possible when a feature ('peak' / 'hill' or 'valley') appears or disappears, depending on whether the smoothing factor $\sigma$ is decreased or increased \cite{Witkin01}. Tracing singularities down their inflection point contours through the fingerprint map allows for a precise localization of distorted, strongly smoothed features. The singularities and localization information can then be transformed into a simple tree structure. Each singularity hereby marks the point where three children spawn from one parent feature. Intuitively (and, in case of Gauss kernels, correctly \cite{Witkin01, Lee01}), one would assume that a peak is surrounded by two valleys and vice versa, and that when a smaller peak is smoothed out, the valleys around its former position will form one new valley, and that's what is reflected in the tree. All children of a chosen parent start at the same $\sigma$, however, their range of $\sigma$ for which they persist can vary among the children. These ternary trees can be visualized in a 2D tesselation map, which allows to assess prominent features with a high degree of stability for multiple values of $\sigma$ visually. \subsection{Loading the package} After installation, the package can be loaded into R by typing <>= library(Scale4C) @ into the R console. \\ \subsection{Provided functionality} \Rpackage{Scale4C} requires the R-packages \Rpackage{smoothie}, \Rpackage{IRanges}, \Rpackage{SummarizedExperiment} and \Rpackage{GenomicRanges}. This package provides the following basic functions for the scale-space transformation of 4C-seq data: \\ \begin{itemize} \item \Rfunction{calculateFingerprintMap}: first of two central functions; computes the inflection points for the signal when smoothed with different Gauss kernels \item \Rfunction{findSingularities}: second of two central functions; identifies and tracks singularities through the fingerprint map \item \Rfunction{plotTesselation}: plots the final tesselation of the scale space for the given 4C-seq signal, as provided by the singularities and tracking results \end{itemize} \centerline{} Optional functions include:\\ \begin{itemize} \item \Rfunction{importBasic4CseqData}: transforms \Rpackage{Basic4Cseq}'s fragment-based table output for use with \Rpackage{Scale4C} \item \Rfunction{plotTraceback}: plots the original fingerprint map and traceback results for a given 4C-seq signal \item \Rfunction{plotInflectionPoints}: plots the smoothed 4C-seq signal for one chosen $\sigma$, with or without matching inflection points \item \Rfunction{outputScaleSpaceTree}: outputs a \Rpackage{GenomicRanges} object, including the size (i.e. range of $\sigma$) for detected features, based on the singularity-derived scale-space tree data \end{itemize} In addition to the examples presented in this vignette, more detailed information on the functions' parameters and additional examples are presented in the corresponding manual pages. \section{Scale-space transformation of 4C-seq data} Similar to \Rpackage{Basic4Cseq}, \Rpackage{Scale4C} includes fetal liver data of \cite{Stadhouders01} to demonstrate central functions. Due to size limits and performance issues, only a subset of its near-cis fragment data is included. \subsection{Import of raw fragment data} \Rpackage{Scale4C} expects a \Rclass{GRanges} object as input, which includes chromosome name, start, end, "reads" and "meanPosition" (mean of start and end). If you have got an appropriate data frame or similar, the \Rpackage{GenomicRanges} contains the relevant conversion functions. Since \Rpackage{Basic4Cseq} already offers an output function for fragment data, albeit with more information than \Rpackage{Scale4C} will ever need, a converter function is included in this package that can read \Rpackage{Basic4Cseq}'s output tables and extract the relevant data. Per default, this is the fragment data within a certain distance from the viewpoint of the experiment. <>= csvFile <- system.file("extdata", "liverData.csv", package="Scale4C") liverReads <- importBasic4CseqData(csvFile, viewpoint = 21160072, viewpointChromosome = "chr10", distance = 1000000) head(liverReads) @ Note: The {\it system.file()} expression is used to locate the example data from Stadhouders et al. For custom data, simply entering the full or relative path as a string is sufficient, e.g. {\it csvFile <- "../myFolder/myFragmentData.csv"}. \centerline{} \subsection{Initialisation of a Scale4C object} As soon as the data is available in R, we can create the raw \Rclass{Scale4C} object. <>= liverData = Scale4C(rawData = liverReads, viewpoint = 21160072, viewpointChromosome = "chr10") liverData @ While the viewpoint of a 4C-seq experiment is usually clearly visible as its characteristic overrepresentation causes a visible peak in the data, including further reference points into near-cis plots may facilitate orientation. <>= poiFile <- system.file("extdata", "vp.txt", package="Scale4C") pointsOfInterest(liverData) <- addPointsOfInterest(liverData, read.csv(poiFile, sep = "\t", stringsAsFactor = FALSE)) head(pointsOfInterest(liverData)) @ \subsection{Smoothing, inflection points, fingerprint map, and singularities} \Rpackage{Scale4C} relies on two central functions to perform the scale-space transformation, \Rmethod{calculateFingerprintMap} and \Rmethod{findSingularities}. To create a fingerprint map, we smooth the 4C-seq signal with increasing values of $\sigma$ for the Gauss kernel. \Rmethod{plotInflectionPoints} can plot the smoothed data for a given $\sigma$, effectively showing a non-condensed 'slice' of the scale space and fingerprint map. <>= plotInflectionPoints(liverData, 2, fileName = "", plotIP = FALSE) @ For a $\sigma$ of $2$, both marked elements aside from the viewpoint are clearly located on peak regions. <>= plotInflectionPoints(liverData, 50, fileName = "", plotIP = FALSE) @ We can see that for a $\sigma$ of $50$, the $-36$ element of Stadhouders et al. has been smoothed over; the former peak is now located in a larger valley. The -81 element, however, is still located in a peak region. Note: Since we haven't actually calculated the fingerprint map and inflection points up to now, {\it plotIP} should be set to FALSE. In a similar manner, \Rmethod{calculateScaleSpace} and \Rmethod{calculateFingerprintMap} compute the whole fingerprint map, up to a provided maximum square $\sigma$. The fingerprint data is stored in the second assay of the {\it scaleSpace} slot of a \Rclass{Scale4C} object. \Rmethod{findSingularities} then identifies singular points of the inflection point contours in scale-space \cite{Witkin01}. <>= scaleSpace(liverData) = calculateScaleSpace(liverData, maxSQSigma = 10) head(t(assay(scaleSpace(liverData), 1))[,1:5]) @ To speed up the examples, we use a small maximum $\sigma$ for the given example. <>= liverData = calculateFingerprintMap(liverData, maxSQSigma = 10) singularities(liverData) = findSingularities(liverData, 1, guessViewpoint = FALSE) @ The resulting fingerprint map and singularity trace for a maxSQSigma of $2000$ are presented in figure \ref{fig:01}. \begin{figure}[h] \includegraphics[width=12cm,height=12cm]{images/Stadhouders_FL_Prom_traceback_2000_liverData.pdf} \caption{Fingerprint map with traced singularities for Stadhouders et al.'s example liver data \cite{Stadhouders01}. The viewpoint wasn't tracked, since the corresponding contours do not meet for the chosen $\sigma$, and the largest remaining singularity's contours were not traced correctly.} \label{fig:01} \end{figure} There are two issues with the fingerprint map: First, the viewpoint is not identified, because the $\sigma$ used is far too small for the inflection point contours to meet. Second, the largest identified singularity has wrong traces for both the left and right side: the dark gray tracking line doesn't match the corresponding inflection point curve or contour. Both problems should be corrected before tesselation. We can easily manipulate the list of singularities with R's usual data frame options, and load the corrected list back into our \Rclass{Scale4C} object. <>= data(liverData) tail(singularities(liverData)) # add viewpoint singularity singularities(liverData) = c(singularities(liverData), GRanges("chr10", IRanges(39, 42), "*", "sqsigma" = 2000, "left" = 40, "right" = 41, "type" = "peak")) # correct singularity's coordinates tempSingularities = singularities(liverData) tempSingularities$left[30] = 235 tempSingularities$right[30] = 250 tempSingularities$type[30] = "peak" singularities(liverData) = tempSingularities tail(singularities(liverData)) @ Note: There are a number of possible problems with the identification and the tracking of singularities. First, singularities can simply be missed altogether. This typically happens for few singularities with a high $\sigma$, i.e. huge remaining basic features of the data, when the upper part of the contours are so stretched that holes appear in the fingerprint map. While it is possible to adapt \Rmethod{calculateFingerprintMap} for a chance to catch these features (cp. manual page), it is often faster to just add a singularity and its corresponding left and right fragment at base $\sigma$ level manually. If the 4C-seq data shows the typical peak at the viewpoint, this allows to speed up the calculations for the fingerprint map a bit: By ignoring the viewpoint singularity at extremely high $\sigma$ and using its contours to visually identify its left and right extent, we can stop at a reasonable $\sigma$ and get a feasible tesselation. The example code above adds the viewpoint peak to the example data's list of singularities. Second, singularities can also be marked erroneously multiple times. This also tends to happen for very large and frazzled contours; either adapting \Rmethod{calculateFingerprintMap}'s {\it epsilon} or simply deleting the false singularities manually solves this problem. Identifying the correct singularity visually should be easy enough with \Rmethod{plotTraceback}'s output plots: A valid singularity is located on the top of a inflection point curve, and nowhere else. Third, $\sigma = 1$ tends to be difficult to trace due to the number of inflection point curves in close proximity and identified singularities (part of which are usually wrong). The results of the example's fingerprint map and singularity search after correction are shown in figure \ref{fig:02}. \begin{figure}[h] \includegraphics[width=12cm,height=12cm]{images/Stadhouders_FL_Prom_traceback_2000_liverDataVP.pdf} \caption{Fingerprint map after manual correction: The viewpoint is added, and the remaining singularities for Stadhouders et al.'s example liver data \cite{Stadhouders01} are traced correctly.} \label{fig:02} \end{figure} If the data looks 'normal', i.e. the 4C-seq signal creates mainly large 'peaks' in the fingerprint map, and no large 'valleys' are present, it is also possible to stop at a lower $\sigma$ and let the function combine the remaining single contours by setting {\it guessViewpoint = TRUE}. In this case, it is assumed that all remaining curves in the fingerprint map belong to peaks, and singularities are formed accordingly. Results for such a combination ($\sigma = 500$) are shown in figure \ref{fig:03}. \begin{figure}[h] \includegraphics[width=12cm,height=12cm]{images/Stadhouders_FL_Prom_traceback_500_liverData.pdf} \caption{Fingerprint map for a smaller maximum $\sigma$ of $500$ with automatically matched contours.} \label{fig:03} \end{figure} \subsection{Tesselation} If the fingerprint map and the singularity traceback appear to be satisfactory, the actual tesselation can be plotted. The example plot in figure \ref{fig:04} uses different colours to depict peaks (brown) and valleys (blue). Each singularity causes one parent feature to split into three child features (e.g. a single peak is split into peak-valley-peak); \Rmethod{plotTesselation} per default marks the central feature of a trio in a darker shade, and adjacent in a lighter one. Different colours for 'central' and 'adjacent' features allow for optical quality control of the tesselation, cp. manual page for \Rmethod{plotTesselation} and \cite{Witkin01, Lee01}. <>= # use pre-calculated example data with VP and correction data(liverDataVP) plotTesselation(liverDataVP, fileName = "") @ \begin{figure}[h] \includegraphics[width=13cm,height=13cm]{Scale4C-plotTesselation} \caption{Tesselation of the x-$\sigma$-plane, as provided by the traced singularities in the fingerprint map. A slice at $\sigma = 50$ corresponds to the smoothed data plot from before, with the -36 element located in a valley and -81 at the neighbouring peak.} \label{fig:04} \end{figure} The tesselation allows to assess the qualitative structure of the provided 4C-seq signal. The development of features through different smoothing scales can be traced, and comparisons between samples regarding their complexity become possible by comparing the number of singularities in a certain range of choice. Since 4C-seq is about contact intensities and chromosomal interactions, the most interesting regions tend to be peaks that persist for larger intervals of $\sigma$. We can also output the underlying structure as a simple table, with or without $\sigma$ $log2$ transformation. If {\it outputPeaks} is set to FALSE, the output table shows a list of all central features, as specified by singularities in the fingerprint map, and their corresponding adjacent left and adjacent right features with their range of persistence in $\sigma$, and left and right extent in fragments / genomic position. Otherwise, valleys are omitted from the output, and the data is exported as \Rpackage{GenomicRanges} object. The features' signals, i.e. the mean read counts for the identified intervals, are also provided. <>= head(outputScaleSpaceTree(liverDataVP, useLog = FALSE)) @ \begin{thebibliography}{} \bibitem[Stadhouders {\it et al}., 2012]{Stadhouders01} Stadhouders, R., Thongjuea, S., {\it et al} (2012) Dynamic long-range chromatin interactions control Myb proto-oncogene transcription during erythroid development, {\it EMBO}, {\bf 31}, 986-999. \bibitem[Witkin, 1983]{Witkin01} Witkin, A. (1983) Scale-space Filtering, {\it Proceedings of the Eighth International Joint Conference on Artificial Intelligence - Volume 2}, IJCAI'83, 1019-1022. \bibitem[Lee {\it et al}., 2013]{Lee01} Lee, J., Lee, U., Kim, B., et al. (2013) A computational method for detecting copy number variations using scale-space filtering, {\it BMC Bioinformatics}, {\bf 14}, 57. \end{thebibliography} \section{Session Information} <>= sessionInfo() @ \end{document}