% \VignetteIndexEntry{sarks-vignette} \documentclass[11pt]{article} \usepackage[dvipsnames]{xcolor} \usepackage{amsmath} \usepackage{amssymb} \usepackage{geometry} \usepackage{graphicx} \usepackage[hidelinks]{hyperref} \usepackage{microtype} \usepackage{natbib} \usepackage{parskip} \usepackage{Sweave} \geometry{ top=1in, inner=1.25in, outer=1.25in, bottom=1in, headheight=3ex, headsep=2ex, } \definecolor{darkgray}{gray}{0.3} \DefineVerbatimEnvironment{Sinput}{Verbatim}{formatcom={\color{MidnightBlue}}} \DefineVerbatimEnvironment{Soutput}{Verbatim}{formatcom={\color{darkgray}}} \DefineVerbatimEnvironment{Scode}{Verbatim}{formatcom={\color{midnightblue}}} \fvset{listparameters={\setlength{\topsep}{0pt}}} \renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}} \newcommand{\java}[1]{\texttt{\color{red}#1}} \newcommand{\R}[1]{\texttt{\color{MidnightBlue}#1}} \title{\textbf{S}uffix \textbf{Ar}ray \textbf{K}ernel \textbf{S}moothing for \emph{de novo} discovery of correlative sequence motifs and multi-motif domains: \\ the \R{sarks} package} \author{Dennis Wylie, Hans A. Hofmann, Boris V. Zemelman} \begin{document} \maketitle This vignette describes the R interface \R{sarks} to the java implementation of the \textbf{S}uffix \textbf{Ar}ray \textbf{K}ernel \textbf{S}moothing, or SArKS, algorithm for identification of correlative motifs and multi-motif domains (MMDs). \fbox{If you use \R{sarks} in your work, please cite \cite{wylie2019sarks} (see references).} \tableofcontents \section{How to install \R{sarks}} The \R{sarks} package relies on \java{Java} (1.8 or greater) through use of the \R{rJava} package. Once both of these are installed and correctly configured, you can install \R{sarks} by running the following code within an R session: <>= if(!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("sarks") @ \section{How to use \R{sarks}} \label{sec:how-to-use} \subsection{Starting \R{sarks}} \label{sec:starting} Because \R{sarks} is implemented using \R{rJava}, and because the default setting for the java virtual machine (JVM) heap space with \R{rJava} is quite low, you should initialize java with increased heap size \emph{before loading} \R{sarks} (or any other \R{rJava}-dependent R packages): <>= options(java.parameters = '-Xmx8G') ## sets to 8 gigabytes: modify as needed library(rJava) .jinit() ## --------------------------------------------------------------------- ## once you've set the java.parameters and then called .jinit from rJava ## (making sure not to load any other rJava-dependent packages prior to ## initializing the JVM with .jinit!), ## you are now ready to load the sarks library: ## --------------------------------------------------------------------- library(sarks) @ \subsection{Input} \label{sec:input} Aside from the specification of a few analysis parameters to be discussed below, \R{sarks} requires two pieces of input data: \begin{description} \item[sequences] which may be specified as either a: \begin{description} \item[FASTA formatted text file] (may be gzipped) or, equivalently, a \item[named character vector] \end{description} \item[scores] which may be specified as either a: \begin{description} \item[named numeric vector] (with names matching those of input sequences) or a \item[tsv] two-column tab-delimited file (containing header line), with columns \begin{enumerate} \item containing names matching those of input sequences and \item containing numeric scores assigned to those sequences \end{enumerate} \end{description} \end{description} For the purposes of this vignette, we'll take the sequences to be the named character vector \R{simulatedSeqs} included in the \R{sarks} package. \R{simulatedSeqs} is a character vector of length 30---representing 30 different (fake) DNA sequences---each 250 characters in length: <>= options(continue=" ") ## for vignette formatting, can be ignored data(simulatedSeqs) length(simulatedSeqs) table(nchar(simulatedSeqs)) @ Take a look at the first few characters of each sequence: <>= vapply(simulatedSeqs, function(s) {paste0(substr(s, 1, 10), '...')}, '') @ Notice that the names of the simulated sequences are just the (string representations of) the numbers \texttt{"0"} through \texttt{"29"}. Now let's look at the \R{simulatedScores}: <>= data(simulatedScores) simulatedScores @ Let's check that the names of the scores match those of the sequences: <>= all(names(simulatedScores) == names(simulatedSeqs)) @ Looking back at the scores themselves, note that for this simple illustrative example the scores are just defined to be 0 for the first 20 sequences (\texttt{"0"}--\texttt{"19"}) and 1 for the last 10 sequences (\texttt{"20"}--\texttt{"29"}). \emph{Note regarding SArKS input}: the scores input to SArKS are not required to be positive, nor are they required to be integers. Similarly, the sequences input to SArKS do not have to use the DNA alphabet. \subsubsection{SArKS run parameters} \label{sec:par-advice} Aside from input of data in the form of sequences with matching scores, \R{sarks} also requires specification of run parameters \R{halfWindow}, \R{spatialLength}, and \R{minGini}. I would recommend that users try several (2-4) values of \R{halfWindow} and, if interested in spatial smoothing, \R{spatialLength}, using the method described in \ref{sec:pars} below. For \R{halfWindow}, you might start with \R{halfWindow} values on the order of \begin{equation} \label{eq:halfwindow-recommendation} \text{\R{halfWindow}} \in \left\{\frac{n}{20}, \, \frac{n}{10}, \, \frac{n}{5}\right\} \end{equation} where $n$ is the number of input sequences. To get a sense of the intuition behind the values in \eqref{eq:halfwindow-recommendation}: If we take \R{halfWindow}=$\frac{n}{20}$, we get a full smoothing window of size 2*\R{halfWindow}+1 $\approx \frac{n}{10}$, which amounts to looking for motifs which would be expected to occur about once in every 10 input sequences. The user should of course feel free to vary these fractions in either direction based on his or her own data and goals! Especially if the number $n$ of input sequences is very high, so that $\frac{n}{20}$ is very large (say, in the thousands or more), these \R{halfWindow} values may be larger than optimal. Another consideration to keep in mind when choosing \R{halfWindow} values is that there is a link between $n$, the average sequence length $|w|$, \R{halfWindow}, the size (really entropy) of the sequence alphabet and the length $\hat{k}$ of the $k$-mer motif sequences \R{sarks} is likely to detect. Assuming an approximately uniformly distributed alphabet (often \emph{not} a valid assumption in practice) of $a$ distinct characters: \begin{equation} \label{eq:rough-kmer-length} \hat{k} \approx \log_a \, \frac{n * |w|}{\text{\R{halfWindow}}} \end{equation} should give a very rough sense of what length the motifs returned are likely to have (though when employing spatial smoothing with merging of consecutive motifs, this will be less accurate). Equation \eqref{eq:rough-kmer-length} is mosly useful for understanding how changes to \R{halfWindow} are likely to impact output $k$-mer lengths, not truly predicting the exact lengths which will be seen. Aside from the trivial value of \R{spatialLength=0} used to turn off spatial smoothing entirely, it is a bit more difficult to provide general guidelines for \R{spatialLength}. Small values (say, 3--10) can be useful to help detect individual motifs even when no larger spatial structure is really expected; in \cite{wylie2019sarks} we also tried \R{spatialLength=100} when studying regulation of gene expression as that is on the order of the low end of the length DNA enhancer regions. Users should feel free to experiment with \R{spatialLength} values that make sense in their own applications. Finally, for \R{minGini}, my current advice would be to start with the value \R{minGini=1.1} (or \R{minGini=0} to see what happens without this filter). Interested users should consult sections \ref{sec:gini} and \ref{sec:permutation-distribution} (especially \ref{sec:mingini}) as well as \cite{wylie2019sarks} to get a better understanding of what this parameter does in order to get a sense of how they might choose better values for their own applications. \subsection{Output} \label{sec:output} \R{sarks} then aims to identify short sequence motifs, and potentially longer sequence domains enriched in such motifs (MMDs), where the occurrence of the identified motifs in the input sequences is correlated with the numeric scores assigned to the sequences. Let's consider a minimal \R{sarks} workflow (which we'll discuss in more detail below) to illustrate the way in which this output is constructed): <>= sarks <- Sarks(simulatedSeqs, simulatedScores, halfWindow=4) filters <- sarksFilters(halfWindow=4, spatialLength=0, minGini=1.1) permDist <- permutationDistribution(sarks, reps=250, filters, seed=123) thresholds <- permutationThresholds(filters, permDist, nSigma=2.0) peaks <- kmerPeaks(sarks, filters, thresholds) peaks[ , c('i', 's', 'block', 'wi', 'kmer', 'windowed')] @ \subsubsection{$k$-mers} For now let's focus specifically on the column \R{peaks\$kmer}, which tells us that SArKS has identified <>= unique(peaks$kmer) @ as $k$-mers whose occurrence in \R{simulatedSeqs} is associated with higher values of \R{simulatedScores}. \R{sarks} provides a convenience function \R{kmerCounts}: <>= kmerCounts(unique(peaks$kmer), simulatedSeqs) @ which shows us that each of the identified $k$-mers appears exactly once in each of the higher-scoring sequences (\texttt{"20"}-\texttt{"29"}) and never in any of the lower-scoring sequences. Looking a bit more closely at the specific $k$-mers identified here, you can see that they are all substrings of the longest one, the 10-mer \texttt{"CATACTGAGA"}, so that the perfect agreement between the columns of the \R{kmerCounts} output above are perhaps to be expected. \R{sarks} provides functionality to help to identify such situations and simplify the output, as discussed further in section \ref{sec:redundancy} below. First let's go back to the output \R{peaks} from the call to \R{kmerPeaks} (focusing on the $8^{\text{th}}$ row of \R{peaks}): <>= peak8 = peaks[8, ] peak8[ , c('i', 's', 'block', 'wi', 'kmer', 'windowed')] @ and consider the columns: \begin{description} \item[\R{i}] suffix array index \item[\R{s}] suffix array value $s_i$ \item[\R{block}] the name of the input sequence from which the indicated $k$-mer is derived \item[\R{wi}] the (0-based!) position $\omega_i$ of the $k$-mer within the indicated block/sequence \end{description} Considering \R{block} and \R{wi} first, this tells us that, setting <>= block <- simulatedSeqs[[peak8$block]] kmerStart <- peak8$wi + 1 kmerEnd <- kmerStart + nchar(peak8$kmer) - 1 @ (noting the \R{+ 1} required to define \R{kmerStart} in 1-based R relative to 0-based \R{wi}) we should find that <<>>= substr(block, kmerStart, kmerEnd) @ yields the same value as <<>>= peak8$kmer @ \subsubsection{Reducing redundancy in \R{peaks}} \label{sec:redundancy} In cases such as \texttt{"CATACTGAGA"} in the simulated data set analyzed here, \R{sarks} can simplify $k$-mer output using a couple of auxiliary functions: <>= nonRedundantPeaks <- mergedKmerSubPeaks(sarks, filters, thresholds) nonRedundantPeaks[ , c('i', 's', 'block', 'wi', 'kmer', 'windowed')] @ Here \R{mergedKmerSubPeaks} removes redundant $k$-mer output where multiple $k$-mer peaks are reported with successive spatial \R{s} coordinates (e.g., the reported peak \R{peaks[3, ]} from \R{block="22"} at \R{wi=195} and \R{kmer="ATACTGAGA"} immediately follows \R{peaks[8, ]} with \R{block="22"} and \R{wi=194} and \R{kmer="CATACTGAGA"}, and is thus merged with that peak). Further simplification is still possible in this case using: <>= extendedPeaks <- extendKmers(sarks, nonRedundantPeaks) extendedPeaks[ , c('i', 's', 'block', 'wi', 'kmer', 'windowed')] @ which detects that even though the full occurrence of \texttt{"CATACTGAGA"} in input sequence \texttt{"24"} was not reported as a $k$-mer by SArKS, the $k$-mer \texttt{"ACACTGAG"} in \R{nonRedundantPeaks[1, ]} is flanked in sequence \texttt{"24"} by a \texttt{"C"} to the left and an \texttt{"A"} to the right and can hence be extended to math the full motif. \subsubsection{Suffix arrays} \label{sec:suffix array} How are we to interpret the columns \R{i} and \R{s} in \R{peaks}? This requires understanding a bit more about SArKS: Specifically, that the method begins by concatenating all of the input sequences into one big string (called \java{catSeq} in the underlying java object referenced by \R{sarks}): <>= concatenated <- sarks$getCatSeq() nchar(concatenated) @ \R{concatenated} is actually a bit longer than the sum of the lengths of the input sequences because it keeps track of where one sequence ends and another begins using a special (dollar-sign) character. In this way, \R{concatenated} is divided into separate ``blocks,'' each corresponding to one of the input sequences. The column \R{s} in \R{peaks} then gives us the (0-based!) position of the annotated $k$-mer in the concatenated sequence: <<>>= kmerCatStart <- peak8$s + 1 kmerCatEnd <- kmerCatStart + nchar(peak8$kmer) - 1 substr(concatenated, kmerCatStart, kmerCatEnd) @ But what about \R{i}? As we will confirm, it is the (0-based) position of the suffix <<>>= theSuffix <- substr(concatenated, kmerCatStart, nchar(concatenated)) @ in the list of all suffixes \emph{after they have been lexicographically sorted}: <<>>= extractSuffix <- function(s) { ## returns suffix of concatenated starting at position s (1-based) substr(concatenated, s, nchar(concatenated)) } allSuffixes <- vapply(1:nchar(concatenated), extractSuffix, '') sortedSuffixes <- sort(allSuffixes) @ Sure enough, <<>>= i1based <- peak8$i + 1 sortedSuffixes[i1based] == theSuffix @ \subsubsection{Window averaging (kernel smoothing)} \label{sec:kern-smooth} Because the suffixes in \R{sortedSuffixes} are sorted, the suffixes in a window centered on \R{i1based} all start with the same few characters (a $k$-mer): <<>>= iCenteredWindow <- (i1based - 4):(i1based + 4) iCenteredWindowSuffixes <- sortedSuffixes[iCenteredWindow] all(substr(iCenteredWindowSuffixes, 1, 10) == 'CATACTGAGA') @ For each suffix in \R{sortedSuffixes}, we can identify which input sequence contributed the block of \R{concatenated} where the suffix begins: <>= iCenteredWindow0Based <- iCenteredWindow - 1 sourceBlock(sarks, i=iCenteredWindow0Based) @ Note that the 9 suffixes in \R{iCenteredWindowSuffixes} derive from come from 9 of the 10 higher-scoring sequences (all but \texttt{"23"}). This is not an accident: since the motif \texttt{"CATACTGAGA"} is only present in high-scoring sequences, the suffixes starting with \texttt{"CATACTGAGA"} must derive from blocks associated with high-scoring sequences. Thus the \emph{average} score of the sequences contributing the suffixes specified by \R{iCenteredWindowSuffixes} must be high (value of 1) as well. The SArKS algorithm turns this around to identify motifs by hunting for windows in the sorted suffix list where the average score of the corresponding \R{sourceBlock} sequences is high. The \R{windowed} column of the \R{peaks} output contains the average \R{sourceBlock} sequence scores for the windows centered around each sorted suffix position \R{i} and extending to both the left and right by \R{halfWindow=4} positions. These average values are also referred to as $\hat{y}_i$ in \cite{wylie2019sarks}; a vector containing all of these values can be obtained from the object \R{sarks}: <>= yhat <- sarks$getYhat() i0based <- seq(0, length(yhat)-1) plot(i0based, yhat, type='l', xlab='i') @ From the plot of $\hat{y}_i$ against $i$, we can see three peaks at which $\hat{y}_i$ spikes up to 1, corresponding to the ranges $i \in [1458,1463]$, $i \in [2256,2258]$, and $i \in [5862,5865]$ which make up the values in \R{peaks\$i}. \subsection{Permutations and thresholds} \label{sec:perm} How do we know that a given value of $\hat{y}_i$ (or, equivalently, \R{peaks\$windowed}) is high enough to include in the \R{peaks} output? In order to set such a threshold without having to make possibly unwarranted assumptions about the structure of the input sequences or the distribution of the scores, SArKS employs a permutational approach. To illustrate the idea here, let's randomly permute the scores---so that the permuted scores have no relationship with which sequences contain \texttt{"CATACTGAGA"}---and construct a new \R{Sarks} object using the permuted scores: <>= set.seed(12345) scoresPerm <- sample(simulatedScores) names(scoresPerm) <- names(simulatedScores) sarksPerm <- Sarks(simulatedSeqs, scoresPerm, halfWindow=4) @ If we try using the same procedure we did before to look for $k$-mer peaks: <>= permDistNull <- permutationDistribution( sarksPerm, reps=250, filters, seed=123) thresholdsNull <- permutationThresholds( filters, permDistNull, nSigma=2.0) peaksNull <- kmerPeaks(sarksPerm, filters, thresholdsNull) peaksNull[ , c('i', 's', 'block', 'wi', 'kmer', 'windowed')] @ we find that SArKS does not detect anything worthy of reporting after we disrupt any association between input sequences and input scores by permuting the scores. This is to be expected given that the thresholds $\theta$ used by SArKS are defined by considering many (here, \R{reps=250}) such permutations and choosing $\theta$ such that only very few random permutations would produce smoothed $\hat{y}_i$ scores exceeding $\theta$. The adjustable parameter \R{nSigma} controls how stringent the thresholds are: higher \R{nSigma} values lead to higher thresholds, reducing sensitivity but also reducing the false positive rate. Once we've set thresholds using \R{permutationThresholds}, we can estimate the false positive rate, defined here as the frequency of seeing nonempty $k$-mer result sets when there is really no association between sequence and score: <>= fpr = estimateFalsePositiveRate( sarks, reps=250, filters, thresholds, seed=321) fpr$ci @ indicating a point estimate of \Sexpr{signif(100*fpr$ci$mean, 2)}\% for the false positive rate, with a 95\% confidence interval of (\Sexpr{signif(100*fpr$ci$lower, 2)}\%, \Sexpr{signif(100*fpr$ci$upper, 2)}\%). This confidence interval can be made tighter by using a higher number for \R{reps} in the \R{estimateFalsePositiveRate} call. \emph{Note regarding random number generator seed for \R{estimateFalsePositiveRate}}: do not use the same seed for \R{estimateFalsePositiveRate} as was used in \R{permutationDistribution} call used to set thresholds. The false positive rate estimation procedure assumes that the random permutations used in \R{estimateFalsePositiveRate} are independent of those used to set thresholds. \subsubsection{Gini impurity filter} \label{sec:gini} When considering how to set SArKS thresholds in order to obtain a reasonable tradeoff between sensitivity and false positive rate, the mysterious \R{minGini} parameter should be factored in as well. This parameter is discussed in more detail in sections \ref{sec:windgini} and \ref{sec:mingini}, but the basic idea is to filter likely false positive suffix array index positions $i$ out of consideration regardless of their smoothed scores $\hat{y}_i$. These likely false positive indices $i$ come from excessive repetition of the same input sequences contributing repeatedly to the smoothing windows centered on $i$. My recommendation is to set \R{minGini} to a value slightly greater than 1: the value 1.1 seems empirically to work well in many situations. Note that for \R{minGini} > 1, this filter becomes more stringent the closer to 1 it is set; the opposite is true if you set \R{minGini} to a value less than 1, and you can turn the filter off entirely by setting it to 0. See section \ref{sec:mingini} and \cite{wylie2019sarks} for more details. Here we examine the effects of this parameter using the simulated data: <>= estimateFalsePositiveRate( sarks, reps=250, filters, thresholds, seed=321)$ci filtersNoGini <- sarksFilters(halfWindow=4, spatialLength=0, minGini=0) estimateFalsePositiveRate( sarks, reps=250, filtersNoGini, thresholds, seed=321)$ci @ Of course we could compute a new set of \R{thresholdsNoGini} using \R{filtersNoGini}, but the results of this on our ability to detect anything are worth considering: <>= permDistNoGini <- permutationDistribution(sarks, reps=250, filtersNoGini, seed=123) thresholdsNoGini <- permutationThresholds(filtersNoGini, permDistNoGini, nSigma=2.0) peaksNoGini <- kmerPeaks(sarks, filtersNoGini, thresholdsNoGini) peaksNoGini[ , c('i', 's', 'block', 'wi', 'kmer', 'windowed')] @ \subsubsection{Multithreading \R{sarks} permutational analyses} \label{sec:multithreading} Setting thresholds and estimating false positive rates using permutation methods is usually the most time-consuming step in the SArKS workflow. To facilitate more rapid turnaround, the java back end of \R{sarks} supports multithreading for these processes. In order to take advantage of multithreading, you just need to specify how many threads you'd like to use when you invoke the \R{Sarks} constructor using the \R{nThreads} argument to that function; all permutation steps performed using the resulting \R{sarks} object will then use the specified number of threads. \emph{Note regarding \R{sarks} multithreading}: while the different threads performing the permutational analyses share as many data structures as possible to reduce the memory requirements of \R{sarks}, each thread will need to keep track of its own permuted smoothed scores (and spatially smoothed scores if necessary). This can increase memory requirements quickly, so make sure you are running \R{sarks} on a machine with large RAM if you are planning to use many threads on a large data set. \subsection{Spatial smoothing} \label{sec:spatial} So far we have not taken advantage of the spatial smoothing features in SArKS. While one of the major uses of spatial smoothing is to detect multi-motif domains (MMDs), which are not present in the simple simulated data set included in the \R{sarks} package, spatial smoothing can also help sharpen the ability of SArKS to detect individual motifs: <>= sarks <- Sarks( simulatedSeqs, simulatedScores, halfWindow=4, spatialLength=3 ) filters <- sarksFilters(halfWindow=4, spatialLength=3, minGini=1.1) permDist <- permutationDistribution(sarks, reps=250, filters, seed=123) thresholds <- permutationThresholds(filters, permDist, nSigma=5.0) ## note setting of nSigma to higher value of 5.0 here! peaks <- kmerPeaks(sarks, filters, thresholds) peaks[ , c('i', 's', 'block', 'wi', 'kmer', 'spatialWindowed')] @ Note that here we use \R{nSigma=5.0}, a much more stringent setting than the \R{nSigma=2.0} value used when no spatial smoothing was employed; had we tried \R{nSigma=5.0} without spatial smoothing, SArKS would not have been able to detect the \texttt{"CATACTGAGA"} motif. The \R{peaks} object returned by \R{kmerPeaks} when spatial smoothing is employed contains the \R{i} and \R{s} coordinates for the left endpoints of spatial windows (windows in \R{s}-space, not \R{i}-space) enriched in high $\hat{y}$ scores. Especially when these windows have longer \R{spatialLength} values, it can also be useful to pick out individual motif ``subpeaks'' within these spatial windows: <>= subpeaks <- mergedKmerSubPeaks(sarks, filters, thresholds) subpeaks[ , c('i', 's', 'block', 'wi', 'kmer')] @ In this case, with \R{spatialLength=3}, this is not such an important step---although it does serve to accomplish some simplification of results as was described when \R{mergedKmerSubPeaks} was first introduced in section \ref{sec:redundancy}---but in general it is a critical piece of the SArKS methodology when spatial smoothing is in use. The \R{subpeaks} output of \R{mergedKmerSubPeaks} should generally be regarded as the main individual motif output for SArKS. Section \ref{sec:merged-kmer-subpeaks} provides more details. When employing spatial smoothing to identify MMDs, you may want to inspect individual input sequences of interest by visualizing the SArKS results: <>= block22 <- blockInfo(sarks, block='22', filters, thresholds) library(ggplot2) ggo <- ggplot(block22, aes(x=wi+1)) ## +1 because R indexing is 1-based ggo <- ggo + geom_point(aes(y=windowed), alpha=0.6, size=1) ggo <- ggo + geom_line(aes(y=spatialWindowed), alpha=0.6) ggo <- ggo + geom_hline(aes(yintercept=spatialTheta), color='red') ggo <- ggo + ylab('yhat') + ggtitle('Input Sequence "22"') ggo <- ggo + theme_bw() print(ggo) @ which here clearly shows the spike at the location \R{wi+1=195} of the \texttt{"CATACTGAGA"} motif in sequence \texttt{"22"}. The use of the more stringent \R{nSigma=5.0} setting reduces the false positive rate relative to the \R{nSigma=2.0} setting we used when not employing spatial smoothing: <>= estimateFalsePositiveRate( sarks, reps=250, filters, thresholds, seed=321)$ci @ \subsection{Varying SArKS parameters} \label{sec:pars} You may be wondering what the point of the \R{filters} object created in the SArKS workflow is, as it seems to specify the \R{halfWindow} and \R{spatialLength} parameters in a manner redundant to their specification in the \R{Sarks} constructor. Aside from the specification of the \R{minGini} parameter, the motivation for this step is that we usually don't know exactly what \R{halfWindow} or \R{spatialLength} values will yield the most useful output (and the answer to these questions will be different for different data sets and different questions). To address this, we can use \R{sarksFilters} to test a variety of combinations of \R{halfWindow} and \R{spatialLength} (and, if so desired, \R{minGini} as well) values like so: <>= filters <- sarksFilters( halfWindow=c(4, 8), spatialLength=c(2, 3), minGini=1.1) permDist <- permutationDistribution(sarks, reps=250, filters, seed=123) thresholds <- permutationThresholds(filters, permDist, nSigma=5.0) peaks <- mergedKmerSubPeaks(sarks, filters, thresholds) peaks[ , c('halfWindow', 'spatialLength', 'i', 's', 'block', 'wi', 'kmer')] @ Note that, as is generally true with SArKS, the results obtained using larger \R{halfWindow} values tend to have shorter identified $k$-mers. As always, testing multiple parameter sets can increase false positive rates: <>= estimateFalsePositiveRate( sarks, reps=250, filters, thresholds, seed=321)$ci @ As suggested in section \ref{sec:spatial} above, this can be countered by using increased \R{nSigma} values when setting thresholds if we plan to test a wider range of possible SArKS parameters. \emph{Note regarding multiple hypothesis testing in SArKS}: The false positive rates estimated by SArKS test the rate of detecting any results using any of the specified parameter settings, and are thus a type of family-wise error rate. As long as all parameter combinations tested are included in the \R{filters} employed in \R{estimateFalsePositiveRate}, no further adjustment for multiple testing should be applied to the estimated false positive rates. \subsection{Clustering similar $k$-mers into broader motifs} \label{sec:kmer-clustering} While there is essentially one unambiguous $k$-mer \texttt{"CATACTGAGA"} of note in the \R{simulatedSeqs}-\R{simulatedScores} example, real data sets will generally have more complex sequence motifs allowing for some variation both in length and composition of the patterns. When applying SArKS methodology to real data, many similar $k$-mers representing the same basic motif may result: \R{sarks} provides functions for clustering these $k$-mers into broader motif patterns. For example, let's say we had obtained the following $k$-mers using \R{sarks}: <>= kmers <- c( 'CAGCCTGG', 'CCTGGAA', 'CAGCCTG', 'CCTGGAAC', 'CTGGAACT', 'ACCTGC', 'CACCTGC', 'TGGCCTG', 'CACCTG', 'TCCAGC', 'CTGGAAC', 'CACCTGG', 'CTGGTCTA', 'GTCCTG', 'CTGGAAG', 'TTCCAGC' ) @ We could cluster these like so: <>= kmClust <- clusterKmers(kmers, directional=FALSE) ## directional=FALSE indicates that we want to treat each kmer ## as equivalent to its reverse-complement kmClust @ The resulting object \R{kmClust} is a named list: each element of this list is a character vector listing the elements of the vector \R{kmers} composing the corresponding cluster, while the name of the cluster is a $k$-mer from \R{kmers} found to be particularly representative of the cluster. \R{sarks} then allows us to count how many times each motif (=cluster of $k$-mers) occurs in each sequence: <>= clCounts <- clusterCounts(kmClust, simulatedSeqs, directional=FALSE) ## directional=FALSE to count hits of either a kmer from the cluster ## or the reverse-complement of such a kmer ## clCounts is a matrix with: ## - one row for each sequence in simulatedSeqs ## - one column for each *cluster* in kmClust head(clCounts) @ Can also get specific information about the location of these matches: <>= clLoci <- locateClusters( kmClust, simulatedSeqs, directional=FALSE, showMatch=TRUE ) ## showMatch=TRUE includes column specifying exactly what k-mer ## registered as a hit; ## this can be very slow, so default is showMatch=FALSE clLoci @ This shows us that, for example, there is one match for the cluster \texttt{"CAGCCTGG"} spanning sequence characters 143-150 of sequence \texttt{"25"}, and that this is an exact match of the $k$-mer \texttt{"CAGCCTGG"} in its forward orientation. These results also tells us that there are three matches of $k$-mers from the cluster \texttt{"CTGGAAC"}, in sequences \texttt{"6"}, \texttt{"16"}, and \texttt{"28"}. The specific $k$-mers found are different in each case: \begin{itemize} \item sequence \texttt{"6"} has a hit for the reverse-complement of $k$-mer \texttt{"TTCCAGC"}, while \item sequence \texttt{"16"} has a hit for $k$-mer \texttt{"CTGGAACT"} in reverse-complement orientation and \item sequence \texttt{"28"} has a hit for \texttt{"CTGGAAC"} also in reverse-complement orientation. \end{itemize} \section{How \R{sarks} works} \label{sec:how-it-works} \subsection{\R{Sarks} constructor} \label{sec:sarks-constructor} <>= sarks = Sarks( simulatedSeqs, simulatedScores, halfWindow=4, spatialLength=3, nThreads=4 ) @ The object \R{sarks} created by the \R{Sarks} constructor is an R representation of a java object which contains several attributes worth noting: these are described in the next few subsections, the headings of which match the names of the corresponding java Sarks class attributes. \subsubsection{\java{catSeq}} \label{sec:catseq} The concatenated sequence $x = w_0 * w_1 * \cdots * w_{n-1}$, which may be extracted using the R code \R{sarks\$getCatSeq()}. \subsubsection{\java{sa}} The suffix array $s_i$ mapping sorted suffix index $i$ to spatial position $s_i$ in the concatenated sequence $x$. Calculated using the java code internally implemented in \java{SkewSuffixArray} class implementing the skew algorithm initially described in \cite{karkkainen2003simple}. The suffix array may be extracted via the R code \R{sarks\$getSuffixArray()}. \subsubsection{\java{saInv}} \label{sec:sainv} The inverted suffix array $i_s$ mapping spatial position $s$ of the suffix formed by deleting the first $s$ characters of the concatenated sequence $x$ to the position $i$ of this suffix in the sorted list of all suffixes of $x$. The inverted suffix array may be extracted via the R code \R{sarks\$getInvertedSuffixArray()}. \subsubsection{\java{scores}} The array of sequence (block) scores $y_b$. May be extracted via R code \R{blockScores(sarks)}. \subsubsection{\java{windowed}} The kernel- (or window-)smoothed scores $\hat{y}_i$ defined by \begin{equation} \label{eq:yhat} \hat{y}_i = \frac{1}{2 \kappa + 1} \sum_{j=i-\kappa}^{i+\kappa} y_{b_j} \end{equation} where: \begin{description} \item[$\kappa$] is the specified \R{halfWindow} value and \item[$b_j$] is the input sequence block in which suffix with suffix array index $j$ begins. In R, the value of $b_j$ for any suffix array index $j$ can be assessed using the code \R{sourceBlock(sarks, i=j)}. \end{description} The array of kernel-smoothed scores $\hat{y}_i$ can be extracted using the R code \R{sarks\$getYhat()}. \subsubsection{\java{spatialWindowed}} The spatially-smoothed scores $\hat{\hat{y}}_i$ (here indexed by suffix array index $i$, not spatial position $s_i$) defined by \begin{equation} \label{eq:ydoublehat} \hat{\hat{y}}_i = \frac{1}{\lambda} \sum_{s=s_i}^{s_i+\lambda-1} \hat{y}_{i_s} \end{equation} where: \begin{description} \item[$\lambda$] is the specified \R{spatialLength} value and \item[$i_s$] is the sorted suffix index associated with the suffix starting at spatial position $s$ (obtained from the inverted suffix array described in section \ref{sec:sainv} above). \end{description} The array of spatially-smoothed scores $\hat{\hat{y}}_i$ can be extracted using the R code \R{sarks\$getYdoubleHat()}. \subsubsection{\java{windGini}} \label{sec:windgini} Array whose $i^{\text{th}}$ value is the Gini impurity value $g_i$ of smoothing window centered on suffix array index $i$, defined by \begin{equation} \label{eq:gini} g_i = \sum_b {f^{(i)}_b \left( 1 - f^{(i)}_b \right)} \end{equation} where: \begin{description} \item[$f^{(i)}_b = \frac{1}{2 \kappa + 1}{\sum\limits_{j=i-\kappa}^{j+\kappa} \delta_{b_j b}}$] is the fraction of positions within the smoothing window associated with sequence (or block) $b$ (note $\delta_{b_j b}$ is Kronecker delta taking value 1 if $b_j=b$ and 0 otherwise). \end{description} Low values of the Gini impurity values $g_i$ are used by SArKS to filter out likely false positive suffix array positions $i$: First note that equation \eqref{eq:yhat} may be rewritten \begin{equation} \label{eq:yhat-alt} \hat{y}_i = \frac{1}{2 \kappa + 1} \sum_b {f^{(i)}_b y_b} \end{equation} Now: \begin{itemize} \item shuffling the sequence/block scores $y_b$ by a random permutation $\Pi$, \item recomputing the resulting smoothed scores $\hat{y}^{(\Pi)}_i$ using equation \eqref{eq:yhat-alt}, \item noting that by symmetry $\mathbb{V}\left[y_{\Pi(b)}\right] = \mathbb{V}\left[y_{\Pi(b')}\right]$ for all sequences $b, b'$ under random permutation $\Pi$ and \item neglecting the small interdepence between $y_{\Pi(b)}$ and $y_{\Pi(b')}$ for distinct sequences $b \text{ and } b'$, \end{itemize} we can approximate \begin{equation} \label{eq:var-rel-gini} \mathbb{V} \left[ \hat{y}^{(\Pi)}_i \right] \propto \left[ f^{(i)}_b \right]^2 = 1 - g_i \end{equation} Equation \eqref{eq:var-rel-gini} tells us that, even under the null hypothesis of no association between sequence $w_b$ and sequence score $y_b$, at positions $i$ for which the Gini impurity $g_i$ is particularly far below the maximum value of 1, the smoothed scores will tend to be more extreme (high or low) than at other positions. Hence a high score $\hat{y}_i$ for such a position is less informative than would be the same score at a different position $j$. The array of Gini impurity values $g_i$ can be extracted using the R code \R{sarks\$getGini()}. \subsubsection{\java{spatGini}} Array whose $i^{\text{th}}$ value is the spatially averaged Gini impurity value $\bar{g}_i$ of smoothing window centered on suffix array index $i$, defined by \begin{equation} \bar{g}_i = \frac{1}{\lambda} \sum_{s=s_i}^{s_i+\lambda-1} g_{i_s} \end{equation} The spatially averaged Gini impurities are again used to filter out likely false-positive positions $i$, this time when spatial smoothing is employed to calculate $\hat{\hat{y}}_i$ values. The idea behind this filter is that, now neglecting the interdependence between $\hat{y}^{(\Pi)}_i$ and $\hat{y}^{(\Pi)}_{j}$ for distinct suffix array indices $i$ and $j$, we can get a crude estimate of the variance \begin{equation} \label{eq:var-spat} \mathbb{V} \left[ \hat{\hat{y}}^{(\Pi)}_i \right] = \mathbb{V} \left[ \frac{1}{\lambda} \sum_{s=s_i}^{s_i+\lambda-1} \hat{y}^{(\Pi)}_{i_s} \right] \end{equation} from \begin{equation} \label{eq:var-spat-approx} \frac{1}{\lambda^2} \sum_{s=s_i}^{s_i+\lambda-1} \mathbb{V} \left[ \hat{y}^{(\Pi)}_{i_s} \right] \propto 1 - \bar{g}_i \end{equation} The degree of approximation involved in the independence assumption required to pass from equation \eqref{eq:var-spat} to equation \eqref{eq:var-spat-approx} depends on how dissimilar the smoothing window compositions $f^{(i)}_b$ are for the various suffix array indices $i$ appearing in equation \eqref{eq:var-spat}. For the sake of simplicity and tractability, the approach taken in the current implementation of SArKS is to employ the spatial Gini filter as a heuristic parameter whose utility is to be assessed empirically via permutation testing. \subsection{\R{permutationDistribution}} \label{sec:permutation-distribution} Once the object \R{sarks} has been constructed using the \R{Sarks} constructor function, we can select a range of \R{halfWindow}, \R{spatialLength}, and \R{minGini} parameter values for motif and MMD selection. The R function \R{sarksFilters} puts all combinations of values of these values (specificied as numeric vectors in R) into a java object to pass along to downstream SArKS functions including \R{permutationDistribution}: <>= filters <- sarksFilters(halfWindow=4, spatialLength=c(0, 3), minGini=1.1) permDist <- permutationDistribution( sarks, reps=250, filters=filters, seed=123) @ Let $\left(\kappa^{(\alpha)}, \lambda^{(\alpha)}, g^{(\alpha)}_{\text{min}}\right)$ be the resulting SArKS \R{halfWindow}, \R{spatialLength}, and \R{minGini} parameters, with $\alpha$ here indexing the set of desired combinations. The \R{permutationDistribution} call generates \R{reps=250} random permutations $\pi_r$: \begin{itemize} \item for each of these permutations $\pi_r$ and \item for each combination of parameters $\left(\kappa^{(\alpha)}, \lambda^{(\alpha)}, g^{(\alpha)}_{\text{min}}\right)$, \end{itemize} it then computes \begin{itemize} \item the smoothed scores $\hat{y}^{(\alpha,\pi_r)}_i$ and, \item if applicable, the spatially-smoothed scores $\hat{\hat{y}}^{(\alpha,\pi_r)}_i$. \end{itemize} The resulting \R{permDist} object is a named list in R. The first two elements, `windowed' and `spatial' are data.frames with \R{reps=250} rows per combination of parameters in \R{filters}. Both of these have a column named `max' containing either \begin{itemize} \item \R{permDists\$windowed\$max}: \begin{equation} \label{eq:yhat-max} \hat{y}^{(\alpha, \pi_r)}_{\text{max}} \,=\, \underset{\{i \;\mid\; g_i \,\geq\, g^{(\alpha)}_{\text{min}}\}}{\operatorname{max}} \hat{y}^{(\alpha,\pi_r)}_i \end{equation} \item \R{permDists\$spatial\$max}: \begin{equation} \label{eq:ydoublehat-max} \hat{\hat{y}}^{(\alpha, \pi_r)}_{\text{max}} \,=\, \underset{\{i \;\mid\; \bar{g}_i \,\geq\, g^{(\alpha)}_{\text{min}}\}}{\operatorname{max}} \hat{\hat{y}}^{(\alpha,\pi_r)}_i \end{equation} \end{itemize} \subsubsection{Specification of $g^{(\alpha)}_{\text{min}}$ in \R{sarks}} \label{sec:mingini} SArKS currently allows the user to directly set $g^{(\alpha)}_{\text{min}} < 1$ excluding suffix array index values $i$ with $g^{(\alpha)}_i < g^{(\alpha)}_{\text{min}}$ from permutational analysis and $\hat{y}_i$ peak calling if so desired. Given the interpretation afforded by equation \eqref{eq:var-rel-gini}, however, it can be more convenient to indirectly specify $g^{(\alpha)}_{\text{min}}$ by choosing $\gamma$ in the following: \begin{equation} \label{eq:gmin-selection} 1 - g^{(\alpha)}_{\text{min}} = (1+\gamma) \left(1 - \underset{i}{\operatorname{median}} \, g^{(\alpha)}_i\right) \end{equation} Together equations \eqref{eq:var-rel-gini} and \eqref{eq:gmin-selection} imply that fixing $\gamma$ restricts analysis to suffix indices $i$ for which the variance of the permuted smoothed scores is at most $(1+\gamma)$ times the median such variance. If \R{minGini} is set to a value $\geq 1$, each of the $g^{(\alpha)}_{\text{min}}$ values will be set using equation \eqref{eq:gmin-selection} with $\gamma = $ \R{minGini} $\!- 1$. \subsection{\R{permutationThresholds}} <>= thresholds = permutationThresholds( filters=filters, permDist=permDist, nSigma=5.0) @ SArKS uses a simple method for setting thresholds $\theta^{(\alpha)}$ or $\theta^{(\alpha)}_{\text{spatial}}$ based on the set of randomly generated permutations $\left\{\pi_r\right\}$: \begin{align} \label{eq:theta} \theta^{(\alpha)} &= \underset{r}{\operatorname{mean}}\left\{ \hat{y}^{(\alpha, \pi_r)}_{\text{max}} \right\} + z \; \underset{r}{\operatorname{stdev}}\left\{ \hat{y}^{(\alpha, \pi_r)}_{\text{max}} \right\} \\ \label{eq:theta-spatial} \theta^{(\alpha)}_{\text{spatial}} &= \underset{r}{\operatorname{mean}}\left\{ \hat{\hat{y}}^{(\alpha, \pi_r)}_{\text{max}} \right\} + z \; \underset{r}{\operatorname{stdev}}\left\{ \hat{\hat{y}}^{(\alpha, \pi_r)}_{\text{max}} \right\} \end{align} The quantity $z$ appearing in equations \eqref{eq:theta}-\eqref{eq:theta-spatial} is specified by the \R{nSigma} argument to \R{permutationThresholds}. Higher values of $z$ trade reduced sensitivity for lower false positive rates. As currently implemented, equation \eqref{eq:theta} is only applied for parameter sets $\alpha$ for which no spatial smoothing is done (encoded as \R{spatialLength} or $\lambda^{(\alpha)}=0$). For those parameter sets $\alpha$ such that $\lambda^{(\alpha)} > 1$, equation \eqref{eq:theta-spatial} is instead applied to obtain $\theta^{(\alpha)}_{\text{spatial}}$, with $\theta^{(\alpha)} = -\infty$. \subsection{\R{kmerPeaks}} <>= peaks = kmerPeaks(sarks, filters=filters, thresholds=thresholds) @ If the option \R{peakify} is turned on (set to \R{TRUE}, as it is by default), the set of suffix array indices $I^{(\alpha)}$ or $J^{(\alpha)}_{\text{spatial}}$---depending on whether spatial smoothing is employed or not---defining the SArKS peak set for parameter combination $\alpha$ is determined by: \begin{align} \label{eq:I} I^{(\alpha)} &= \left\{i \; \middle| \; \left( \hat{y}^{(\alpha)}_i \geq \theta^{(\alpha)} \right) \, \land \, \left( \hat{y}^{(\alpha)}_{\eta(i)} \leq \hat{y}^{(\alpha)}_i \geq \hat{y}^{(\alpha)}_{\rho(i)} \right) \, \land \, \left( g^{(\alpha)}_i \geq g^{(\alpha)}_{\text{min}} \right) \right\} \\ \label{eq:J} J^{(\alpha)}_{\text{spatial}} &= \left\{i \; \middle| \; \left( \hat{\hat{y}}^{(\alpha)}_i \geq \theta^{(\alpha)}_{\text{spatial}} \right) \, \land \, \left( \hat{\hat{y}}^{(\alpha)}_{\eta(i)} \leq \hat{\hat{y}}^{(\alpha)}_i \geq \hat{\hat{y}}^{(\alpha)}_{\rho(i)} \right) \, \land \, \left( \bar{g}^{(\alpha)}_i \geq g^{(\alpha)}_{\text{min}} \right) \right\} \end{align} where: \begin{description} \item[$\eta(i)$] is the negative spatial shift operator defined by $\eta(i)=i_{s_i-1}$, and \item[$\rho(i)$] is the positive spatial shift operator defined by $\rho(i)=i_{s_i+1}$. \end{description} The condition that $\hat{y}^{(\alpha)}_{\eta(i)} \leq \hat{y}^{(\alpha)}_i \geq \hat{y}^{(\alpha)}_{\rho(i)}$ (or similar for $\hat{\hat{y}}^{(\alpha)}_{\eta(i)}$) restricts the peak set to only those suffix array indices $i$ for which there is not a higher smoothed score immediately spatially adjacent in either direction. If \R{peakify} is disabled, this condition is not required, so that instead \begin{align} \label{eq:I-no-peakify} I^{(\alpha)} &= \left\{i \; \middle| \; \left( \hat{y}^{(\alpha)}_i \geq \theta^{(\alpha)} \right) \, \land \, \left( g^{(\alpha)}_i \geq g^{(\alpha)}_{\text{min}} \right) \right\} \\ \label{eq:J-no-peakify} J^{(\alpha)}_{\text{spatial}} &= \left\{i \; \middle| \; \left( \hat{\hat{y}}^{(\alpha)}_i \geq \theta^{(\alpha)}_{\text{spatial}} \right) \, \land \, \left( \bar{g}^{(\alpha)}_i \geq g^{(\alpha)}_{\text{min}} \right) \right\} \end{align} is used to define $I^{(\alpha)}$ or $J^{(\alpha)}_{\text{spatial}}$. If no spatial smoothing is employed, we can use $\hat{k}^{(\alpha)}_i$ defined by \begin{equation} \label{eq:khat} \hat{k}^{(\alpha)}_i = \frac{1}{2 \kappa^{(\alpha)}} \sum\limits_{j=i-\kappa^{(\alpha)}}^{i+\kappa^{(\alpha)}} \, \left(1 - \delta_{ij}\right) \, \text{max} \; \{k \leq k_{\text{max}} \mid x[s_j, s_{j}+k) = x[s_i, s_{i}+k)\} \end{equation} to estimate the characteristic length $\lfloor \hat{k}^{(\alpha)}_i \rceil$ (where $\lfloor \cdots \rceil$ indicates nearest integer) of the $k$-mer $x[s_i, s_i+\lfloor \hat{k}^{(\alpha)}_i \rceil)$ associated with the smoothing window centered on the suffix with sorted index $i$. We can then identify the set of $k$-mers reported by SArKS using parameter set $\alpha$ when not using spatial smoothing (i.e., when \R{spatialLength} $\!= \lambda^{(\alpha)} = 0$) by: \begin{equation} \label{eq:M} M^{(\alpha)} = \left\{x[s_i, s_i+\lfloor \hat{k}^{(\alpha)}_i \rceil) \mid i \in I^{(\alpha)} \right\} \end{equation} \subsection{\R{mergedKmerSubPeaks}} \label{sec:merged-kmer-subpeaks} <>= mergedPeaks = mergedKmerSubPeaks(sarks, filters, thresholds) @ The set of suffix array indices marking the left ends of merged $k$-mer subpeaks when spatial smoothing is employed is given by: \begin{equation} \label{eq:I-spatial} I^{(\alpha)}_{\text{spatial}} = \left\{i \; \middle| \; \left(\delta^{(\alpha)}_i < \lambda^{(\alpha)} \right) \, \land \, \left( \hat{y}^{(\alpha)}_i \geq \theta^{(\alpha)}_{\text{spatial}} \right) \, \land \, \left[ \, \left( \delta^{(\alpha)}_{\eta(i)} \geq \lambda^{(\alpha)} \right) \lor \left( \hat{y}^{(\alpha)}_{\eta(i)} < \theta^{(\alpha)}_{\text{spatial}} \right) \, \right] \right\} % check < instead of \leq above... \end{equation} where \begin{equation} \delta^{(\alpha)}_j = \underset{i}{\text{min}} \left\{ s_j - s_i \; \middle| \; \left( i \in J^{(\alpha)}_{\text{spatial}} \right) \,\land\, \left( s_i \leq s_j\right) \right\} \end{equation} is the distance from spatial position $s_j$ to the nearest element of $J^{(\alpha)}_{\text{spatial}}$ spatially left of $s_j$, and \begin{description} \item[$\delta^{(\alpha)}_i < \lambda^{(\alpha)}$]: requires the suffix $i$ to be spatially located within one of the identified MMDs, \item[$\hat{y}^{(\alpha)}_i \geq \theta^{(\alpha)}_{\text{spatial}}$]: requires the singly-smoothed score $\hat{y}^{(\alpha)}_i$ to be above the threshold $\theta^{(\alpha)}_{\text{spatial}}$ \end{description} and finally, we require either that: \begin{description} \item[$\delta^{(\alpha)}_{\eta(i)} \geq \lambda^{(\alpha)}$]: $s_i$ is at the beginning of MMD, or \item[$\hat{y}^{(\alpha)}_{\eta(i)} < \theta^{(\alpha)}_{\text{spatial}}$]: score of suffix immediately spatially prior to $s_i$ is below threshold $\theta^{(\alpha)}_{\text{spatial}}$---idea here is that if this condition is not met we want to merge suffix at $s_i$ into suffix initiated spatially to left. \end{description} Once we have defined $I^{(\alpha)}_{\text{spatial}}$, SArKS estimates the associated $k$-mer lengths for merged spatially smoothed suffix array index peaks $i \in I^{(\alpha)}_{\text{spatial}}$ by \begin{equation} \label{eq:spatial-merged-k} \hat{\hat{k}}^{(\alpha)}_i = \underset{j}{\operatorname{max}}\left\{ \hat{k}^{(\alpha)}_j + s_j - s_i \; \middle| \; s \in [s_i, s_j] \implies \hat{y}^{(\alpha)}_{i_s} \ge \theta^{(\alpha)}_{\text{spatial}} \right\} \end{equation} where $\hat{k}^{(\alpha)}_j$ is defined by equation \eqref{eq:khat} and the condition $s \in [s_i, s_j] \implies \hat{y}^{(\alpha)}_{i_s} \ge \theta^{(\alpha)}_{\text{spatial}}$ requires that all spatial positions $s$ between $s_i$ and $s_j$ (including both $s_i$ and $s_j$) have smoothed scores $\hat{y}_{i_s}$ exceeding $\theta^{(\alpha)}_{\text{spatial}}$ (and should hence be merged together). The resulting motif set is then defined by: \begin{equation} \label{eq:m-spatial} M^{(\alpha)}_{\text{spatial}} = \left\{x\!\!\left[s_i, \, s_i+\left\lfloor \hat{\hat{k}}^{(\alpha)}_i \right\rceil\right) \; \middle| \; i \in I^{(\alpha)}_{\text{spatial}} \right\} \end{equation} \subsection{\R{estimateFalsePositiveRate}} <>= fpr = estimateFalsePositiveRate(sarks, reps=250, filters=filters, thresholds=thresholds, seed=123456) @ The false positive rate associated with a given set of parameters $\left\{ \left(\kappa^{(\alpha)}, \lambda^{(\alpha)}, g^{(\alpha)}_{\text{min}}\right) \right\}$ and thresholds $\left\{ \left( \theta^{(\alpha)}, \theta^{(\alpha)}_{\text{spatial}} \right) \right\}$ is estimated by \begin{itemize} \item generating a second (independent) permutation set $\{\pi'_r\}$, \item for each combination of parameters $\alpha$, use equation \eqref{eq:yhat-max} replacing $\pi_r$ with $\pi'_r$ to calculate $\hat{y}\,'^{(\alpha)}_{\text{max}}$ and equation \eqref{eq:ydoublehat-max} similarly to calculate $\hat{\hat{y}}\,'^{(\alpha)}_{\text{max}}$ (if $\lambda^{(\alpha)} > 1$; otherwise, take $\hat{\hat{y}}\,'^{(\alpha)}_{\text{max}} = \infty$). \end{itemize} The set of reps yielding false positive hits is calculated according to: \begin{equation} \label{eq:fpr-binomial} \text{false positives} = \left\{ \, r \; \middle| \; \bigvee_\alpha \left[ \, \left( \hat{y}\,'^{(\alpha)}_{\text{max}} \geq \theta^{(\alpha)} \right) \,\land\, \left( \hat{\hat{y}}\,'^{(\alpha)}_{\text{max}} \geq \theta^{(\alpha)}_{\text{spatial}} \right) \, \right] \right\} \end{equation} where we again take either $\theta^{(\alpha)}=-\infty$ when $\lambda^{(\alpha)}=0$ or $\theta^{(\alpha)}_{\text{spatial}}=-\infty$ when $\lambda^{(\alpha)}>1$, so that, depending on the value of $\lambda^{(\alpha)}$, one or the other of the two logical expressions inside the square brackets in equation \eqref{eq:fpr-binomial} is trivially true for each $\alpha$. The false positive rate can then be estimated by comparing the number of false positive results with the number \R{reps} of permutations $\pi'_r$ generated. SArKS uses \R{binom::binom.exact} to estimate confidence intervals for the false positive rate according to the Pearson-Klopper method. \emph{Note regarding random number generator seeds}: In order that the permutation set $\{\pi'_r\}$ be independent of the initial permutation set $\{\pi_r\}$ used to select thresholds $\left\{ \left( \theta^{(\alpha)}, \theta^{(\alpha)}_{\text{spatial}} \right) \right\}$, you should make sure that you do not repeat the same seed for the random number generator in the calls to \R{permutationDistribution} and \R{estimateFalsePositiveRate}. \section{Session Info} <>= sessionInfo() @ \section{Notation glossary} \label{sec:equation-glossary} \small \begin{tabular}[H!]{rp{113.5mm}} $\lfloor r \rceil$ & nearest integer to real number $r$ \\ \hline $u * v$ & concatenation of strings $u$ and $v$ \\ $|u|$ & length of string $u$ \\ $u[i, j)$ & substring of $u$ starting at $i^{\text{th}}$ character (inclusive) and continuing up until $j^{\text{th}}$ character (exclusive) using 0-based indexing \\ $u[i, j]$ & substring of $u$ starting at $i^{\text{th}}$ character (inclusive) and continuing through the $j^{\text{th}}$ character (inclusive) using 0-based indexing \\ \hline $\{a \, | \, C(a)\}$ & set containing all elements $a$ satisfying condition $C(a)$ \\ $\{a \, | \, C_1(a) \, \land \, C_2(a) \}$ & set containing all elements $a$ satisfying conditions $C_1(a)$ and $C_2(a)$ \\ $\{a \, | \, C_1(a) \, \lor \, C_2(a) \}$ & set containing all elements $a$ satisfying conditions $C_1(a)$ or $C_2(a)$ \\ \hline $\mathbb{E}[A]$ & expectation value of random variable $A$ \\ $\mathbb{V}[A]$ & variance of random variable $A$ \\ \hline $i$ & suffix array index: 0-based position of suffix in lexicographically sorted list of all suffixes of string $x$ \\ $s_i$ & suffix array value: 0-based spatial position of suffix with suffix array index $i$ within string $x$ \\ $i_s$ & suffix array index $i$ associated with spatial position $s$ \\ $b_i$ & block array value: 0-based position of block/word in which suffix with suffix array index $i$ begins \\ $\omega_i$ & 0-based position of suffix with suffix array index $i$ within block $b_i$ \\ \hline $\hat{y}_i$ & kernel smoothed score associated with suffix array index $i$ \\ $\kappa$ & half-width of kernel applied to generate $\hat{y}_i$ \\ $\hat{k}_i$ & estimate of smoothed $k$-mer length at suffix array index $i$ \\ $\eta(i)$ & negative spatial shift operator defined by $\eta(i)=i_{s_i-1}$ \\ $\rho(i)$ & positive spatial shift operator defined by $\rho(i)=i_{s_i+1}$ \\ $\theta$ & threshold value for $\hat{y}_i$ for sequence-smoothed peak calling \\ $I$ & set of suffix array indices identified as peaks by SArKS \\ $M$ & set of $k$-mer motifs derived from suffix array peak set \\ \hline $f^{(i)}_b$ & weighted frequency of block/word $b$ within smoothing window centered on suffix array index $i$ \\ $g_i$ & Gini impurity of smoothing window centered on suffix array index $i$ \\ $g_{\text{min}}$ & minimum value of smoothing window Gini impurity for inclusion in peak set $I$ \\ $\bar{g}_i$ & spatially-averaged Gini impurity over spatial window starting at position $s_i$ \\ $\bar{g}_{\text{min}}$ & minimum value of spatially-averaged Gini impurity for inclusion in peak set $J_{\text{spatial}}$ \\ \hline $\hat{\hat{y}}_i$ & spatially smoothed score associated with suffix array index $i$ \\ $\lambda$ & length of spatial kernel applied to generate spatially smoothed scores $\hat{\hat{y}}_i$ \\ $\hat{\hat{k}}_i$ & estimate of merged $k$-mer length at suffix array index $i$ \\ $\theta_{\text{spatial}}$ & threshold value for $\hat{\hat{y}}_i$ to call significant spatial windows \\ $J_{\text{spatial}}$ & set of suffix array indices identified as spatial window starting positions \\ $I_{\text{spatial}}$ & set of suffix array indices identified as $k$-mer starting positions using spatial smoothing \\ $M_{\text{spatial}}$ & set of $k$-mer motifs derived from suffix array index set $I_{\text{spatial}}$ using spatial smoothing \\ \hline $\pi$ & permutation of $n$ blocks/words \\ $\Pi$ & random variable representing randomly generated permutation \\ $\hat{y}^{(\pi)}_i$ & sequence smoothed scores calculated with word scores permuted by $\pi$ \\ $\hat{\hat{y}}^{(\pi)}_i$ & spatially smoothed scores calculated with word scores permuted by $\pi$ \end{tabular} \pagebreak \small \bibliographystyle{abbrvnat} \bibliography{sarks} \vspace{10cm} \addcontentsline{toc}{section}{References} \end{document}