\documentclass[a4paper]{article} \usepackage{Sweave, amssymb, amsmath, url} \title{The \emph{genomeIntervals} package} \author{Julien Gagneur} \date{20 June 2009} % The is for R CMD check, which finds it in spite of the "%", and also for % automatic creation of links in the HTML documentation for the package: % \VignetteIndexEntry{Overview of the genomeIntervals package.} \begin{document} %%%%%%%% Setup % Don't reform code \SweaveOpts{keep.source=TRUE} % Size for figures \setkeys{Gin}{width=.75\textwidth} % Reduce characters per line in R output <>= options( width = 80 ) @ % Make title \maketitle % Typesetting commands \newcommand{\Z}{\mathbb{Z}} %%%%%%%% Main text \section{Introduction} Genomic intervals arise in many contexts, such as genome sequence annotations (exons, introns, promoters, etc.) or experimental results of genomic studies (transcripts, ChIP-on-chip enriched regions, etc.). Often, operations over collections of genomic intervals --- such as merging, overlap or non-overlap detection, or the computation of distances between intervals --- are needed. The \emph{genomeIntervals} package provides tools for this. It relies on the package \emph{intervals}, which works with general numerical intervals, and provide wrappers for most of its functions, making them easy to use in a genomic context. \section{Genome intervals classes} We think of genomic sequences as sequences of nucleotides. These intervals are mathematically represented as intervals over the integers, $\Z$, with all possible types of left and right closure permitted. (See the example which follows.) The S4 class \texttt{Genome\_intervals} represents a collection of genomic intervals by extending the class \texttt{Intervals\_full} from the \emph{intervals} package. Each genome interval has a \texttt{seq\_name} that represents its chromosome or, more generally, its sequence of origin. The S4 class \texttt{Genome\_intervals\_stranded} represents genomic intervals which are strand specific. Below, we load and show the \texttt{Genome\_intervals\_stranded} object \texttt{i}, a toy example provided with the dataset \texttt{gen\_ints}. <>= library( genomeIntervals ) data("gen_ints") i @ \texttt{Genome\_intervals} can have \texttt{rownames} (e.g., \texttt{"i.gene.1"}), which behave in the same way as \texttt{matrix} rownames: they are not mandatory and need not be unique or be supplied for all rows. The left and right end points of each interval can be accessed and modified using standard column subsetting. Their closure status can be accessed and modified similarly, using the \texttt{closed} accessor. <>= i[,1] i[,2] closed(i) closed(i)[2,2] <- FALSE @ Closure status can be adjusted quickly for all intervals in an object by supplying only two values. In this case, the two values are assumed to correspond to the left and right end points. (This is not R's standard recycling behavior, but is far more useful here.) <>= i2 <- i closed(i2) <- c( TRUE, FALSE ) i2 @ Sequence name and strand data can be accessed and modified with the \texttt{seq\_name} and \texttt{strand} accessors: <>= seqnames(i) strand(i) strand(i)[2] <- "-" @ Objects can be combined using \texttt{c}. Subsetting by row returns objects of the same class as the original object. Below we use the \texttt{Genome\_intervals\_stranded} object \texttt{j}, also provided with the dataset \texttt{gen\_ints}. <>= j c( i[1:3,], j[1:2,] ) @ The slot \texttt{annotation} is a \texttt{data.frame} that stores \texttt{seq\_name}, \texttt{strand} and the \texttt{inter\_base} logical vector (explained later). Additional columns may be added for extra, user-defined annotation of the intervals. Subsetting annotated objects does what it should: <>= annotation(i) annotation(i)$myannot = rep( c("my", "annot"), length=nrow(i) ) annotation(i[2:3,]) @ Columns of the slot \texttt{annotation} can be directly accessed and modified via the \texttt{[[} and the \texttt{\$} operators. <>= i$myannot i[["myannot"]] @ The \texttt{close\_intervals} method returns a representation which is adjusted to have closed left and right end points, standardizing results. Note that \texttt{close\_intervals} does not change the content of the intervals, only their representation. The companion methods \texttt{open\_intervals} and \texttt{adjust\_closure}, also imported from the package \texttt{intervals}, permit the other transformations. <>= close_intervals(i) @ We define the \emph{size} of a genomic interval to be the number of bases it contains. <>= size(i) @ Constructing a \texttt{Genome\_intervals} or a \texttt{Genome\_intervals\_stranded} from scratch is done by a call to \texttt{new} providing the matrix of end points, the matrix of closures (or faster a single value as shown below) and the \texttt{annotation} data frame. For \texttt{Genome\_intervals\_stranded}, make sure the \texttt{strand} column of the \texttt{annotation} data frame is a factor with two levels. <>= new( "Genome_intervals_stranded", matrix(c(1, 2, 2, 5), ncol = 2), closed = TRUE, annotation = data.frame( seq_name = factor(c("chr01","chr02")), inter_base = FALSE, strand = factor( c("+", "+"), levels=c("+", "-") ) ) ) @ \section{Overlap and set operations} \subsection{Overlap} The \texttt{interval\_overlap} method identifies, for each interval of the \texttt{from} object, all overlapping intervals from the \texttt{to} object. It works by \texttt{seq\_name} and (if applicable) \texttt{strand}. <>= interval_overlap( from=i, to=j ) @ \subsection{Set operations} The \emph{interval-union}, obtained by a call to \texttt{interval\_union}, is defined as the union of all intervals of a \texttt{Genome\_intervals} object, computed by strand and sequence name. Note that the output of \texttt{interval\_union} does not include row names or any extra annotation beyond sequence name and strand, since the union process typically combines intervals, making old annotation inappropriate. <>= interval_union(i) @ The intersection between intervals of two (or more) \texttt{Genome\_intervals} objects can be obtained by using \texttt{interval\_intersection}; the complement of a \texttt{Genome\_intervals} object can be obtained by using \texttt{interval\_complement}. (\emph{Note:} \texttt{interval\_complement} currently resorts to \texttt{-Inf} and \texttt{Inf} for outer end points. This function will be improved in a future release to utilize known chromosome extents.) <>= interval_intersection(i,j) interval_complement(j[1:2,]) @ \section{Distance} The function \texttt{distance\_to\_nearest} gives the distance from each \texttt{from} interval to the nearest \texttt{to} interval. The absolute distance is returned; signed distance (reporting whether the nearest is in 5' or 3' direction, or producing a result for both directions) is left for a future release. <>= distance_to_nearest(i,j) @ Note that the distance to nearest of a \texttt{from} interval equals $0$ if and only if at least one of the \texttt{to} intervals overlaps with it. \section{Inter-base intervals} It is sometimes useful to define positions \emph{between} two nucleotides --- to represent, for example, an insertion point or an enzyme restriction site. (The GFF format provides support for such positions.) To deal with this, we consider two types of positions along genomic sequences. The \emph{base} positions are directly at the nucleotides; all examples shown so far in this vignette deal with base positions. The \emph{inter-base} positions, on the other hand, fall between two consecutive nucleotides. Specifically, we define an inter-base position at $i$ to lie between bases $i$ and $i+1$. We then consider two types of genomic intervals: base intervals (composed of consecutive base positions) and inter-base intervals (consecutive inter-bases). The object \texttt{k} of the data set \texttt{gen\_ints} contains both base and inter-base intervals. Inter-base intervals are indicated in the display with an asterisk. The inter-base status of a given interval can be retrieved and modified using the \texttt{inter\_base} accessor function. In the next example, inter-base intervals represent two insertion points, between bases 1 and 2, and between bases 8 and 9. <>= k inter_base(k) k[inter_base(k),] @ Because size is defined as the number of bases an interval contains, size is by definition 0 for all inter-base intervals. <>= size(k) @ Base and inter-base intervals can overlap: <>= interval_overlap(j,k) @ The distance between a base and the inter-base on either side is defined to be $0.5$. Thus, distances from one base to another base, or from one inter-base to another inter-base, are integer valued; distances from a base to an inter-base, on the other hand, are half-integers. Note that overlapping intervals of any type are at distance $0$ from each other. <>= distance_to_nearest(j,k) @ Set operations are computed for the base and the inter-base intervals independently, preserving the distinction between the two interval types: <>= interval_union(k) interval_intersection(k,j) interval_complement(k[1:2,]) @ \section{Reading and handling GFF files} Files in the GFF3 format can be loaded by the function \texttt{readGFF3}. The companion functions \texttt{parseGffAttributes} and \texttt{getGffAttribute} provide parsing facilities for GFF attributes. To demonstrate these functions, the package \emph{genomeIntervals} comes with a simplified GFF file derived from the yeast genome database SGD (\url{http://yeastgenome.org}). <>= libPath <- installed.packages()["genomeIntervals", "LibPath"] filePath <- file.path( libPath, "genomeIntervals", "example_files" ) gff <- readGff3( file.path( filePath, "sgd_simple.gff" ), isRightOpen=FALSE,quiet=TRUE ) idpa = getGffAttribute( gff, c( "ID", "Parent" ) ) head(idpa) @ \section{Session information} @ <>= si <- as.character( toLatex( sessionInfo() ) ) cat( si[ -grep( "Locale", si ) ], sep = "\n" ) @ \end{document}