%\VignetteIndexEntry{SAGEnhaft} %\VignetteKeyword{annotation} %\VignettePackage{SAGEnhaft} \documentclass[12pt]{article} \usepackage{hyperref} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \begin{document} \author{Tim Beissbarth} \title{Introduction to SAGEnhaft} \maketitle \section{Overview} Serial Analysis of Gene Expression (SAGE) is a gene expression profiling technique that estimates the abundance of thousands of gene transcripts (mRNAs) from a cell or tissue sample in parallel (Velculescu, 1995). SAGE is based on the sequencing of short sequence tags that are extracted at defined positions of the transcript. As opposed to DNA microarray technology SAGE does not require prior knowledge of the transcripts, and results in an estimate of the absolute abundance of a transcript. However, due to sequencing errors a proportion of the low abundance tags do not represent real genes altering the ability of SAGE to estimate the number of transcripts that have been observed. Moreover, loss of ``true''-tags due to sequencing errors will result in altered numbers for the abundance of genuine transcripts. \section{SAGE} Briefly, SAGE works as follows: RNAs from either cells or tissues are converted to double stranded cDNA which is anchored to a solid phase at the 3'~end. The double stranded cDNA is then cleaved with a restriction endonuclease at a 4~bp recognition sequence, most commonly CATG. The 3'~ends of these cDNA fragments are collected and are then divided into two populations and ligated to linkers containing a type IIS restriction endonuclease recognition sequence, where the enzyme cleaves up to 20~bp away from their recognition site. The two populations are ligated together and amplified by PCR, resulting in two tags orientated tail to tail with an anchoring enzyme recognition site at either end. Two types of SAGE libraries are commonly used, generating tags of different length, i.e. 10~base and 17~base tags respectively, depending on the enzyme used. For protokolls see http://www.sagenet.org/sage\_protocol.htm. \section{Base-calling and extraction of SAGE tags} SAGE libraries are generated from between 1,000 to 5,000 sequenced clones, with each sequence run consisting of up to 40 tags. Automated sequencers generate a four-color chromatogram showing the results of the sequencing gel. These chromatograms are read by the Phred or ABI software to call bases and assign an error estimate for each base. These two base-calling programs, the open source program Phred and the ABI KB basecaller, distributed with the ABI 3730 sequencing machines (http://www.appliedbiosystems.com), both assign a quality score to each sequenced base (Ewing 1998). The quality score is given as $-10 \log_{10} P_e$, where $P_e$ is the probability of a base-calling error. The resulting Phred or ABI files are read by functions implemented in this package which subsequently extract the ditags and tags between the anchoring enzyme sites (CATG) in the sequence, keeping the error scores with each base. Ditags have to be within a specified length range, e.g. 20-24 bases for 10 base tags or 32-38 bases for 17 base tags. Duplicate ditags are removed to reduce possible PCR bias, keeping the ditag with the highest average sequencing quality. Tag sequences with a low average sequence quality ($\le 10$) are also removed. From experimental SAGE libraries usually 20,000-100,000 tag sequences are generated. \section{Sequence Error correction} Sequencing errors may bias the gene expression measurements made by SAGE. They may introduce non-existent tags at low abundance and decrease the real abundance of other tags. These effects are increased in the longer tags generated in LongSAGE libraries. Current sequencing technology generates quite accurate estimates of sequencing error rates. Here we make use of the sequence neighborhood of SAGE tags and error estimates from the base-calling software to correct for such errors. We introduce a statistical model for the propagation of sequencing errors in SAGE and suggest an Expectation-Maximization (EM) algorithm to correct for them given observed sequences in a library and base-calling error estimates. For details see: Statistical modeling of sequencing errors in SAGE librarie, \textbf{Beissbarth T, Hyde L, Smyth GK, Job C, Boon WM, Tan SS, Scott HS, Speed TP}, Bioinformatics; 7.2004; 20(ISMB Supplement), in press. \section{Comparison of SAGE libraries} SAGE tags are assessed for differential expression between two SAGE libraries by computing Fisher's Exact test for each unique tag. If a particular tag has count $n_A$ in library A and count $n_B$ in library B, and if the total number of sequences counted is $t_A$ for library A and $t_B$ for library B, then Fisher's Exact test is computed to test for independence in the $2 \times 2$ contingency table with counts $n_A$, $n_B$, $t_A - n_A$ and $t_B - n_B$. This results in a $p$-value for the null hypothesis of no differential expression for each gene. Since the tests for different tags are almost independent, the method of Benjamini and Hochberg (1995) was used to control the false discovery rate (fdr). Fisher's Exact test has been found to be slow to compute but an exact binomial test proved to be an excellent approximation when $t_A$ and $t_B$ are large and large relative to $n_A$ and $n_B$, as they are for typical SAGE libraries. This test is defined similarly to Fisher's Exact test but with binomial probabilities replacing the hypergeometric probabilities. We used a vectorized version of the binomial exact test to allow rapid computation for complete libraries. By analogy with microarray analysis the relative difference of a tag between two libraries is summarized by an $M$ value, which is calculated as $\log_2(n_A+0.5)+\log_2(t_B-n_B+0.5)-\log_2(n_B+0.5)-\log_2(t_A-n_A+0.5)$, and the mean absolute expression is summarized as an $A$ value, which is calculated as $0.5 (\log_2(n_A (t_A+t_B)/2t_A + 0.5) + \log_2(n_B(t_A+t_B)/2t_B +0.5))$. We call changes with a fdr of less than 0.1 significant. \section{Example} The E15 library was generated from posterior cortex of embryonic C57/BL6 mice at stage E15.5. The B6Hypothal library was generated from hypothalamus of 8 week old C57/BL6 mice. <<>>= library(sagenhaft) @ \begin{Schunk} \begin{Sinput} > file.copy(system.file("extdata", "E15postHFI.zip", package="sagenhaft"), + "E15postHFI.zip") > E15post<-extract.lib.from.zip("E15postHFI.zip", taglength=10, + min.ditag.length=20, max.ditag.length=24) \end{Sinput} \end{Schunk} <<>>= E15post <- read.sage.library(system.file("extdata", "E15postHFI.sage", package="sagenhaft")) E15post B6Hypo <- read.sage.library(system.file("extdata", "B6HypothalHFI.sage", package="sagenhaft")) libcomp <- compare.lib.pair(B6Hypo, E15post) plot(libcomp) libcomp @ \begin{Schunk} \begin{Sinput} > testlib <- combine.libs(B6Hypo, E15post) > testlib <- estimate.errors.mean(testlib) > testlib <- em.estimate.error.given(testlib) > tagneighbors <- compute.sequence.neighbors(testlib$seqs[,"seq"], 10, + testlib$seqs[,paste("q", 1:10, sep="")]) \end{Sinput} \end{Schunk} \end{document}