%\VignetteIndexEntry{IVAS : Identification of genetic Variants affecting Alternative Splicing} %\VignettePackage{IVAS} \documentclass[a4paper,11pt]{article} \usepackage{graphicx} \usepackage{a4wide} \graphicspath{{./result/}} <>= BiocStyle::latex(use.unsrturl=FALSE) @ \title{\Rpackage{IVAS} : Identification of genetic Variants affecting Alternative Splicing} \author{Seonggyun Han and Sangsoo Kim} \begin{document} \SweaveOpts{concordance=TRUE} \maketitle \tableofcontents \section{Introduction} Alternative splicing controls relative expression ratios of mature mRNA isoforms from a single gene. Mapping studies of Splicing Quantitative Trait Loci (SQTL), a genetic variant affecting the alternative splicing, are important steps to understand gene regulations and protein activity \cite{KeyanZhao}. We present an effective and user-friendly computational tool to detect SQTLs using transcript expression data from RNA-seq and genotype data, both measured on the same sample. As RNA sequencing (RNA-seq) provides insight into relatively precise measurments of expression level of transcript isoforms from a gene, it is a useful tool to analyze complicated biological phenomenon of RNA transcripts including the alternative splicing \cite{JosephK}. The mapping analysis uses two statistical models : Linear regression model \cite{Breslow} and/or Generalized linear mixed model \cite{Chambers}. \section{The input data set} The next subsection introduces the input data. To run this tool, two experimental data sets (an expression data frame from RNA-seq and a genotype data frame) are required. Moreover, we also need a data frame for positions of SNP markers and GTF file for transcript models. As any other genome-wide analyses, it is recommended to use as many samples as possible, usually of population scale, in order to guarantee a statistically significant result. \subsection{The genotype data} The genotype data should be prepared as a simple matrix data. Each column represents an individual and its name should match that of the expression matrix described below (2.2) \begin{table}[h!] \begin{tabular}{c c c c c} &ind1&ind2&ind3&ind4\\ SNP1&AA&AA&AT&TT\\ SNP2&CG&CC&GG&CG\\ SNP3&TT&TT&AT&TT\\ \end{tabular} \end{table} \subsection{The expression data} The expression matrix must comprise expression values of transcripts from RNA-seq. We may obtain them by using alignment tools such as cufflinks. Each column represents an individual and its name should match that of the genotype matrix described above (2.1) \begin{table}[h!] \begin{tabular}{c c c c c} &ind1&ind2&ind3&ind4\\ transcript1&10.5&15.4&6.7&12.4\\ transcript2&6.4&7.2&4.5&9.2\\ transcript3&15.4&14.5&13.2&17.8\\ \end{tabular} \end{table} \subsection{The SNP marker position data} To search SNPs affecting alternative splicing, a data frame comprising genomic location of each SNP is required. It consists of following columns: SNP (SNP marker name), CHR(chromosome number), and locus(SNP position). \begin{table}[h!] \begin{tabular}{c c c} SNP&CHR&locus\\ SNP1&1&4964005\\ SNP2&1&23513047\\ \end{tabular} \end{table} \subsection{The transcripts model data} We need a reference GTF (General Feature Format) file including information about gene structures such as the positions of exons, introns, and transcripts of genes. The GTF file must be \Robject{TxDb} object from the \Biocpkg{GenomicFeatures} package \cite{Michael}. \section{The example dataset : data from Geuvadis RNA sequencing project of 1000 Genome samples} This example uses filtered data from an origin data generated by Geuvadis RNA sequencing project, available at \texttt{http://www.geuvadis.org/web/geuvadis/RNAseq-project} \cite{TuuliLappalainen}. The example expression data includes transcripts of 11 randomly selected genes. The genotype data comprises SNPs in those genes. \section{Loading data} For this analysis, you need to load the \Rpackage{IVAS} package, SNP data, expression data, SNP position data, and \Robject{TxDb} object from GTF. \\ Loading \Rpackage{IVAS} package : <>= library(IVAS) @ Loading expression data : <>= data(sampleexp) @ Loading SNP data : <>= data(samplesnp) @ Loading SNP position data : <>= data(samplesnplocus) @ Loading \Robject{TxDb} object : <>= sampleDB <- system.file("extdata","sampleDB", package="IVAS") sample.Txdb <- loadDb(sampleDB) @ If you want to create the \Robject{TxDb} object from a GTF file, you need to use the \Rfunction{makeTxDbFromGFF} function in the \Biocpkg{GenomicFeatures} package. \section{The \Robject{ASdb} obect} The \Robject{ASdb} object is a s4 type class object, and the object is used by the \Rpackage{IVAS} package to store the results from functions in this \Rpackage{IVAS} package. The functions of \Rpackage{IVAS} will save their results by adding a slot. Each slot contains a list object consisting of three elements named as "ES", "ASS", and "IR" for each alternatively splicing pattern type (i.e. ES, ASS, and IR means exon skipping, alternative splice site, and intron retention, respectively). \subsection{Searching alternatively spliced exons based on a reference transcript model.} The \Rfunction{Splicingfinder} function tabulates patterns of alternatively spliced exons. This needs the \Robject{TxDb} object from \Rfunction{makeTxDbFromGFF} by reading a reference GTF file for reference transcript models. The \Rfunction{Splicingfinder} function categorizes alternatively spliced exons into four types of AS patterns (i.e. exon skipping, alternative 3-prime splice site, alternative 5-prime splice site, and intron retention). The result will be saved in the "SplicingModel" slot of \Robject{ASdb}. \\ To use this function : <>= ASdb <- Splicingfinder(GTFdb=sample.Txdb,calGene=NULL,Ncor=1,out.dir=NULL) ASdb head(slot(ASdb,"SplicingModel")$"ASS") @ You are able to define only a single gene if the single gene is inputted. The first column, named by "Index", is a generally used as an identifier and commonly used in other functions of \Rpackage{IVAS}. \subsection{Estimating expression ratio of AS exons with a data set including FPKM values of transcripts} The \Rfunction{RatioFromFPKM} function calculates expression ratio between transcripts with and without alternatively spliced exons. First, \Rfunction{RatioFromFPKM} divides the isoforms from a single gene into two groups: transcripts with and without an alternatively spliced exon. Then, the ratio of the group totals of transcript FPKM values is calculated. The \Rfunction{RatioFromFPKM} requires expression data set of transcript FPKM values and \Robject{ASdb} with the "SplicingModel" slot. The result will be saved in the "Ratio" slot of \Robject{ASdb} <>= ASdb <- RatioFromFPKM(GTFdb=sample.Txdb,ASdb=ASdb,Total.expdata=sampleexp, CalIndex="ASS7",Ncor=1,out.dir=NULL) ASdb head(slot(ASdb,"Ratio")$"ASS") @ In this example, we will estimates ratio in the "ASS7" index among splicing models in \Robject{ASdb}. \subsection{Finding SQTLs} Using "SplicingModel" and "Ratio" slots in \Robject{ASdb} from \Rfunction{Splicingfinder} and \Rfunction{RatioFromFPKM}, respectively, the \Rfunction{sQTLsFinder} function can identifies significant SNPs associated with alternative splicing rate (ratio). The result will be saved in the "sQTLs" slot of \Robject{ASdb} <>= ASdb <- sQTLsFinder(ASdb=ASdb,Total.snpdata=samplesnp, Total.snplocus=samplesnplocus,method="lm",Ncor=1) ASdb head(slot(ASdb,"sQTLs")$"ASS") @ In this example, we will run the function with the linear regression model. \Rfunction{sQTLsFinder} shows chromosome numbers during mapping analysis. \section{Identification of SQTLs using multiple cores} \Rfunction{Splicingfinder}, \Rfunction{RatioFromFPKM}, and \Rfunction{sQTLsFinder} functions provide to use multi-thread through \Rfunction{foreach} function. The last argument "Ncor" of the functions denotes the number of threads. <>= ASdb <- Splicingfinder(GTFdb=sample.Txdb,calGene=NULL,Ncor=4) ASdb <- RatioFromFPKM(GTFdb=sample.Txdb,ASdb=ASdb,Total.expdata=sampleexp,Ncor=4) ASdb <- sQTLsFinder(ASdb=ASdb,Total.snpdata=samplesnp, Total.snplocus=samplesnplocus,method="lm",Ncor = 4) ASdb @ \section{Visualizing the result} To visualize the results into boxplot, the \Rpackage{IVAS} package provides the \Rfunction{saveBplot} function. Using the data frame from the output of \Rfunction{sQTLsFinder} function, \Rfunction{saveBplot} can make the boxplot. <>= saveBplot(ASdb=ASdb,Total.snpdata=samplesnp,Total.snplocus=samplesnplocus, CalIndex="ASS7",out.dir="./result") @ \begin{center} \includegraphics[width=0.6\textwidth]{rs3810232_ASS7_0_lm.png} \end{center} The output png files are saved in "result" folder. \section{Session Information} <>= sessionInfo() @ \begin{thebibliography}{9} \bibitem{KeyanZhao} Keyan Zhao, Zhi-xiang Lu, Juw Won Park, Qing Zhou, Yi Xing. 2013. GLiMMPS: robust statistical model for regulatory variation of alternative splicing using RNA-seq data. Genome Biol 14, R74. \bibitem{JosephK} Joseph K. Pickrell, John C. Marioni, Athma A. Pai, Jacob F. Degner, Barbara E. Engelhardt, Everlyne Nkadori, Jean-Baptiste Veyrieras, Matthew Stephens, Yoav Gilad, Jonathan K. Pritchard. 2010. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature 464, 768-722. \bibitem{Breslow} N.E. Breslow and D.G. Clayton. 1993. Approximate Inference in Generalized Linear Mixed Models. Journal of the American Statistical Association 88 421: 9-25. \bibitem{Michael} Michael Lawrence, et al. 2013. Software for Computing and Annotating Genomic Ranges. PLoS Comput Biol. 9(8): e1003118. \bibitem{Chambers} Chambers, J. M. 1992. Linear models. Chapter 4 of Statistical Models in S eds J. M. Chambers and T. J. Hastie, Wadsworth, and Brooks Cole. \bibitem{TuuliLappalainen} Tuuli Lappalainen, et al. 2013. Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501, 506-511. \end{thebibliography} \end{document}