%\VignetteIndexEntry{An Introduction to intansv} %\VignetteDepends{} %\VignetteKeywords{Structural variation} %\VignettePackage{intansv} \documentclass[10pt]{article} \textwidth=6.5in \textheight=8.5in %\parskip=.3cm \oddsidemargin=-.1in \evensidemargin=-.1in \headheight=-.3in \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\program}[1]{\textsf{#1}} \newcommand{\R}{\program{R}} \newcommand{\intansv}{\Rpackage{intansv}} \usepackage{hyperref} \usepackage[authoryear,round]{natbib} \usepackage{times} \bibliographystyle{plainnat} \title{An Introduction to \intansv{}} \author{Wen Yao} \date{\today} \begin{document} \SweaveOpts{concordance=TRUE} \maketitle \tableofcontents <>= options(width=72) @ \section{Introduction} Currently, dozens of programs have been developped to predict structural variations (SV) utilizing next-generation sequencing data. However, the prediction of multiple methods have to be integrated to get relatively reliable results \citep{Drosophila}. The \intansv{} package is designed for integrating results of different methods, annotating effects caused by SVs to genes and its elements, displaying the genomic distribution of SVs and visualizing SVs in specific genomic region. In this vignette, we will rely on a simple, illustrative example dataset to explain the usage of \intansv{}. The \intansv{} package is available from bioconductor <>= if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install("intansv") @ <>= library("intansv") @ \section{Usage of \intansv{} with example dataset} The example dataset was obtained by running seven different programs with the alignment results of a public available NGS dataset (NCBI SRA accession number SRA012177) to the rice reference genome (\url{http://rice.plantbiology.msu.edu/}). These seven programs including BreakDancer \citep{BreakDancer}, Pindel \citep{Pindel}, CNVnator \citep{CNVnator}, DELLY \citep{DELLY}, Lumpy \citep{Lumpy}, softSearch \citep{softSearch}, and SVseq2 \citep{svseq} were run with default parameters. The predicted SVs of chromosomes, \emph{chr05} and \emph{chr10}, were provided in the example dataset. \subsection{Read in predictions of different programs} Most of the predictions of programs are tedious and need to be formatted for further analysis. \intansv{} provides functions to read in the predictions of five popular programs, BreakDancer, Pindel, CNVnator, DELLY and SVseq2. During this process, the SV predictions with low quality would be filtered out and overlapped predictions would be resolved. We begin our discussion by reading in some example data. <>= breakdancer.file.path <- system.file("extdata/ZS97.breakdancer.sv", package="intansv") breakdancer.file.path breakdancer <- readBreakDancer(breakdancer.file.path) str(breakdancer) @ BreakDancer is able to predict deletions and inversions. The prediction of all type of SVs were provided as a single file by BreakDancer. You can feed this file to \Rfunction{readBreakdancer} and it will do all the tedious work for you. <>= cnvnator.dir.path <- system.file("extdata/cnvnator", package="intansv") cnvnator.dir.path cnvnator <- readCnvnator(cnvnator.dir.path) str(cnvnator) @ CNVnator is able to predict deletions and duplications. The final output of CNVnator usually contains several files and each file is the output for a specific chromosome. You should put all these files in the same directory and feed the path of this directory to \Rfunction{readCnvnator}. Then it will do all the jobs for you. However, the directory given to \Rfunction{readCnvnator} should only contain the final output files of CNVnator. See the example dataset for more details. <>= svseq.dir.path <- system.file("extdata/svseq2", package="intansv") svseq.dir.path svseq <- readSvseq(svseq.dir.path) str(svseq) @ SVseq2 is able to predict deletions. The final output of SVseq2 contains several files and each file is the output for a specific chromosome. What's more, different type of SVs were written to different files. You should put all these files in the same directory and feed the path of this directory to \Rfunction{readSvseq}. However, the files contain the predicted deletions in this directory should be named with suffix of \emph{.del}. See the example dataset for more details. <>= delly.file.path <- system.file("extdata/ZS97.DELLY.vcf", package="intansv") delly.file.path delly <- readDelly(delly.file.path) str(delly) @ DELLY is able to predict deletions, inversions and duplications. The prediction of all type of SVs were provided as a single VCF file by DELLY. You can feed this file to to \Rfunction{readDelly} and it will do all the tedious work for you. See the example dataset for more details. <>= pindel.dir.path <- system.file("extdata/pindel", package="intansv") pindel.dir.path pindel <- readPindel(pindel.dir.path) str(pindel) @ Pindel is able to predict deletions, inversions and duplications. The final prediction for different chromosomes given by Pindel were written to different files. And different type of SVs were written to different files. You should put all these files in the same directory and feed the path of this directory to \Rfunction{readPindel}. However, the files contain the predicted deletions, inversions and duplication in this directory should be named with suffix of \emph{\_D}, \emph{\_INV} and \emph{\_TD} respectively. See the example dataset for more details. <>= lumpy.file.path <- system.file("extdata/ZS97.LUMPY.vcf", package="intansv") lumpy.file.path lumpy <- readLumpy(lumpy.file.path) str(lumpy) @ Lumpy is able to predict deletions, inversions and duplications. The prediction of all type of SVs were provided as a single VCF file by Lumpy. You can feed this file to \Rfunction{readLumpy} and it will do all the tedious work for you. <>= softSearch.file.path <- system.file("extdata/ZS97.softsearch", package="intansv") softSearch.file.path softSearch <- readSoftSearch(softSearch.file.path) str(softSearch) @ softSearch is able to predict deletions, inversions and duplications. The prediction of all type of SVs were provided as a single file by softSearch. You can feed this file to \Rfunction{readSoftSearch} and it will do all the tedious work for you. The results of these seven programs have now been read into R and storaged as R data structure of lists with different compoents representing different types of SVs. We only care about three types of SVs, deletions, duplications and inversions. \subsection{Integrate predictions of different programs} To get more reliable results, we need to integrate the predictions of different programs. See our paper for more details on how \intansv{} integrate the predictions of different programs. <>= sv_all_methods <- methodsMerge(breakdancer, pindel, cnvnator, delly, svseq) str(sv_all_methods) @ The integrated results contain 897 deletions, 13 duplications and 12 inversions. However, it's not necessary for you to feed \intansv{} the output of all five programs. The following example shows the integration of four programs: BreakDancer, CNVnator, DELLY and SVseq2. <>= sv_all_methods.nopindel <- methodsMerge(breakdancer,cnvnator,delly,svseq) str(sv_all_methods.nopindel) @ \intansv{} is also able to integrate SV predictions of other programs. However, predictions of other programs should be provided as a data frame with six columns: \begin{itemize} \item{chromosome}{\qquad the chromosome ID of a SV} \item{pos1}{\qquad the start coordinate of a SV} \item{pos2}{\qquad the end coordinate of a SV} \item{size}{\qquad the size of a SV} \item{type}{\qquad the type of a SV} \item{methods}{\qquad the program used to get this SV prediction} \end{itemize} \subsection{Annotate the effects of SVs} To annotate the effects of SVs to genes, we need the genomic annotation file. To avoid confusion, this file should be a plain text file with 6 columns, the chromosome ID, tag of each genome element, start coordinate, end coordinate, strand, ID of each genome element. An example genomic annotation file has been storaged in the example dataset of \intansv{}. <>= anno.file.path <- system.file("extdata/chr05_chr10.anno.txt", package="intansv") anno.file.path msu_gff_v7 <- read.table(anno.file.path, head=TRUE, as.is=TRUE) head(msu_gff_v7) @ <>= sv_all_methods.anno <- llply(sv_all_methods,svAnnotation, genomeAnnotation=msu_gff_v7) names(sv_all_methods.anno) head(sv_all_methods.anno$del) @ Since the function \Rfunction{svAnnotation} only accept structural variations of the data frame format, we need to apply this function to each component of \emph{sv\_all\_methods}, which is a list. \subsection{Display the genomic distribution of SVs} Now, let's get a genomic view of SVs. We first split chromosomes into windows of 1 Mb and then display the number of SVs in each window as circular barplot. We also need the length of each chromosome, which was storaged in the example dataset of \intansv{}. % <>= genome.file.path <- system.file("extdata/chr05_chr10.genome.txt", package="intansv") genome.file.path genome <- read.table(genome.file.path, head=TRUE, as.is=TRUE) plotChromosome(genome, sv_all_methods,1000000) @ \begin{figure}[!htb] \begin{center} \includegraphics[width=0.8\textwidth]{intansvOverview-genomic_distribution} \caption{\label{genomic_distribution}% Visualization of SVs distribution in the whole genome. Blue: deletions, Red: duplications, Green: inversions.} \end{center} \end{figure} \subsection{Visualize SVs in specific genomic region} We could also visualize SVs in specific genomic region. Here, we also need the genomic annotation file. % <>= head(msu_gff_v7, n=3) plotRegion(sv_all_methods,msu_gff_v7, "chr05", 1, 200000) @ \begin{figure}[!htb] \begin{center} \includegraphics[width=0.7\textwidth]{intansvOverview-region_visualization} \caption{\label{region_visualization}% SVs in genomic region chr05:1-200000. Blue: genes, Red: deletions, Green: duplications, Purple: inversions. } \end{center} \end{figure} This command showed the SVs in the genomic region \emph{chr05:1-200000}. The genes and SVs were shown as circular rectangles with different color. \pagebreak[4] \section{Session Information} The version number of R and packages loaded for generating the vignette were: <>= sessionInfo() @ \bibliography{intansvOverview} \end{document}