%\VignetteIndexEntry{VariantFiltering: filter coding and non-coding genetic variants} %\VignetteKeywords{genetics, variants, annotation, filtering, inheritance} %\VignettePackage{VariantFiltering} \documentclass{article} <>= BiocStyle::latex(use.unsrturl=FALSE) @ \usepackage{natbib} \title{VariantFiltering: filtering of coding and non-coding genetic variants} \author{Dei M. Elurbe\,$^{1,2}$ and Robert Castelo\,$^{3,4}$} \begin{document} \SweaveOpts{eps=FALSE} \DefineVerbatimEnvironment{Sinput}{Verbatim} {formatcom = {\color{Sinput}}} \DefineVerbatimEnvironment{Soutput}{Verbatim} {formatcom = {\color{Soutput}}} \DefineVerbatimEnvironment{Scode}{Verbatim} {formatcom = {\color{Scode}}} \definecolor{Sinput}{rgb}{0.21,0.49,0.72} \definecolor{Soutput}{rgb}{0.32,0.32,0.32} \definecolor{Scode}{rgb}{0.75,0.19,0.19} <<>= pdf.options(useDingbats=FALSE) options(width=80) @ \maketitle \begin{quote} {\scriptsize \noindent $^{1}$CIBER de Enfermedades Raras (CIBERER), Barcelona, Spain \\ $^{2}$Present address: CMBI, Radboud University Medical Centre, Nijmegen, The Netherlands. \\ $^{2}$Department of Experimental and Health Sciences, Universitat Pompeu Fabra, Barcelona, Spain. \\ $^{3}$Research Program on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute, Barcelona, Spain. } \end{quote} \section{Overview} The aim of this software package is to facilitate the filtering and annotation of coding and non-coding genetic variants from a group of unrelated individuals, or a family of related ones among which at least one of them is affected by a genetic disorder. When working with related individuals, \Biocpkg{VariantFiltering} can search for variants from the affected individuals that segregate according to a particular inheritance model acting on autosomes (dominant, recessive homozygous or recessive heterozygous -also known as compound heterozygous) or on allosomes (X-linked), or that occur \emph{de novo}. When working with unrelated individuals, no mode of inheritance is used for filtering but it can be used to search for variants shared among individuals affected by a common genetic disorder. The main input is a multisample Variant Call Format (VCF) file which is parsed using the R/Bioconductor infrastructure, and particularly the functionality from the \Biocpkg{VariantAnnotation} package for that purpose. This infrastructure also allows \Biocpkg{VariantFiltering} to stream over large VCF files to reduce the memory footprint and to annotate the input variants with diverse functional information. A core set of functional data are annotated by \Biocpkg{VariantFiltering} but this set can be modified and extended using Bioconductor annotation packages such as \Biocpkg{MafDb.1Kgenomes.phase3.hs37d5}, which stores and exposes to the user minor allele frequency (MAF) values frozen from the latest realease of the 1000 Genomes project. The package contains a toy data set to illustrate how it works through this vignette, and it consists of a multisample VCF file with variants from chromosomes 20, 21, 22 and allosomes X, Y from a trio of CEU individuals of the 1000 Genomes project. To further reduce the execution time of this vignette, only the code for the first analysis is actually evaluated and its results reported. \section{Setting up the analysis} To start using \Biocpkg{VariantFiltering} the user should consider installing the packages listed in the \Rcode{Suggests:} field from its \Rcode{DESCRIPTION} file. After loading \Biocpkg{VariantFiltering} the first step is to build a parameter object, of class \Rclass{VariantFilteringParam} which requires at least a character string with the input VCF filename, as follows: <<>>= library(VariantFiltering) CEUvcf <- file.path(system.file("extdata", package="VariantFiltering"), "CEUtrio.vcf.bgz") vfpar <- VariantFilteringParam(CEUvcf) class(vfpar) vfpar @ The display of the \Rclass{VariantFilteringParam} object indicates a number of default values which can be overriden when calling the construction function. To quickly see all the available arguments and their default values we should type: <<>>= args(VariantFilteringParam) @ The manual page of \Rfunction{VariantFilteringParam} contains more information about these arguments and their default values. \section{Annotating variants} After setting up the parameters object, the next step is to annotate variants. This can be done using upfront an inheritance model that will substantially filter and reduce the number of variants and annotations or, as we illustrate here below, calling the function \Rfunction{unrelatedIndividuals()} that just annotates the variants without filtering out any of them: <>= if (VariantFiltering:::.getOS() == "windows") { uind <- VariantFiltering:::.loadPrecomputedVariantFilteringResults() } else uind <- unrelatedIndividuals(vfpar) ## ~75 sec. @ <>= uind <- unrelatedIndividuals(vfpar) @ <<>>= class(uind) uind @ The resulting object belongs to the \Rclass{VariantFilteringResults} class of objects, defined within \Biocpkg{VariantFiltering}, whose purpose is to ease the task of filtering and prioritizing the annotated variants. The display of the object just tells us the genome version of the input VCF file, the number of individuals, the inheritance model and what variant filters are activated. To get a summary of the number of variants annotated to a particular feature we should use the function \Rfunction{summary()}: <<>>= summary(uind) @ The default setting of the \Rfunction{summary()} function is to provide feature annotations in terms of Sequence Ontology (SO) terms. The reported number of variants refer to the number of different variants in the input VCF file annotated to the feature while the percentage of variants refers to the fraction of this number over the total number of different variants in the input VCF file. We can also obtain a summary based on the Bioconductor feature annotations provided by the functions \Rfunction{locateVariants()} and \Rfunction{predictCoding()} from the \Biocpkg{VariantAnnotation} package, as follows: <<>>= summary(uind, method="bioc") @ Since SO terms are organized hierarchically, we can use this structure to aggregate feature annotations into more coarse-grained SO terms using the argument \Robject{method="SOfull"}: <<>>= uindSO <- summary(uind, method="SOfull") uindSO @ Here the \Robject{Level} column refers to the shortest-path distance to the most general SO term \Robject{sequence\_variant} within the SO ayclic digraph. We can use this level value to interrogate the annotations on a specific granularity: <<>>= uindSO[uindSO$Level == 6, ] @ Variants are stored internally in a \Rclass{VRanges} object. We can retrieve the variants as a \Rclass{Vranges} object with the function \Rfunction{allVariants()}: <<>>= allVariants(uind) @ This function in fact returns a \Rclass{VRangesList} object with one element per sample by default. We can change the grouping of variants with the argument \Robject{groupBy} specifying the annotation column we want to use to group variants. If the specified column does not exist, then it will return a single \Rclass{VRanges object} with all annotated variants. Using the following code we can obtain a graphical display of a variant, including the aligned reads and the running coverage, to have a visual representation of its support. For this purpose we need to have the BAM files used to perform the variant calling. In this case we are using toy BAM files stored along with the \Biocpkg{VariantFiltering} package, which for practical reasons only include a tiny subset of the aligned reads. <>= path2bams <- file.path(system.file("extdata", package="VariantFiltering"), paste0(samples(uind), ".subset.bam")) bv <- BamViews(bamPaths=path2bams, bamSamples=DataFrame(row.names=samples(uind))) bamFiles(uind) <- bv bamFiles(uind) plot(uind, what="rs6130959", sampleName="NA12892") @ \begin{figure}[ht] \centerline{\includegraphics[width=0.7\textwidth]{usingVariantFiltering-displayedVariant}} \caption{Browser-like display of a variant.} \label{displayedVariant} \end{figure} \section{Filters and cutoffs} In the case of having run the \Rfunction{unrelatedIndividuals()} annotation function we can filter variants by restricting the samples involved in the analysis, as follows: <<>>= samples(uind) samples(uind) <- c("NA12891", "NA12892") uind uindSO2sam <- summary(uind, method="SOfull") uindSO2sam[uindSO2sam$Level == 6, ] @ As we can see, restricting the samples for filtering variants results in fewer variants. We can set the original samples back with the function \Rfunction{resetSamples()}: <<>>= uind <- resetSamples(uind) uind @ The rest of the filtering operations we can perform on a \Rclass{VariantFilteringResults} objects are implemented through the \Rclass{FilterRules} class which implements a general mechanism for generating logical masks to filter vector-like objects; consult its manual page at the \Biocpkg{IRanges} package for full technical details. The \Biocpkg{Variantfiltering} package provides a number of default filters, which can be extended by the user. To see which are these filters we just have to use the \Rfunction{filters()} function: <<>>= filters(uind) @ Filters may be active or inactive. Only active filters will participate in the filtering process when we interrogate for variants. To know what filters are active we should use the \Rfunction{active()} function as follows: <<>>= active(filters(uind)) @ By default, all filters are always inactive. To activate all of them, we can simply type: <<>>= active(filters(uind)) <- TRUE active(filters(uind)) summary(uind) @ To deactivate all of them back and selectively activate one of them, we should use the bracket \Robject{[]} notation, as follows: <<>>= active(filters(uind)) <- FALSE active(filters(uind))["dbSNP"] <- TRUE summary(uind) @ The previous filter just selects variants with an annotated dbSNP identifier. However, other filters may require cutoff values to decide what variants pass the filter. To set those values we can use the function \Rfunction{cutoffs()}. For instance, in the case of the \Robject{SOterms} filter, we should use set the cutoff values to select variants annotated to specific SO terms. Here we select, for instance, those annotated in UTR regions: <<>>= change(cutoffs(uind), "SOterms") <- "UTR_variant" active(filters(uind))["SOterms"] <- TRUE summary(uind) summary(uind, method="SOfull") @ Note that the first call to \Rfunction{summary()} did not report any variant since there are no variants annotated to the SO term \Robject{UTR\_variant}. However, when using the argument \Robject{method="SOfull"}, all variants annotated to more specific SO terms in the hierarchy will be reported. The methods \Rfunction{filters()} and \Rfunction{cutoffs()} can be employed to extend the filtering functionality. Here we show a simple example in which we add a filter to detect the loss of the codon that initiates translation. This constitutes already a feature annotated by \Biocpkg{VariantFiltering} so that we can verify that it works: <<>>= startLost <- function(x) { mask <- start(allVariants(x, groupBy="nothing")$CDSLOC) == 1 & as.character(allVariants(x, groupBy="nothing")$REFCODON) == "ATG" & as.character(allVariants(x, groupBy="nothing")$VARCODON) != "ATG" mask } filters(uind)$startLost <- startLost active(filters(uind)) <- FALSE active(filters(uind))["startLost"] <- TRUE active(filters(uind)) summary(uind) @ As we can see, our filter works as expected and selects the only variant which was annotated with the SO term \Robject{start\_lost}. Note that there is also an additional annotation indicating this variant belongs to an UTR region resulting from an alternative CDS. Properly updating cutoff values may be problematic if we do not know how exactly are they employed by the corresponding filters. To facilitate setting the right cutoff values we use the \Rfunction{change()} method and we illustrate it on the filter of minor allele frequency (MAF) values: <<>>= active(filters(uind)) <- FALSE active(filters(uind))["KGp1"] <- TRUE cutoffs(uind)$KGp1 cutoffs(uind)$KGp1$popmask cutoffs(uind)$KGp1$value change(cutoffs(uind)$KGp1, "popmask") <- FALSE cutoffs(uind)$KGp1$popmask change(cutoffs(uind)$KGp1, "popmask") <- c(KGp1AF=TRUE) cutoffs(uind)$KGp1$popmask change(cutoffs(uind)$KGp1, "value") <- 0.01 summary(uind) @ In this case we selected variants with MAF$< 0.01$ in the global population of the 1000 Genomes project. If we are interested in retrieving the actual set of filtered variants, we can do it using the function \Rfunction{filteredVariants()}: <<>>= filteredVariants(uind) @ To further understand how to manipulate \Rclass{Vranges} and \Rclass{VRangesList} objects, please consult the package \Biocpkg{VariantAnnotation}. \section{Inheritance models} We can filter upfront variants that do not segregate according to a given inheritance model. In such a case, we also need to provide a PED file at the time we build the parameter object, as follows: <>= CEUped <- file.path(system.file("extdata", package="VariantFiltering"), "CEUtrio.ped") param <- VariantFilteringParam(vcfFilenames=CEUvcf, pedFilename=CEUped) @ Here we are using a PED file included in the \Biocpkg{VariantFiltering} package and specifying information about the CEU trio employed int his vignette. To use an inherintance model we need to replace the previous call to \Rfunction{unrelatedIndividuals()} by one specific to the inheritance model. The \Biocpkg{VariantFiltering} package offers 5 possible ones: \begin{itemize} \item \textbf{Autosomal recessive inheritance analysis: Homozygous.} Homozygous variants responsible for a recessive trait in the affected individuals can be identified calling the \Rfunction{autosomalRecessiveHomozygous()} function. This function selects homozygous variants that are present in all the affected individuals and occur in heterozygosity in the unaffected ones. \item \textbf{Autosomal recessive inheritance analysis: Heterozygous.} To filter by this mode of inheritance, also known as compound heterozygous, we need two unaffected parents/ancestors and at least one affected descendant. Variants are filtered in five steps: 1. select heterozygous variants in one of the parents and homozygous in the other; 2. discard previously selected variants that are common between the two parents; 3. group variants by gene; 4. select those genes, and the variants that occur within them, which have two or more variants and there is at least one from each parent; 5. from the previously selected variants, discard those that do not occur in the affected descendants. This is implemented in the function \Rfunction{autosomalRecessiveHeterozygous()}. \item \textbf{Autosomal dominant inheritance analysis.} The function \Rfunction{autosomalDominant()} identifies variants present in all the affected individual(s) discarding the ones that also occur in at least one of the unaffected subjects. \item \textbf{X-Linked inheritance analysis.} The function \Rfunction{xLinked()} identifies variants that appear only in the X chromosome of the unaffected females as heterozygous, don't appear in the unaffected males analyzed and finally are present (as homozygous) in the affected male(s). This function is currently restricted to affected males, and therefore, it cannot search for X-linked segregating variants affecting daughters. \item \textbf{De Novo variants analysis} The function \Rfunction{deNovo()} searches for \emph{de novo} variants which are present in one descendant and present in both parents/ancestors. It is currently restricted to a trio of individuals. \end{itemize} \section{Create a report from the filtered variants} The function \Rfunction{reportVariants()} allows us to easily create a report from the filtered variants into a CSV or a TSV file as follows: <>= reportVariants(uind, type="csv", file="uind.csv") @ The default value on the \Robject{type} argument (\Robject{"shiny"}) starts a shiny web app which allows one to interactively filter the variants, obtaining an udpated \Rclass{VariantFilteringResults} object and downloading the filtered variants and the corresponding full reproducible R code, if necessary. \begin{figure}[ht] \centerline{\includegraphics[width=\textwidth]{shinysnapshotVariantFiltering.jpg}} \caption{Snapshot of the shiny web app run from VariantFiltering with the function \Rfunction{reportVariants()}. Some of the parameters has been filled for illustrative purposes.} \label{shinyapp} \end{figure} \section{Using the package with parallel execution} Functions in \Biocpkg{VariantFiltering} to annotate and filter variants leverage the functionality of the Bioconductor package \Biocpkg{BiocParallel} to perform in parallel some of the tasks and calculations and reduce the overall execution time. These functions have an argument called \Robject{BPPARAM} that allows the user to control how this parallelism is exploited. In particular the user must give as value to this argument the result from a call to the function \Rfunction{bpparam()}, which actually is its default behavior. Here below we modify that behavior to force a call being executed without parallelism. The interested reader should consult the help page of \Rfunction{bpparam()} and the vignette of the \Biocpkg{BiocParallel} for further information. \section{Session information} <>= toLatex(sessionInfo()) @ \end{document}