\title{Taxonomic classification using \shellcommand{pplacer}, \Rpackage{clst}, and \Rpackage{clstutils}}
\author{Noah Hoffman}
\maketitle

\tableofcontents

\section{Introduction}

This vignette assumes that you have already created a reference
package and have used it to run \shellcommand{pplacer} against an
alignment of reference sequences. For instructions on performing
these steps, see the documentation for \shellcommand{pplacer}:
\url{http://matsen.fhcrc.org/pplacer/}

\section{Input files}

The following input is required (the variable holding each file name
is given in parentheses):

\begin{enumerate}
\item a reference package (\code{refpkg})
\item a pplacer file created using the same reference package (\code{placefile})
\item a file providing distances between nodes in the reference tree (\code{distfile})
\end{enumerate}

Note that \code{distfile} is generated from \code{placefile} using
\shellcommand{placeutil} (distributed with \shellcommand{pplacer}).

<<>>=
library(clstutils)

## Unpack the example reference package into a temporary directory
## and return the path to the extracted item named 'fname'.
expand <- function(fname){
  orig.dir <- getwd()
  ## restore the working directory even if extraction fails
  on.exit(setwd(orig.dir))
  destdir <- tempdir()
  setwd(destdir)
  archive <- system.file('extdata','vaginal_16s.refpkg.tar.gz', package='clstutils')
  system(sprintf('tar --no-same-owner -xzf "%s"', archive))
  file.path(destdir, fname)
}

refpkg <- expand('vaginal_16s.refpkg')
placefile <- system.file('extdata','merged.json', package='clstutils')
distfile <- system.file('extdata','merged.distmat.bz2', package='clstutils')
@

\section{Reading the input}

Classification requires a matrix representation of distances between
the ``objects'' being classified, in this case sequences in a
phylogenetic tree. \Rfunction{treeDists} returns a list containing
matrix representations of distances between internal and terminal
edges (\code{\$dists} and \code{\$paths}), and \code{\$dmat}, a square
matrix of distances between terminal edges.

<<>>=
treedists <- treeDists(distfile=distfile, placefile=placefile)
@

We also need a description of the taxonomy of the reference sequences.
This is read from the reference package using
\Rfunction{taxonomyFromRefpkg}. The \Rfunarg{seqnames} argument
ensures that the output is arranged in an order compatible with
\code{treedists}. The \Rfunarg{lowest\_rank} argument indicates that
the most specific rank we want to consider is ``species''.

<<>>=
taxdata <- taxonomyFromRefpkg(refpkg,
                              seqnames=rownames(treedists$dmat),
                              lowest_rank='species')
@

\section{Classification of a single sequence}

Given the distances and taxonomic information describing the reference
tree, the only additional data required to perform classification is
the position of a sequence placed onto the tree. At a minimum, this
consists of a \Robject{data.frame} with columns \code{at}, \code{edge},
and \code{branch}. These data are used to generate a vector of branch
lengths between the query sequence and each of the reference sequences
on the tree.

<<>>=
placetab <- data.frame(at=49, edge=5.14909e-07, branch=5.14909e-07)
@

The function \Rfunction{classifyPlacements} is a wrapper around
\Rpackage{clst}::\Rfunction{classifyIter}. The output is a
\Robject{data.frame} describing the taxonomic assignment, along with a
description of the confidence of the classification. See the man page
for \Rpackage{clst}::\Rfunction{classify} for details on the output.

<<>>=
cdata <- classifyPlacements(taxdata, treedists, placetab)
cdata
@

% \section{Classification of multiple sequences}

% Placement information for multiple sequences may be contained in a
% placefile. This information can be read directly from a single
% placefile using \Rfunction{placeData}.

% <<>>=
% placetab <- placeData(placefile)
% head(placetab)
% @

% Note that pplacer also provides tax\_id assignments using a different
% method. Also, a list of possible placements is provided for each
% sequence, accompanied by the probability of each. We will use only the
% first placement (i.e., the top ``hit'') for each sequence.

% <<>>=
% tophits <- subset(placetab, hit==1)
% set.seed(125)
% cdata <- classifyPlacements(taxdata, treedists,
%                             tophits[sample(nrow(tophits),10),])
% cdata
% @

\end{document}