--- title: "Using metagenomeFeatures to Retrieve Feature Data." author: - name: Nathan D. Olson affiliation: National Institute of Standards and Technology, Maryland, USA package: metagenomeFeatures output: BiocStyle::html_document: toc_float: true vignette: > %\VignetteIndexEntry{Using metagenomeFeatures to Retrieve Feature Data.} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} references: - id: rousk2010soil title: Soil bacterial and fungal communities across a pH gradient in an arable soil. author: - family: Rousk given: Johannes - family: B{\aa}{\aa}th given: Erland - family: Brookes given: Philip C - family: Lauber given: Christian L - family: Lozupone given: Catherine - family: Caporaso given: J Gregory - family: Knight given: Rob - family: Fierer given: Noah container-title: The ISME journal volume: 4 issue: 10 publisher: Nature Publishing Group page: 1340 type: article-journal issued: year: 2010 --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", warning = FALSE ) suppressPackageStartupMessages(library(metagenomeFeatures)) suppressPackageStartupMessages(library(phyloseq)) ``` # Vignette overview `r Biocpkg("metagenomeFeatures")` and associated annotation packages[^1] can be used to obtain phylogentic trees and representative sequences for 16S rRNA marker gene sequence experimental datasets when the reads are clustered using closed reference clustering and a database with an `MgDb` annotation package. In this vignette we demostrate how to use `metagenomeFeatures` and Greengenes 16S rRNA database version 13.8 85\% OTUs to obtain a phylogenetic tree and representative sequences for Rousk et al. [-@rousk2010soil], a soil microbiome study. We then use the tree and sequence data obtained from the Greengenes `MgDb` to generate a `phyloseq-class` object for community analysis[^2]. # Greengenes 13.8 85\% OTU MgDb The gg 13.8 85% OTU (`gg85`) is provided as part of the `r Biocpkg("metagenomeFeatures")` package. `gg85` is a `MgDb-class` object with the taxonomic heirarchy, sequence data, and phylogeny for the Greengenes database clustered at the 0.85 similarity threshold. After loading the `r Biocpkg("metagenomeFeatures")` package `gg85` is loaded using `get_gg13.8_85MgDb()`.[^3] ```{r message = FALSE} library(metagenomeFeatures) gg85 <- get_gg13.8_85MgDb() ``` ## QIITA Dataset [QIITA](https://qiita.ucsd.edu) is a public repository for sharing OTU tables and raw sequence data from 16S rRNA marker gene studies, however some of the datasets do not include representative sequences or phylogenetic trees. For this vignette we are using data from the previously mentioned soil microbiome study, https://qiita.ucsd.edu/study/description/94 [@rousk2010soil]. A BIOM and qiime mapping file for the study can be obtained from QIITA. A vector of Greengenes ids for the study cluster centers is included in this package for use in this vignette. ```{r} data(qiita_study_94_gg_ids) ``` ## Obtaining Representative Sequences and Phylogenetic Tree A `mgFeatures` object is generated from `gg85` (or any other `MgDb-class` object) using the `annotateFeatures()` function along with a set of database ids or keys. The resulting `mgFeatures` class object has the taxonomic heirarchy, phylogeny, and sequence data for the study OTUs.[^3] ```{r} soil_mgF <- annotateFeatures(gg85, qiita_study_94_gg_ids) ## Taxonomic heirarchy soil_mgF # Sequence data mgF_seq(soil_mgF) # Tree data mgF_tree(soil_mgF) ``` # Using `mgFeatures-class` object with `r Biocpkg("phyloseq")` Finally we use the phyogenetic tree from the `mgFeatures-class` object along with the OTU table for a beta diversity analysis. After loading the `r Biocpkg("phyloseq")` package we will load the abundance data from the QITTA biom file, using the `r Biocpkg("phyloseq")` command `import_biom()`. Next we will define the `sample_data` slot using a modified version of the mapping file `2301_mapping_file.txt` obtained from QITTA, which only includes sample names and experimental variables pH, carbon/nitrogen ratio, total nitrogen, and total organic carbon. Future development of a common data structure for working with microbiome data that extends the `SummarizeExperiment-class` is underway. For this new class the `mgFeatures-class` will define the `rowData` slot. ```{r} data_dir <- system.file("extdata", package = "metagenomeFeatures") ## Load Biom biom_file <- file.path(data_dir, "229_otu_table.biom") soil_ps <- phyloseq::import_biom(BIOMfilename = biom_file) ## Define sample data sample_file <- file.path(data_dir, "229_sample_data.tsv") sample_dat <- read.delim(sample_file) ## Rownames matching sample_names(), required for phyloseq sample_data slot rownames(sample_dat) <- sample_dat$SampleID sample_data(soil_ps) <- sample_dat ## Resulting phyloseq object soil_ps ``` Rousk et al. [-@rousk2010soil] clustered the sequences at 97\% similarity therefore `gg85` only contains a subset of the OTUs in the dataset. To define the tree and sequence slot we need to subset `soil_ps` to only include OTUs in `gg85`. Samples with no OTUs in `gg85` were also removed for our beta diversity analysis. ```{r} # Removing OTUs not in `gg85` soil_tree <- mgF_tree(soil_mgF) soil_ps_gg85 <- prune_taxa(taxa = soil_tree$tip.label, x = soil_ps) # Removing samples with no OTUs in `gg85` samples_to_keep <- sample_sums(soil_ps_gg85) != 0 soil_ps_gg85 <- prune_samples(samples = samples_to_keep, x = soil_ps_gg85) ``` Now `soil_mgF` can be used to to define our `soil_ps` object `refseq` and `tree` slots. ```{r} ## Defining tree slot phy_tree(physeq = soil_ps_gg85) <- soil_tree ## Defining seq slot soil_ps_gg85@refseq <- mgF_seq(soil_mgF) ``` Now that the `tree` slot is defined we can perform microbial community analysis requiring a phylogenetic tree such as phylogenetic beta diversity. For the subset of features used here, pH is a primary driver for differences in beta diversity between the samples (Figure \@ref(fig:betaFig)). Nearly 30\% of the total variation in the data is explained by the first axis and sample pH is correlated with the first axis. This result is consistent with Rousk et al. [-@rousk2010soil] even though we only used a small subset of the study OTUs. The outlier sample (94.PH18) is an artifact of using a small subset of the original dataset, only one `gg85` OTU is observed in the sample and that OTU was not observed in the other samples. ```{r betaFig, fig.cap="Beta diversity and ordination for a subset of features from Rousk et al. [-@rousk2010soil]. Beta diversity was estimated using Weighted Unifrac and Principal Component Analysis was used for ordination. Sampels are represented as individual point and color indicates soil sample pH."} soil_ord <- ordinate(physeq = soil_ps_gg85, distance = "wunifrac", method = "PCoA") plot_ordination(soil_ps_gg85, soil_ord, color = "ph", type="sample", label = "SampleID") ``` # Session info {.unnumbered} ```{r sessionInfo, echo=FALSE} sessionInfo() ``` [^1]: Other databases are avilable as [Bioconductor AnnotationData](https://www.bioconductor.org/packages/release/data/annotation/) including; Greengenes Release 13.5: `r Biocpkg("greengenesMgDb13.5")`, Greengenes Release 13.8 99% OTUs,`r Biocpkg("greengenes13.8_99MgDb")` Ribosomal Database Project Release 11.5, `r Biocpkg("ribosomaldatabaseproject11.5MgDb")`, and Silva 128.1: `r Biocpkg("silva128.1MgDb")`. [^2]: The `r Biocpkg("phyloseq")` package defines the `phyloseq-calss` for working with 16S rRNA experimental data. [^3]: See the "metagenomeFeatures classes and methods." vignette, `Vignettes("metagenomeFeatures classes and methods.")` for more information on methods for working with `MgDb-class` and `mgFeatures-class` objects. # Reference {.unnumbered}