---
title: "Beta Diversity"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Beta Diversity}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Input Matrix

Here we'll use the `ex_counts` feature table included with ecodive. It contains the number of observations of each bacterial genus in each sample. In the text below, you can replace the word 'genera' with the feature of interest in your own data.

```r
library(ecodive)

counts <- rarefy(ex_counts)
counts
#>                   Saliva Gums Nose Stool
#> Streptococcus        162  309    6     1
#> Bacteroides            2    2    0   341
#> Corynebacterium        0    0  171     1
#> Haemophilus          180   34    0     1
#> Propionibacterium      1    0   82     0
#> Staphylococcus         0    0   86     1
```

# Beta Diversity

Beta diversity is a measure of how different two samples are. Looking at the `counts` matrix above, you can easily see that saliva and gums are similar, while saliva and stool are different. The metrics described below quantify that difference, referred to as the "distance" or "dissimilarity" between a pair of samples. The distance is `0` for identical samples and `1` for completely different samples.

## Weighted vs Unweighted

The classic algorithms all run in weighted mode by default. Specifying `weighted = FALSE`, e.g. `canberra(counts, weighted = FALSE)`, switches them to unweighted mode.

* `bray_curtis()`, `canberra()`, `euclidean()`, `gower()`, `jaccard()`, `kulczynski()`, `manhattan()`

For the UniFrac algorithms, `unweighted_unifrac()` is unweighted and all the others are weighted.

* Unweighted: `unweighted_unifrac()`
* Weighted: `weighted_unifrac()`, `weighted_normalized_unifrac()`, `generalized_unifrac()`, `variance_adjusted_unifrac()`

# Partial Calculation

With the default `pairs = NULL`, ecodive's beta diversity functions return an all-vs-all distance matrix that is completely filled in.

```r
bray_curtis(counts)
#>          Saliva      Gums      Nose
#> Gums  0.4260870
#> Nose  0.9797101 0.9826087
#> Stool 0.9884058 0.9884058 0.9913043
```

If you are doing a reference-vs-all comparison, you can use the `pairs` parameter to skip unwanted calculations and save some CPU time. The larger the dataset, the more noticeable the improvement will be.

```r
bray_curtis(counts, pairs = 1:3)
#>          Saliva Gums Nose
#> Gums  0.4260870
#> Nose  0.9797101   NA
#> Stool 0.9884058   NA   NA
```

The `pairs` argument can be:

* A numeric vector, giving the positions in the result to calculate.
* A logical vector, indicating whether to calculate a position in the result.
* A `function(i,j)` that returns whether columns `i` and `j` should be compared.

Therefore, all of the following are equivalent:

```r
bray_curtis(counts, pairs = 1:3)
bray_curtis(counts, pairs = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE))
bray_curtis(counts, pairs = function (i, j) i == 1)
```

The ordering of `pairs` follows the pairings produced by `combn()`.

```r
# Column index pairings
combn(ncol(counts), 2)
#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,]    1    1    1    2    2    3
#> [2,]    2    3    4    3    4    4

# Sample name pairings
combn(colnames(counts), 2)
#>      [,1]     [,2]     [,3]     [,4]   [,5]    [,6]
#> [1,] "Saliva" "Saliva" "Saliva" "Gums" "Gums"  "Nose"
#> [2,] "Gums"   "Nose"   "Stool"  "Nose" "Stool" "Stool"
```

So, for instance, to use gums as the reference sample:

```r
my_combn <- combn(colnames(counts), 2)
my_pairs <- my_combn[1,] == 'Gums' | my_combn[2,] == 'Gums'

my_pairs
#> [1]  TRUE FALSE FALSE  TRUE  TRUE FALSE

bray_curtis(counts, pairs = my_pairs)
#>          Saliva      Gums Nose
#> Gums  0.4260870
#> Nose         NA 0.9826087
#> Stool        NA 0.9884058   NA
```
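
The gums example generalizes to any reference sample. As a minimal sketch, the hypothetical helper `reference_pairs()` below (not part of ecodive) builds the logical `pairs` vector from `combn()` using base R only. It assumes the sample names are unique and are supplied in the same order as the columns of the count matrix.

```r
# Hypothetical helper, not an ecodive function: select every pairing
# that involves the reference sample `ref`.
reference_pairs <- function(samples, ref) {
  stopifnot(ref %in% samples)

  # 2 x choose(n, 2) matrix of pairings, in the order `pairs` expects
  pairings <- combn(samples, 2)

  pairings[1, ] == ref | pairings[2, ] == ref
}

# Sample names copied from the `counts` matrix above
samples <- c("Saliva", "Gums", "Nose", "Stool")

gums_pairs <- reference_pairs(samples, "Gums")
gums_pairs
#> [1]  TRUE FALSE FALSE  TRUE  TRUE FALSE

# The same selection expressed as numeric positions
which(gums_pairs)
#> [1] 1 4 5
```

Passing the result along as `bray_curtis(counts, pairs = reference_pairs(colnames(counts), "Gums"))` should give the same gums-as-reference matrix as the `my_pairs` example.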