---
title: "Beta Diversity"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Beta Diversity}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Input Matrix

Here we'll use the `ex_counts` feature table included with ecodive. It contains
the number of observations of each bacterial genera in each sample. In the text
below, you can substitute the word 'genera' for the feature of interest in your
own data.

```r
library(ecodive)

counts <- rarefy(ex_counts)

counts
#>                   Saliva Gums Nose Stool
#> Streptococcus        162  309    6     1
#> Bacteroides            2    2    0   341
#> Corynebacterium        0    0  171     1
#> Haemophilus          180   34    0     1
#> Propionibacterium      1    0   82     0
#> Staphylococcus         0    0   86     1
```


# Beta Diversity

Beta diversity is a measure of how different two samples are.

Looking at the `counts` matrix above, you can easily see that saliva and gums
are similar, while saliva and stool are different. The different metrics
described below quantify that difference, referred to as the "distance" or
"dissimilarity" between a pair of samples. The distance is `0` for identical
samples and `1` for completely different samples.


## Weighted vs Unweighted

The classic algorithms all run in weighted mode by default. Specifying `weighted
= FALSE`, e.g. `canberra(counts, weighted = FALSE)` will switch them to
unweighted mode.

* `bray_curtis()`, `canberra()`, `euclidean()`, `gower()`, `jaccard()`, `kulczynski()`, `manhattan()`

For the UniFrac algorithms, `unweighted_unifrac()` is unweighted and all the others are weighted.

* Unweighted: `unweighted_unifrac()`

* Weighted: `weighted_unifrac()`, `weighted_normalized_unifrac()`, `generalized_unifrac()`, `variance_adjusted_unifrac()`


# Partial Calculation

The default value of `pairs=NULL` in ecodive's beta diversity functions results
in the returned all-vs-all distance matrix being completely filled in.

```r
bray_curtis(counts)
#>          Saliva      Gums      Nose
#> Gums  0.4260870                    
#> Nose  0.9797101 0.9826087          
#> Stool 0.9884058 0.9884058 0.9913043
```

If you are doing a reference-vs-all comparison, you can use the `pairs`
parameter to skip unwanted calculations and save some CPU time. The larger the
dataset, the more noticeable the improvement will be.

```r
bray_curtis(counts, pairs = 1:3)
#>          Saliva      Gums      Nose
#> Gums  0.4260870                    
#> Nose  0.9797101        NA          
#> Stool 0.9884058        NA        NA
```

The `pairs` argument can be:

* A numeric vector, giving the positions in the result to calculate.
* A logical vector, indicating whether to calculate a position in the result.
* A `function(i,j)` that returns whether columns `i` and `j` should be compared.

Therefore, all of the following are equivalent:

```r
bray_curtis(counts, pairs = 1:3)
bray_curtis(counts, pairs = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE))
bray_curtis(counts, pairs = function (i, j) i == 1)
```


The ordering of `pairs` follows the pairings produced by `combn()`.

```r
# Column index pairings
combn(ncol(counts), 2)
#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,]    1    1    1    2    2    3
#> [2,]    2    3    4    3    4    4

# Sample name pairings
combn(colnames(counts), 2)
#>      [,1]     [,2]     [,3]     [,4]   [,5]    [,6]   
#> [1,] "Saliva" "Saliva" "Saliva" "Gums" "Gums"  "Nose" 
#> [2,] "Gums"   "Nose"   "Stool"  "Nose" "Stool" "Stool"
```

So, for instance, to use gums as the reference sample:

```r
my_combn <- combn(colnames(counts), 2)
my_pairs <- my_combn[1,] == 'Gums' | my_combn[2,] == 'Gums'

my_pairs
#> [1]  TRUE FALSE FALSE  TRUE  TRUE FALSE

bray_curtis(counts, pairs = my_pairs)
#>          Saliva      Gums      Nose
#> Gums  0.4260870                    
#> Nose         NA 0.9826087          
#> Stool        NA 0.9884058        NA
```