--- title: "wingen-vignette" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{wingen-vignette} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: markdown: wrap: 72 --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup, warning = FALSE, message = FALSE} library(wingen) library(terra) library(raster) library(viridis) library(sf) library(ggplot2) ``` ## Background wingen uses a moving window approach to create maps of genetic diversity. The method and its rationale are described in [Bishop et al. (2023)](http://doi.org/10.1111/2041-210X.14090) and on the [Methods blog](https://methodsblog.com/2023/05/03/wingen-mapping-genetic-diversity-using-moving-windows/). ## Example To demonstrate how wingen works, we will use a subset of the data from the [Bishop et al. (2023)](http://doi.org/10.1111/2041-210X.14090) simulation example. These simulations were created using Geonomics ([Terasaki Hart et al., 2022](https://doi.org/10.1093/molbev/msab175)) to generate a realistic landscape genomic dataset. In this simulation, spatial variation in genetic diversity is produced by varying population size and gene flow across the landscape via heterogeneous carrying capacity and conductance surfaces. These surfaces are based on an example digital elevation model of Tolkien's Middle Earth produced by the Center for Geospatial Analysis at William & Mary ([Robert, 2020](https://scholarworks.wm.edu/asoer/3/)). ### Load Middle Earth example The small Middle Earth example dataset used here contains four objects which are loaded by `load_middle_earth_ex()`: 1. `lotr_vcf` - a vcfR object containing the genetic data 2. `lotr_coords` - a dataframe object containing sample coordinates 3. `lotr_lyr` - a raster object of the landscape (higher values indicate greater connectivity/carrying capacity) 4. `lotr_range` - a polygon outlining the "range" of the simulated species ```{r load example data, fig.width = 5, fig.height = 5} load_middle_earth_ex() # Genetic data lotr_vcf # Coordinates head(lotr_coords) # Raster data lotr_lyr # Map of data plot(lotr_lyr, col = magma(100), axes = FALSE, box = FALSE) points(lotr_coords, col = mako(1, begin = 0.8), pch = 3, cex = 0.5) ``` If users don't have a raster layer of their landscape, they can generate one using their sample coordinates with the `coords_to_raster()` function. The resolution of this raster can be tuned either with the `agg` (to aggregate) or `disagg` (to disaggregate) arguments, or defined using the `res` argument. The `res` argument can either be a single value (e.g., 0.00833) or a vector of two values with the x and y resolutions. The `buffer` argument can be used to add an edge to the raster (i.e., buffer away from the coordinates). ```{r, fig.width = 5, fig.height = 5} ex_raster1 <- coords_to_raster(lotr_coords, buffer = 1, plot = TRUE) ex_raster2 <- coords_to_raster(lotr_coords, buffer = 1, agg = 2, plot = TRUE) ex_raster3 <- coords_to_raster(lotr_coords, buffer = 1, disagg = 4, plot = TRUE) ex_raster4 <- coords_to_raster(lotr_coords, buffer = 1, res = 10, plot = TRUE) ``` ## Workflow The workflow of wingen uses three main functions: 1. `window_gd()`, `circle_gd()`, or `resist_gd()` to generate moving window maps of genetic diversity 2. `wkrig_gd()` to use kriging to interpolate the moving window maps 3. `mask_gd()` to mask areas of the maps from (1) and (2) (e.g., to exclude areas outside the study region) ## Different moving window calculations There are three functions that can be used to generate moving window maps: 1. `window_gd()` uses a simple rectangular window with user-specified dimensions; this is the original method described in Bishop et al. 2023 2. `circle_gd()` uses a circular window with a user-specified radius 3. `resist_gd()` uses a resistance distance-based window with a user-specified maximum distance (this can be thought of as a circular window with a radius that varies depending on the resistance of the landscape around it) The main arguments that all three functions have in common are: 1. `gen` - An object of type vcfR or a path to a vcf file with genotype data. The order of this file matters! The coordinate and genetic data should be in the same order, as there are currently no checks for this. 2. `coords` - sf points, or a matrix or dataframe with two columns representing the coordinates of the samples. The first column should be x and the second should be y. 3. `lyr` - A SpatRaster or RasterLayer which the window will move across to create the final map. In most cases, this will take the form of a raster of the study area. 4. `stat` - The genetic diversity summary statistic to calculate. Current genetic diversity metrics that can be specified with `stat` include: - `"pi"` for nucleotide diversity (default) calculated using `hierfstat::pi.dosage()` (only works for bi-allelic data). Use the `L` argument to set the sequence length for the pi calculation. If the sequence length is unknown, `L` can be set to "nvariants" (default) which sets `L` to the number of variants; however, note that this is not strictly the correct calculation of pi because it does not account for invariant sites. If `L` is set to NULL, the sum over SNPs of nucleotide diversity is returned (*note:* `L = NULL` is the \link[hierfstat]{pi.dosage}). Otherwise, `L` can be set to any numeric value which will serve as the denominator of the pi calculation. - `"Ho"` for average observed heterozygosity across all sites - `"allelic_richness"` for average number of alleles across all sites - `"biallelic_richness"` for average allelic richness across all sites for a biallelic dataset (this option is faster than `"allelic_richness"`). When calculating `“biallelic_richness”`, users can choose to rarify allele counts (as in `hierfstat::allelic.richness()`) by setting `rarify_alleles = TRUE` (the default) or to use the raw allele counts by setting `rarify_alleles = FALSE`. We recommend performing allele count rarefaction (`rarify_alleles = TRUE`) if there are missing values in the genetic data, but for datasets with no missing data, it is faster to use the raw counts (`rarify_alleles = FALSE`) - `"hwe"` for the proportion of sites that are not in Hardy-Weinberg equilibrium, calculated using `pegas` \link[pegas]{hw.test} at the 0.05 level (other alpha levels can be specified by adding the `sig` argument; e.g., `sig = 0.10`) - `"basic_stats"` for a series of statistics produced by `hierfstat::basic.stats()` including mean observed heterozygosity (same as Ho), mean gene diversities within population (Hs), Gene diversities overall (Ht), and Fis following Nei (1987). Population-based statistics (e.g., FST) normally reported by `hierfstat::basic.stats()` are not included as they are not meaningful within the individual-based moving windows 5. `fact` - To decrease computational time, we provide the option to aggregate the input raster layer by some factor defined using the `fact` argument. Increasing `fact` will decrease the number of cells and thereby decrease the number of calculations, with the trade-off that the resolution of the output layers will decrease. Users should keep in mind that if they increase `fact`, they may simultaneously want to decrease `wdim` because the proportion of the landscape covered by the neighborhood matrix could otherwise increase substantially. 6. `rarify` - Users have the option to perform rarefaction by setting the `rarify` argument to `TRUE.` If `rarify = TRUE`, users then define `rarify_n` as the number of samples to rarify to and `rarify_nit` as the number of iterations for rarefaction (e.g., if `rarify_n = 4` and `rarify_nit = 5`, for each sample set, four random samples will be drawn five times). Users can also set `rarify_nit = "all"`to use all possible combinations of samples of size `rarify_n` within the window (for example, if `rarify_n = 4` and the number of samples in the window is 5, all 20 possible combinations of samples will be used). As the window moves across the landscape, three things can occur based on the number of samples in the window: (1) if the number of samples is lower than `rarify_n`, genetic diversity is not calculated and a raster value of NA is assigned, (2) if the number of samples is equal to `rarify_n`, the genetic diversity statistic is calculated for those samples, (3) if the number of samples is greater than `rarify_n`, rarefaction is implemented and that set of samples is subsampled `rarify_nit` times to a size of `rarify_n` and the mean (or another summary statistic set using `fun`) of those `rarify_nit` iterations is used. Note that if the number of samples is lower than `min_n`, a raster value of NA will always be assigned, even if the number of samples is not less than `rarify_n` (i.e., if `min = 2` and `rarify_n = 1` and the sample size is 1 in the window, then an NA value will be assigned; to change this behavior so that genetic diversity is calculated, set `min_n = 1`). If `rarify = FALSE`, rarefaction is not performed and only steps (1) and (2) from above occur such that: (1) if the number of samples in the window is less than the `min_n` argument, genetic diversity is not calculated and a raster value of NA is assigned, and (2) if the number of samples is equal to or greater than `min_n`, the genetic diversity statistic is calculated for those samples. We highly encourage users to perform rarefaction as genetic diversity statistics are sensitive to sample size. The main benefit of not performing rarefaction is decreased computational time; however, this is not worth the trade-off in inaccuracy unless you are confident that there is no effect of rarefaction after performing your analysis with and without rarefaction. ------------------------------------------------------------------------ *Note:* Coordinates and rasters used in wingen should be in a projected (planar) coordinate system such that raster cells are of equal sizes. Therefore, spherical systems (including latitude-longitude coordinate systems) should be projected prior to use. An example of how this can be accomplished is shown below. If no CRS is provided, a warning will be given and wingen will assume the data are provided in a projected system. ```{r} # First, we create example latitude-longitude coordinates and rasters ## Example raster: lyr_longlat <- rast( ncols = 40, nrows = 40, xmin = -110, xmax = -90, ymin = 40, ymax = 60, crs = "+proj=longlat +datum=WGS84" ) ## Example coordinates: coords_df <- data.frame(x = c(-110, -90), y = c(40, 60)) coords_longlat <- st_as_sf(coords_df, coords = c("x", "y"), crs = "+proj=longlat") # Next, the coordinates and raster can be projected to an equal area projection, in this case the Goode Homolosine projection (https://proj.org/operations/projections/goode.html): coords_eq <- st_transform(coords_longlat, crs = "+proj=goode") lyr_eq <- project(lyr_longlat, "+proj=goode") # Coordinates can be switched back to latitude-longitude the same way by replacing "goode" with "longlat" ``` ### Run moving window calculations with `window_gd()` To create a moving window map with a rectangular window, use `window_gd()`. The main **additional** arguments to `window_gd()` are: 1. `wdim` - Used to create the neighborhood matrix for the moving window based on the dimensions provided. This argument can either be set to one value (e.g., 3) which will create a square window (e.g., 3 x 3), or two values, which will create a rectangular window (e.g., 3 x 5). We encourage users to experiment with different values of `wdim` to determine the sensitivity of their results to this parameter. Ideally, `wdim` would be set with some knowledge of the study system in mind (e.g., the dispersal patterns and/or neighborhood size of the study organism). A preview of the window size relative to the landscape can be obtained using the `preview_gd()` function. 2. `crop_edges` - Whether to crop out the cells along the edge of the raster that are not surrounded by a full window. Users may want to do so to avoid "edge effects" caused by incomplete windows along the borders of the raster. As wingen is relatively insensitive to sample size, edge effects are not likely to have a very strong effect on diversity estimates, so by default `crop_edges = FALSE`. Before running `window_gd()`, users can preview the moving window and the counts within each cell of the raster to get a sense of how big the window is and what the density of counts looks like across their landscape. Here, we provide the raster layer (`lotr_lyr`), the coordinates (`lotr_coords`), the window dimensions (7 x 7), the aggregation factor (3), and the minimum sample number (`min_n`). `min_n` will be used to mask the sample count layer to show how much of the landscape will be excluded due to low sample count. We specify `method = window` for a preview of `window_gd()`, we will discuss other methods (i.e., `circle` and `resist` later). ```{r, fig.width = 6, fig.height = 5, cache = TRUE, warning = FALSE} preview_gd(lotr_lyr, lotr_coords, method = "window", wdim = 7, fact = 3, sample_count = TRUE, min_n = 2 ) ``` Next, we run the moving window function with our vcf, coordinates, and raster layer. Here, we set the parameters to calculate pi, using a window size of 7 x 7, an aggregation factor of 3, and rarefaction with a rarefaction size of 2 (i.e., minimum sample size of 2), and 5 iterations. We then plot the genetic diversity layer (the first layer of the produced raster stack) and the sample counts layer (the second layer). *Note:* you will get a warning that no CRS is found for the provided coordinates or raster and to check that the CRS for these objects match. This is because these simulated data don't have a CRS, but we know they match, so we can ignore this warning. ```{r moving window, fig.width = 5, fig.height = 5, cache = TRUE, warning = TRUE, message = FALSE} wgd <- window_gd(lotr_vcf, lotr_coords, lotr_lyr, stat = "pi", wdim = 7, fact = 3, rarify = TRUE, rarify_n = 2, rarify_nit = 5, L = 100 ) # The ggplot_gd function plots the genetic diversity layer ggplot_gd(wgd, bkg = lotr_range) + ggtitle("Moving window pi") + # Add sampling coordinates geom_point(data = lotr_coords, aes(x = x, y = y), pch = 16) ``` ```{r, fig.height = 5, fig.width = 5.5} # The plot_count function plots the sample count layer ggplot_count(wgd) + ggtitle("Moving window sample counts") ``` You can also create base R plots of your results using `plot_gd()` and `plot_count()`: ```{r, fig.width = 5, fig.height = 5} plot_gd(wgd, bkg = lotr_range, main = "Moving window pi") plot_count(wgd, main = "Moving window sample counts") ``` ### Run moving window calculations with `circle_gd()` To create a moving window map with a circular window, we can use `circle_gd()`. The main **additional** arguments to `circle_gd()` are: 1. `maxdist` - maximum geographic distance used to define the neighborhood; any samples farther than this distance will not be included (this can be thought of as the neighborhood radius). Can be provided as either (1) a single numeric value or (2) a SpatRaster where each pixel is the maximum distance to be used for that cell on the landscape (must be the same spatial scale as `lyr`). As with `wdim`, we encourage users to experiment with different values of `maxdist` to determine the sensitivity of their results to this parameter and ideally set it with some knowledge of the study system in mind (e.g., the dispersal patterns and/or neighborhood size of the study organism). 2. `distmat` - optional distance matrix output from `get_geodist()`; if not provided, `circle_gd()` will automatically use the `get_geodist()` function to calculate the distance matrix instead. Like before, we can use the `preview_gd()` function to preview parameter settings. This time we specify `method = "circle"` and need to provide `maxdist`. Before doing so, we can calculate the distance matrix using `get_geodist()` which we can then use in both `preview_gd()` and `circle_gd()` to save time. `get_geodist()` produces a distance matrix where the rows represent the cells of the raster and the columns are the individuals. The values are the geographic distance between each cell and each individual. Note that the input layer for `get_geodist()` must have the same resolution as the input layer for `preview_gd()` and `circle_gd()`. For example, if you want to run `circle_gd()` with `fact = 3` and `lyr = lotr_lyr`, the input layer to `get_geodist()` must be `aggregate(lotr_lyr, 3)`. Alternatively, you can first create a new aggregated raster to be used for all functions, as we have done below. *Note:* if you want to get only the distances between your sample coordinates, set `coords_only = TRUE`. ```{r, fig.width = 5, fig.height = 5, cache = TRUE, warning = FALSE} # First, create a new aggregated raster lotr_lyr5 <- aggregate(lotr_lyr, 5) # Then calculate the distance matrix distmat <- get_geodist(coords = lotr_coords, lyr = lotr_lyr5) preview_gd(lotr_lyr5, lotr_coords, method = "circle", maxdist = 10, distmat = distmat, sample_count = TRUE, min_n = 2 ) ``` Then, we can run `circle_gd()`: ```{r circle, fig.width = 5, fig.height = 5, cache = TRUE, warning = FALSE, message = FALSE} cgd <- circle_gd(lotr_vcf, lotr_coords, lotr_lyr5, stat = "pi", maxdist = 10, distmat = distmat, rarify = FALSE, L = 100 ) ggplot_gd(cgd, bkg = lotr_range) + ggtitle("Circle moving window pi") ``` For `circle_gd()` (and `resist_gd()`), you can also define `maxdist` with a raster to set the maximum distance (i.e., the radius) for each cell. In this case, we can use the carrying capacity/conductance map (`lotr_lyr5`) and multiply it by 100 so that the radius will range from 0 to 100 based on the value of the raster. ```{r variable circle, fig.width = 5, fig.height = 5, cache = TRUE, warning = FALSE, message = FALSE} vcgd <- circle_gd( lotr_vcf, lotr_coords, lotr_lyr5, stat = "pi", maxdist = lotr_lyr5 * 100, distmat = distmat, rarify = FALSE, L = 100 ) ggplot_gd(vcgd, bkg = lotr_range) + ggtitle("Circle moving window pi") ``` ### Run moving window calculations with `resist_gd()` To create a moving window map with resistance distances, use `resist_gd()`. The main **additional** arguments to `resist_gd()` are: 1. `maxdist` - maximum cost distance used to define the neighborhood; any samples further than this cost distance will not be included (this can be thought of as the neighborhood radius, but in terms of cost distance). Can be provided as either (1) a single numeric value or (2) a SpatRaster where each pixel is the maximum distance to be used for that cell on the landscape (must be the same spatial scale as `lyr`). As stated before, we encourage users to experiment with different values of `maxdist` to determine the sensitivity of their results to this parameter. 2. `lyr` - in addition to being the raster layer the window moves across, this will also serve as the conductivity layer used to calculate resistance distances (higher values should mean greater conductivity). 3. `distmat` - optional distance matrix output from `get_resdist()`; if not provided, `resist_gd()` will automatically use the `get_resdist()` function to calculate the distance matrix instead. `resist_gd()` takes a longer time to run and we recommend taking advantage of parallelization, in particular for the calculation of the distance matrix using `get_resdist()`. We also recommend first calculating the distance matrix with `get_resdist()` and then providing it to `resist_gd()` using the `distmat` argument as this allows for `resist_gd()` to be run quickly multiple times (i.e., to test different parameters) by avoiding repeating the costly distance calculations. Note that `lyr` in both `get_resdist()` and in `resist_gd()` must have the same spatial scale (e.g., the same resolution and extent). As with `get_geodist()`, `get_resdist()` produces a distance matrix where the rows represent the cells of the raster and the columns are the individuals. The values are the resistance distance between each cell and each individual. If you want to get only the distances between your sample coordinates, set `coords_only = TRUE`. Again, we can use the `preview_gd()` function to preview parameter settings. Here, we specify `method = "resist"` and provide `maxdist`. ```{r, cache = TRUE, warning = FALSE, message = FALSE} lotr_distmat <- get_resdist(coords = lotr_coords, lyr = lotr_lyr5) ``` ```{r, fig.width = 6.5, fig.height = 5, cache = TRUE, warning = FALSE, message = FALSE} preview_gd(lotr_lyr5, lotr_coords, method = "resist", maxdist = 60, distmat = lotr_distmat, sample_count = TRUE, min_n = 2 ) ``` ```{r resist, fig.width = 5, fig.height = 5, cache = TRUE, warning = FALSE, message = FALSE} rgd <- resist_gd(lotr_vcf, lotr_coords, lotr_lyr5, distmat = lotr_distmat, stat = "pi", maxdist = 60, rarify = FALSE, L = 100 ) ggplot_gd(rgd, bkg = lotr_range) + ggtitle("Resistance moving window pi") ``` In addition to using continuous rasters of connectivity, you can also use rasters to designate passable and impassable areas. This can be useful for designating boundaries you don't want windows to extend across (e.g., mountain ranges) or extend between (e.g., rivers). ```{r, fig.width = 6.5, fig.height = 5, cache = TRUE, warning = FALSE} # create layer resist_lyr <- rast(aggregate(lotr_lyr, 5)) # create an impassable area down the middle coded as NA resist_lyr[] <- 1 resist_lyr[ext(40, 60, -100, 0)] <- NA # get resistance distances distmat <- get_resdist(coords = lotr_coords, lyr = resist_lyr) # plot preview preview_gd(resist_lyr, lotr_coords, method = "resist", distmat = distmat, maxdist = 10) # create moving window map resist_gd <- resist_gd( lotr_vcf, lotr_coords, resist_lyr, maxdist = 10, distmat = distmat ) ggplot_gd(resist_gd) ``` ## Transforming the moving window maps ### Krige results To produce smoother maps of genetic diversity, we provide the function `wkrig_gd()` which creates a spatially interpolated raster from the moving window raster produced by the moving window functions. This function uses the gstat package to fit different variogram models and select the best one to perform kriging. Kriging is performed by first transforming the moving window layer into a set of coordinates with corresponding genetic diversity (or sample count) values and then interpolating using these coordinates across the grid provided. Because of this, it is important to keep in mind how the coordinates from the moving window raster and the grid align. If the resolution of the moving window raster is much lower than that of the grid, there are fewer points for interpolation which can result in grid-like artifacts during kriging. To deal with this issue, users can either (1) resample their moving window raster layers and grid layers to the same resolution using the `resample` argument, or (2) manually disaggregate or aggregate either the moving window or grid layers using the terra `aggregate()` or `disagg()` functions. Generally, if users want a smoother resulting surface, a higher resolution grid layer should be used. Keep in mind that increasing the resolution of the moving window layer (either by resampling or disaggregating) can increase computational time as this increases the number of coordinates being used for kriging. This is also the case for increasing the resolution of the grid layer, though to a lesser extent. To run this function, we provide the moving window raster layer and a raster layer to interpolate across. If no raster layer for interpolation is provided, the moving window layer will be used (i.e., the output kriged layer will have the same resolution and extent of the input window layer). If a stack of rasters is provided, only the first layer will be used. The output of this function is a kriged raster layer. Optionally, users can supply the moving window sample counts layer with the `weight_r` argument to set weights for variogram fitting and kriging, giving higher influence to cells with more samples. Users may also want to test different values for `nmax`, which is the maximum number of points to use for kriging; here, we use `nmax = 30` to balance speed and accuracy. ```{r krige results, fig.width = 5, fig.height = 5, cache = TRUE, warning = FALSE} # First, let's make a layer to krige across # Let's say I want to krige to twice the resolution of the moving window layer # I can use the disagg() function to disaggregate the moving window layer grd <- disagg(wgd, 2) # Then, I can use this layer as the grid to krige across kgd <- wkrig_gd(wgd[["pi"]], grd = grd, weight_r = wgd[["sample_count"]], nmax = 30) ggplot_gd(kgd) + ggtitle("Kriged pi") ``` Users can optionally get the full model output by setting `model_output = TRUE` in which case `wkrig_gd()` returns a list where the first element ("raster") is the kriged raster, the second element ("variogram") is the variogram, and the third element ("model") is the best-fit variogram model. *Note:* Kriging can produce values that fall outside the range of possible values (e.g., genetic diversity values less than 0 or greater than the possible maximum). By default, in order to ensure that the values are bounded within the range of possible values, `wkrig_gd()` converts all values greater than the maximum of the input raster to that maximum (`upper_bound = TRUE`), and all values less than the minimum of the input raster to that minimum (`lower_bound = TRUE`). Users can turn off this functionality by setting `lower_bound` and `upper_bound` to `FALSE`. Users can also set their own custom `lower_bound` and `upper_bound` values (e.g., for heterozygosity or pi, you may want to set `lower_bound = 0` and `upper_bound = 1`). **Legacy method: krig_gd()** Users should note that `wkrig_gd()` superseded `krig_gd()`. This legacy function used the `autoKrige()` function from the R package automap to perform kriging using an automatically generated variogram. We have found that `wkrig_gd()` provides improved performance for interpolation of the moving window raster and we recommend using it instead of `krig_gd()`. ### Mask results Next, we mask the resulting kriged layers. Masking can be performed using a variety of methods. **Method 1:** Mask using the carrying capacity layer to exclude any areas where the carrying capacity is lower than 0.1. Alternatively, one could use a species distribution model or habitat suitability model to exclude areas where the probability of presence is very low: ```{r mask results 1, fig.width = 5.5, fig.height = 5, warning = FALSE} # Aggregate the lotr_lyr to make it the same resolution as kgd before masking # Note: lotr_lyr is a RasterLayer, so we convert it to a SpatRaster mask_lyr <- resample(rast(lotr_lyr), kgd) mgd <- mask_gd(kgd, mask_lyr, minval = 0.1) ggplot_gd(mgd) + ggtitle("Kriged & carrying capacity masked pi") ``` **Method 2:** Mask the layer using a species range map (in this case, an sf polygon) to exclude areas falling outside the species range. ```{r mask results 2, fig.width = 5.5, fig.height = 5, warning = FALSE} mgd <- mask_gd(kgd, lotr_range) ggplot_gd(mgd) + ggtitle("Kriged & range masked pi") ``` **Method 3:** Mask the layer using a spatial Kernel Density Estimation (KDE) of sample density to exclude areas with low sampling density using the SpatialKDE package: ```{r, eval = FALSE} # First, install and load the SpatialKDE package if (!require("SpatialKDE", quietly = TRUE)) install.packages("SpatialKDE") library(SpatialKDE) ``` ```{r mask results 3, fig.width = 5.5, fig.height = 5, eval = FALSE} # Note: this code is not evaluated when building the vignette as it requires the SpatialKDE package # Spatial KDE requires an sf data.frame containing only POINTS that is in a projected coordinate system # The simulated coordinates are not projected, but for the purposes of this example, we'll pretend they are projected under the Goode Homolosine projection # This kind of arbitrary setting of crs should not be done for real data (see above example for how to properly project coordinates) lotr_sf <- st_as_sf(lotr_coords, coords = c("x", "y")) %>% st_set_crs("+proj=goode") # The grid layer must also be a RasterLayer for SpatialKDE # Here, we use the kriged raster as our grid to get a KDE layer of the same spatial extent and resolution grid <- raster(kgd) # See the SpatialKDE package for more options and details about using KDE kde_lyr <- kde( lotr_sf, kernel = "quartic", band_width = 15, decay = 1, grid = grid, ) # Visualize KDE layer ggplot_count(kde_lyr) + ggtitle("KDE") # Mask with mask_gd mgd <- mask_gd(kgd, kde_lyr, minval = 1) ggplot_gd(mgd) + ggtitle("Kriged & sample density masked pi") ``` Another nice visualization option is to add a "background" to your plots in the form of a raster or other sp or sf object (e.g., a country or range boundary) which can help provide geographic context: ```{r range map background, fig.width = 5, fig.height = 5, warning = FALSE} ggplot_gd(mgd, bkg = lotr_range) + ggtitle("Kriged & masked pi") ``` ## Parallelization To increase computational speed, users can perform any of the moving window functions calculations with parallelization using future::plan(). Checkout the [furrr package documentation](https://furrr.futureverse.org/) for more details on setting up the back end for parallelization. ```{r parallelization, fig.width = 5, fig.height = 5, eval = FALSE, message = FALSE} # Note: this code is not evaluated when building the vignette as it spawns multiple processes # setup parallel session future::plan("multisession", workers = 2) wgd <- window_gd(lotr_vcf, lotr_coords, lotr_lyr, stat = "pi", wdim = 7, fact = 3, rarify_n = 2, rarify_nit = 5, rarify = TRUE ) # end parallel session future::plan("sequential") ``` *Note:* the `parallel` and `ncores` arguments have been deprecated as of wingen 2.0.1. ## Other genetic diversity metrics You can calculate multiple statistics at once by providing a vector of statistic names to `stat`. ```{r, cache = TRUE, warning = FALSE, warning = FALSE, message = FALSE} multistat_wgd <- window_gd(lotr_vcf, lotr_coords, lotr_lyr, stat = c("pi", "Ho"), wdim = 7, fact = 3, rarify = FALSE, L = 100 ) ``` ```{r, fig.width = 5, fig.height = 5, warning = FALSE} ggplot_gd(multistat_wgd, bkg = lotr_range) ``` `"hwe"` and `"basic_stats"` take longer to calculate, so we do not evaluate these statistics here. We also change the parameters and reduce the dataset in the example to make it faster for users who wish to run it: ```{r, warning = FALSE, message = FALSE, eval = FALSE} hwe_wgd <- window_gd(lotr_vcf[1:10, ], lotr_coords, lotr_lyr, stat = "hwe", wdim = 3, fact = 5, rarify = FALSE, L = 100, sig = 0.10 ) bs_wgd <- window_gd(lotr_vcf[1:10, ], lotr_coords, lotr_lyr, stat = "basic_stats", wdim = 3, fact = 5, rarify = FALSE, L = 100 ) ggplot_gd(hwe_wgd, bkg = lotr_range) ggplot_gd(bs_wgd, bkg = lotr_range) ``` # General moving window We provide a `window_general` function that can be used to make moving window maps for other types of data inputs and functions. Unlike `window_gd`, `window_general` does not require a `vcfR` object or a path to a vcf file as input. The required input (`x`) depends on the statistic (`stat`) being calculated. For the standard genetic diversity statistics: - If `stat = "pi"` or `"biallelic_richness"`, `x` must be a dosage matrix with values of 0, 1, or 2 (*Note:* rows must be individuals). - If `stat = "Ho"`, `x` must be a heterozygosity matrix where values of 0 = homozygosity and values of 1 = heterozygosity (*Note:* rows must be individuals). - If `stat = "allelic_richness"`, `x` must be a `genind` type object. - If `stat = "basic_stats"`, `x` must be a `hierfstat` type object. For other statistics: - If `x` is a vector, `stat` can be any function that can be applied to a vector (e.g., `stat = mean, var, sum, etc.`). - If `x` is a matrix or data frame (*Note:* rows must be individuals), `stat` can be any function that takes a matrix or data frame and outputs a single numeric value (e.g., a function that produces a custom diversity index). *(Note: this functionality has not have been tested extensively and may produce errors, so use with caution)*. As an example, let's create a moving window map of our raster layer values (e.g., carrying capacity and conductance, in this case) at the sample coordinates: ```{r, fig.width = 5, fig.height = 5, cache = TRUE, warning = FALSE, message = FALSE} # First, we extract the raster values at those coordinates vals <- extract(lotr_lyr, lotr_coords) # Next, we run the window_general function with the env vector and set the `stat` to mean # Note: we can also provide additional arguments to functions, such as na.rm = TRUE we <- window_general(vals, coords = lotr_coords, lyr = lotr_lyr, stat = mean, wdim = 7, fact = 3, rarify_n = 2, rarify_nit = 5, rarify = TRUE, na.rm = TRUE ) ggplot_gd(we) + ggtitle("Mean raster value") ``` # Other genetic data input file types in wingen Although the default input data type for wingen is a VCF file, which is a standard file type to encode genomic data, wingen can accept a `genind` object as input to calculate allelic richness using `window_general()`. Given that some wingen functionality can be feasible using a `genind` object, users could easily convert various other genetic data file types to then run wingen. For example, genepop (.gen) is a typical file format for encoding microsatellite data, in which each allele within a locus is coded using a two- or three-digit system (i.e., in a diploid organism, in a two-digit system, each locus would be assigned four digits specifying each of two alleles. In a three-digit system in the same organism, each locus would be assigned six digits encoding those same two alleles). Users can convert a genepop file into a `genind` object using the `read.genepop()` function in the adegenet package, which automatically reads in a .gen file as a `genind` object. In many cases, microsatellite data may not be biallelic, and may contain more than one observed allele at a given locus; in such cases, running `window_general()` with a `genind` object is still feasible. An example of how one could use a `genind` object as input into `window_general()` is as follows: ```{r, fig.width = 5.5, fig.height = 5, cache = TRUE, warning = FALSE, message = FALSE, eval = FALSE} # We use the vcfR package to convert vcf to genind for our example library(vcfR) # Convert existing vcf example file into genind object: genind <- vcfR2genind(lotr_vcf) # Run window_general with no rarefaction we_gi <- window_general(genind, coords = lotr_coords, lyr = lotr_lyr, stat = "allelic_richness", wdim = 7, fact = 3, na.rm = TRUE ) ggplot_gd(we_gi) + ggtitle("Allelic richness from genind") ```