\documentclass[letterpaper]{article} %\VignetteIndexEntry{Algorightms and Equations} %\VignetteEngine{knitr::knitr} \usepackage{graphicx} \usepackage[colorlinks = true, urlcolor = blue, citecolor = blue, linkcolor = blue]{hyperref} \usepackage{array} \usepackage{color} \usepackage[dvipsnames,svgnames,table]{xcolor} \usepackage[utf8]{inputenc} % for UTF-8/single quotes from sQuote() \usepackage{fullpage} \usepackage{mathtools} \usepackage{makeidx} \usepackage{longtable} \usepackage{natbib} % for bold symbols in mathmode \usepackage{bm} \newcommand{\R}{\mathbb{R}} \newcommand{\m}[1]{\mathbf{#1}} \newcommand{\tab}{\hspace*{1em}} \newcolumntype{H}{>{\setbox0=\hbox\bgroup}c<{\egroup}@{}} \newcommand{\cmdlink}[2]{% \texttt{\hyperref[#1]{#2}}% } \newcommand{\seclink}[2]{% \textsc{\hyperref[#1]{#2}}% } \newcommand{\poppr}{\textit{poppr}} \newcommand{\Poppr}{\textit{Poppr}} \newcommand{\adegenet}{\textit{adegenet}} \newcommand{\Adegenet}{\textit{Adegenet}} \newcommand{\tline}{ \noindent \rule{\textwidth}{1pt} \par } \newcommand{\bline}{ \noindent \rule{\textwidth}{1pt} \kern1pt } \newcommand{\jala}{ \includegraphics[height = 5mm, keepaspectratio=true]{jalapeno-poppers} } \newcommand{\revjala}{ \scalebox{-1}[1]{\jala{}} } \title{Algorithms and equations utilized in poppr version \Sexpr{packageVersion("poppr")}} \author{Zhian N. Kamvar$^{1}$\ and Niklaus J. Gr\"unwald$^{1,2}$\\\scriptsize{1) Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR}\\\scriptsize{2) Horticultural Crops Research Laboratory, USDA-ARS, Corvallis, OR}} \begin{document} <>= knitr::opts_knit$set(out.format = "latex") thm <- knitr::knit_theme$get("acid") knitr::knit_theme$set(thm) knitr::opts_chunk$set(concordance=TRUE) knitr::opts_chunk$set(message = FALSE, warning = FALSE) knitr::opts_chunk$set(out.width = '0.95\\linewidth', fig.align = "center", fig.show = 'asis') @ \maketitle \begin{abstract} This vignette is focused on simply explaining the different algorithms utilized in calculations such as the index of association and different distance measures. Many of these are previously described in other papers and it would be prudent to cite them properly if they are used. \end{abstract} \begin{figure}[b] \centering \label{logo} \includegraphics{jalapeno_logo} \end{figure} \newpage \begingroup \hypersetup{linkcolor=black} \tableofcontents \endgroup \section{Mathematical representation of data in \adegenet{} and \poppr{}} The sections dealing with the index of association and genetic distances will be based on the same data structure, a matrix with samples in rows and alleles in columns. The number of columns is equal to the total number of alleles observed in the data set. Much of this description is derived from \adegenet{}'s \texttt{dist.genpop} manual page. \subsection{Caveat} The information here describes the data as it would be passed into the distance functions. Data represented in the genind objects are now counts of alleles instead of frequencies of alleles. Otherwise, these equations still hold. \begin{quote} Let \textbf{A} be a table containing allelic frequencies with $t$ samples\footnote{populations or individuals} (rows) and $m$ alleles (columns).\\ \end{quote} The above statement describes the table present in genind or genpop object where, instead of having the number of columns equal the number of loci, the number of columns equals the number of observed alleles in the entire data set. \begin{quote} Let $\nu$ be the number of loci. The locus $j$ gets $m(j)$ alleles. \begin{equation} m=\sum_{j=1}^{\nu} m(j) \end{equation} \end{quote} So, if you had a data set with 5 loci that had 2 alleles each, your table would have ten columns. Of course, codominant loci like microsatellites have varying numbers of alleles. \begin{quote} For the row $i$ and the modality $k$ of the variable $j$, notice the value \begin{equation} a_{ijk}\\ (1 \leq i \leq t,\\ 1 \leq j \leq \nu,\\ 1 \leq k \leq m(j)) \end{equation} \begin{equation} a_{ij\cdot}=\sum_{k=1}^{m(j)}a_{ijk} \end{equation} \begin{equation} p_{ijk}=\frac{a_{ijk}}{a_{ij\cdot}} \end{equation} \end{quote} The above couple of equations are basically defining the allele counts ($a_{ijk}$) and frequency ($p_{ijk}$). Remember that $i$ is individual, $j$ is locus, and $k$ is allele. The following continues to describe properties of the frequency table used for analysis: \begin{quote} \begin{equation} p_{ij\cdot}=\sum_{k=1}^{m(j)}p_{ijk}=1 \end{equation} \end{quote} The sum of all allele frequencies for a single population (or individual) at a single locus is one. \begin{quote} \begin{equation} p_{i{\cdot}\cdot}=\sum_{j=1}^{\nu}p_{ij\cdot}=\nu \end{equation} \end{quote} The sum of all allele frequencies over all loci is equal to the number of loci. \begin{quote} \begin{equation} p_{{\cdot}{\cdot}\cdot}=\sum_{j=1}^{\nu}p_{i{\cdot}\cdot}=t\nu \end{equation} \end{quote} The the sum of the entire table is the sum of all loci multiplied by the number of populations (or individuals). % % % %-----------------------------------------------------------------------------% % % \section{The Index of Association} \label{indexassoc} The index of association was originally developed by A.H.D. Brown analyzing population structure of wheat and has been widely used as a tool to detect clonal reproduction within populations \citep{Brown:1980, Smith:1993}. Populations whose members are undergoing sexual reproduction, whether it be selfing or out-crossing, will produce gametes via meiosis, and thus have a chance to shuffle alleles in the next generation. Populations whose members are undergoing clonal reproduction, however, generally do so via mitosis. The most likely mechanism for a change in genotype, for a clonal organism, is mutation. The rate of mutation varies from species to species, but it is rarely sufficiently high to approximate a random shuffling of alleles. The index of association is a calculation based on the ratio of the variance of the raw number of differences between individuals and the sum of those variances over each locus \citep{Smith:1993}. It can also be thought of as the observed variance over the expected variance. If both variances are equal, then the index is zero after subtracting one (from Maynard-Smith, 1993 \citep{Smith:1993}): \begin{equation} \label{eq:I_A} I_A = \frac{V_O}{V_E}-1 \end{equation} Any sort of marker can be used for this analysis as it only counts differences between pairs of samples. This can be thought of as a distance whose maximum is equal to the number of loci multiplied by the ploidy of the sample. This is calculated using an absolute genetic distance. Remember that in \poppr{}, genetic data is stored in a table where the rows represent samples and the columns represent potential allelic states grouped by locus. Notice also that the sum of the rows all equal one. \Poppr{} uses this to calculate distances by simply taking the sum of the absolute values of the differences between rows. The calculation for the distance between two individuals at a single locus $j$ with $m(j)$ allelic states and a ploidy of $l$ is as follows\footnote{Individuals with Presence / Absence data will have the $l/2$ term dropped.}: \begin{equation} \label{eq:ia_d} d(a,b)=\frac{l}{2} \sum_{k=1}^{m(j)} |p_{ajk} - p_{bjk}| \end{equation} % \begin{equation} % d = \displaystyle \frac{k}{2}\sum_{i=1}^{a} \mid ind_{Ai} - ind_{Bi}\mid % \end{equation} \noindent To find the total number of differences between two individuals over all loci, you just take $d$ over $\nu$ loci, a value we'll call $D$: \begin{equation} \label{eq:ia_D} D(a,b) = \displaystyle \sum_{i=1}^{\nu} d_i \end{equation} An interesting observation: $D(a,b)/(l\nu)$ is Prevosti's distance. These values are calculated over all possible combinations of individuals in the data set, ${n \choose 2}$ after which you end up with ${n \choose 2}\cdot{}\nu$ values of $d$ and ${n \choose 2}$ values of $D$. Calculating the observed variances is fairly straightforward (modified from Agapow and Burt, 2001) \citep{Agapow:2001}: \begin{equation} \label{eq:V_O} V_O = \frac{\displaystyle \sum_{i=1}^{n \choose 2} D_{i}^2 - \frac{\left(\displaystyle\sum_{i=1}^{n \choose 2} D_{i}\right)^2}{{n \choose 2}}}{{n \choose 2}} \end{equation} Calculating the expected variance is the sum of each of the variances of the individual loci. The calculation at a single locus, $j$ is the same as the previous equation, substituting values of $D$ for $d$ \citep{Agapow:2001}: \begin{equation} \label{eq:var_j} var_j = \frac{\displaystyle \sum_{i=1}^{n \choose 2} d_{i}^2 - \frac{\left(\displaystyle\sum_{i=1}^{n \choose 2} d_i\right)^2}{{n \choose 2}}}{{n \choose 2}} \end{equation} The expected variance is then the sum of all the variances over all $\nu$ loci \citep{Agapow:2001}: \begin{equation} \label{eq:V_E} V_E = \displaystyle \sum_{j=1}^{\nu} var_j \end{equation} Now you can plug the sums of equations (\ref{eq:V_O}) and (\ref{eq:V_E}) into equation (\ref{eq:I_A}) to get the index of association. Of course, Agapow and Burt showed that this index increases steadily with the number of loci, so they came up with an approximation that is widely used, $\bar r_d$ \citep{Agapow:2001}. For the derivation, see the manual for \textit{multilocus}. The equation is as follows, utilizing equations (\ref{eq:V_O}), (\ref{eq:var_j}), and (\ref{eq:V_E}) \citep{Agapow:2001}: \begin{equation} \label{eq:r_d} \bar{r}_d = \frac{V_O - V_E} {2\displaystyle \sum_{j=1}^{\nu}\displaystyle \sum_{k \neq j}^{\nu}\sqrt{var_j\cdot{}var_k}} \end{equation} \section{Genetic distances} Genetic distances are great tools for analyzing diversity in populations as they are the basis for creating dendrograms with bootstrap support and also for AMOVA. This section will simply present different genetic distances along with a few notes about them. Most of these distances are derived from the \textit{ade4} and \adegenet{} packages, where they were implemented as distances between populations. \Poppr{} extends the implementation to individuals as well (with the exception of Bruvo's distance). \begin{table}[ht] \centering \caption{Distance measures and their respective assumptions} \begin{tabular}{lllll} \hline Method & Function & Assumption & Euclidean & Citation\\ \hline Prevosti & \texttt{prevosti.dist} & - & No & \citep{prevosti1975distances}\\ & \texttt{diss.dist} & & & \\ Nei & \texttt{nei.dist} & Infinite Alleles & No & \citep{nei1972genetic, nei1978estimation}\\ & & Genetic Drift & & \\ Edwards & \texttt{edwards.dist} & Genetic Drift & Yes & \citep{edwards1971distances}\\ Reynolds & \texttt{reynolds.dist} & Genetic Drift & Yes & \citep{reynolds1983estimation}\\ Rogers & \texttt{rogers.dist} & - & Yes & \citep{rogers1972measures}\\ Bruvo & \texttt{bruvo.dist} & Stepwise Mutation & No & \citep{Bruvo:2004}\\ \hline \end{tabular} \end{table} \subsection{Distances that assume genetic drift} \subsubsection{Nei's 1978 Distance} \label{distance:nei} \begin{equation} D_{Nei}(a,b)= -\ln\left(\frac{\sum_{k=1}^{\nu} \sum_{j=1}^{m(k)} p_{ajk} p_{bjk}}{\sqrt{\sum_{k=1}^{\nu} \sum_{j=1}^{m(k)} {(p_{ajk}) }^2}\sqrt{\sum_{k=1}^{\nu} \sum_{j=1}^{m(k)} {(p_{bjk})}^2}}\right) \end{equation} Note: if comparing individuals in \poppr{}, those that do not share any alleles normally receive a distance of $\infty$. As you cannot draw a dendrogram with infinite branch lengths, infinite values are converted to a value that is equal to an order of magnitude greater than the largest finite value. \subsubsection{Edwards' angular distance} \label{distance:edwards} \begin{equation} D_{Edwards}(a,b)=\sqrt{1-\frac{1}{\nu} \sum_{k=1}^{\nu} \sum_{j=1}^{m(k)} \sqrt{p_{ajk} p_{bjk}}} \end{equation} \subsubsection{Reynolds' coancestry distance} \label{distance:reynolds} \begin{equation} D_{Reynolds}(a,b)=\sqrt{\frac{\sum_{k=1}^{\nu} \sum_{j=1}^{m(k)}{(p_{ajk} - p_{bjk})}^2}{2 \sum_{k=1}^{\nu} \left(1- \sum_{j=1}^{m(k)}p_{ajk} p_{bjk}\right)}} \end{equation} \subsection{Distances without assumptions} \subsubsection{Rogers' distance} \label{distance:rogers} \begin{equation} D_{Rogers}(a,b)=\frac{1}{\nu} \sum_{k=1}^{\nu} \sqrt{\frac{1}{2} \sum_{j=1}^{m(k)}{(p_{ajk} - p_{bjk})}^2} \end{equation} \subsubsection{Prevosti's absolute genetic distance} \label{distance:prevosti} \begin{equation} D_{Prevosti}(a,b)=\frac{1}{2{\nu}} \sum_{k=1}^{\nu} \sum_{j=1}^{m(k)} |p_{ajk} - p_{bjk}| \end{equation} Note: for AFLP data, the $2$ is dropped. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Bruvo's distance (stepwise mutation for microsatellites)} \label{bruvo} Bruvo's distance between two individuals calculates the minimum distance across all combinations of possible pairs of alleles at a single locus and then averaging that distance across all loci \citep{Bruvo:2004}. The distance between each pair of alleles is calculated as\footnote{Notation presented unmodified from Bruvo et al, 2004}\citep{Bruvo:2004}: \begin{equation} \label{eq:m_x} m_x = 2^{-\mid x \mid} \end{equation} \begin{equation} \label{eq:d_a} d_a = 1 - m_x \end{equation} Where $x$ is the number of steps between each allele. So, let's say we were comparing two haploid $(k = 1)$ individuals with alleles 228 and 244 at a locus that had a tetranucleotide repeat pattern (CATG$)^n$. The number of steps for each of these alleles would be $228/4 = 57$ and $244/4 =61$, respectively. The number of steps between them is then $\mid 57 - 61 \mid = 4$. Bruvo's distance at this locus between these two individuals is then $1-2^{-4} = 0.9375$. For samples with higher ploidy ($k$), there would be $k$ such distances of which we would need to take the sum \citep{Bruvo:2004}. \begin{equation} \label{eq:s_i} s_i = \displaystyle \sum_{a=1}^{k} d_a \end{equation} Unfortunately, it's not as simple as that since we do not assume to know phase. Because of this, we need to take all possible combinations of alleles into account. This means that we will have $k^2$ values of $d_a$, when we only want $k$. How do we know which $k$ distances we want? We will have to invoke parsimony for this and attempt to take the minimum sum of the alleles, of which there are $k!$ possibilities \citep{Bruvo:2004}: \begin{equation} \label{eq:d_l} d_l = \frac{1}{k}\left(\min_{i \dotsc k!} s_i\right) \end{equation} \noindent Finally, after all of this, we can get the average distance over all loci \citep{Bruvo:2004}. \begin{equation} \label{eq:D} D = \frac{1}{l}\sum_{i=1}^l d_i \end{equation} \noindent This is calculated over all possible combinations of individuals and results in a lower triangle distance matrix over all individuals. \subsubsection{Special cases of Bruvo's distance} \label{appendix:algorithm:bruvospecial} As shown in the above section, ploidy is irrelevant with respect to calculation of Bruvo's distance. However, since it makes a comparison between all alleles at a locus, it only makes sense that the two loci need to have the same ploidy level. Unfortunately for polyploids, it's often difficult to fully separate distinct alleles at each locus, so you end up with genotypes that appear to have a lower ploidy level than the organism \citep{Bruvo:2004}. To help deal with these situations, Bruvo has suggested three methods for dealing with these differences in ploidy levels \citep{Bruvo:2004}: \begin{itemize} \item{Infinite Model -} The simplest way to deal with it is to count all missing alleles as infinitely large so that the distance between it and anything else is 1. Aside from this being computationally simple, it will tend to inflate distances between individuals. \item{Genome Addition Model -} If it is suspected that the organism has gone through a recent genome expansion, the missing alleles will be replace with all possible combinations of the observed alleles in the shorter genotype. For example, if there is a genotype of [69, 70, 0, 0] where 0 is a missing allele, the possible combinations are: [69, 70, 69, 69], [69, 70, 69, 70], [69, 70, 70, 69], and [69, 70, 70, 70]. The resulting distances are then averaged over the number of comparisons. \item{Genome Loss Model -} This is similar to the genome addition model, except that it assumes that there was a recent genome reduction event and uses the observed values in the full genotype to fill the missing values in the short genotype. As with the Genome Addition Model, the resulting distances are averaged over the number of comparisons. \item{Combination Model -} Combine and average the genome addition and loss models. \end{itemize} As mentioned above, the infinite model is biased, but it is not nearly as computationally intensive as either of the other models. The reason for this is that both of the addition and loss models requires replacement of alleles and recalculation of Bruvo's distance. The number of replacements required is $n^k$ where $n$ is the number of potential replacements and $k$ is the number of alleles to be replaced. To reduce the number of calculations and assumptions otherwise, Bruvo's distance will be calculated using the largest observed ploidy in pairwise comparisons. This means that when comparing [69,70,71,0] and [59,60,0,0], they will be treated as triploids. \subsubsection{Choosing a model} \label{appendix:algorithm:bruvomodel} By default, the implementation of Bruvo's distance in \poppr{} will utilize the combination model. This is implemented by setting both the \texttt{add} and \texttt{loss} arguments to \texttt{TRUE}. For other models use the following table for reference: \begin{table}[ht] \centering \begin{tabular}{ll} \hline Model & Arguments \\ \hline Infinite & \texttt{add = FALSE, loss = FALSE} \\ Genome Addition & \texttt{add = TRUE, loss = FALSE} \\ Genome Loss & \texttt{add = FALSE, loss = TRUE} \\ Combination (default) & \texttt{add = TRUE, loss = TRUE} \\ \hline \end{tabular} \end{table} \subsection{Tree topology} All of these distances were designed for analysis of populations. When applying them to individuals, we must change our interpretations. For example, with Nei's distance, branch lengths increase linearly with mutation \citep{nei1972genetic,nei1978estimation}. When two populations share no alleles, then the distance becomes infinite. However, we expect two individuals to segregate for different alleles more often than entire populations, thus we would expect exaggerated internal branch lengths separating clades. To demonstrate the effect of the different distances on tree topology, we will use 5 diploid samples at a single locus demonstrating a range of possibilities: \begin{table}[ht] \centering \begin{tabular}{c} \hline Genotype \\ \hline 1/1 \\ 1/2 \\ 2/3 \\ 3/4 \\ 4/4 \\ \hline \end{tabular} \caption{Table of genotypes to be used for analysis} \end{table} <<>>= library("poppr") dat.df <- data.frame(Genotype = c("1/1", "1/2", "2/3", "3/4", "4/4")) dat <- as.genclone(df2genind(dat.df, sep = "/", ind.names = dat.df[[1]])) @ We will now compute the distances and construct neighbor-joining dendrograms using the package \textit{ape}. This allows us to see the effect of the different distance measures on the tree topology. <<>>= distances <- c("Nei", "Rogers", "Edwards", "Reynolds", "Prevosti") dists <- lapply(distances, function(x){ DISTFUN <- match.fun(paste(tolower(x), "dist", sep = ".")) DISTFUN(dat) }) names(dists) <- distances # Adding Bruvo's distance at the end because we need to specify repeat length. dists$Bruvo <- bruvo.dist(dat, replen = 1) library("ape") par(mfrow = c(2, 3)) x <- lapply(names(dists), function(x){ plot(nj(dists[[x]]), main = x, type = "unrooted") add.scale.bar(lcol = "red", length = 0.1) }) @ \section{AMOVA} AMOVA in \poppr{} acts as a wrapper for the \textit{ade4} implementation, which is an implementation of Excoffier's original formulation \citep{excoffier1992analysis}. As the calculation relies on a genetic distance matrix, \poppr{} calculates the distance matrix as the number of differing sites between two genotypes using equation \ref{eq:ia_D}. This is equivalent to Prevosti's distance multiplied by the ploidy and number of loci. This is also equivalent to Kronecker's delta as presented in \citep{excoffier1992analysis}. As is the practice from \textsc{Arlequin}, within sample variance is calculated by default. Within sample variance can only be calculated on diploid data. \section{Genotypic Diversity} Many of the calculations of genotypic diversity exist within the package \textit{vegan} in the \texttt{diversity} function. Descriptions of most calculations can be found in the paper by \cite{Grunwald:2003}. \subsection{Rarefaction, Shannon-Wiener, Stoddart and Taylor indices} Stoddart and Taylor's index is also known as Inverse Simpson's index. A detailed description of these can be found in the ``Diversity'' vignette in \textit{vegan}. You can access it by typing \texttt{vignette("diversity-vegan")} \subsection{Evenness ($E_{5}$)} Evenness ($E_{5}$) is essentially the ratio of the number of abundant genotypes to the number of rarer genotypes. For the function \texttt{locus\_table}, these indices represent the abundance of alleles. This is simply calculated as \begin{equation} E_{5} = \frac{(1/\lambda) - 1}{e^{H} - 1} \end{equation} Where $1/\lambda$ is Stoddart and Taylor's index and $H$ is Shannon diversity \citep{Stoddart:1988,Shannon:1948}. The reason for this formulation is because $1/\lambda$ is weighted for more abundant genotypes and $H$ is weighted for rarer genotypes. \subsection{Hexp} $H_{exp}$ stands for Nei's 1978 expected heterozygosity, also known as gene diversity \citep{nei1978estimation}. Previously, in \poppr{}, it was calculated over multilocus genotypes, which was more analogous to a correction for Simpson's index. This has been corrected in version 2.0 and is now correctly calculated over alleles. The following is adapted from equations 2 and 3 in \citep{nei1978estimation}: \begin{equation} h = \left(\frac{n}{n-1}\right) 1 - \sum_{i = 1}^k{p^{2}_{i}} \end{equation} \begin{equation} H_{exp} = \sum_{l = 1}^m{h_l/m} \end{equation} where $p_i$ is the population frequency of the $i^{th}$ allele in a locus, $k$ is the number of alleles in a particular locus, $n$ represents the number of \textbf{observed alleles} in the population, and $h_l$ is the value of $h$ at the $l^{th}$ locus. Nei's definition originally corrected the sum of squared allele frequencies with $2N/(2N - 1)$, where $N$ is the number of diploid samples \citep{nei1978estimation}. This definition is still congruent with that since we would observe $2N$ alleles in a diploid locus. The only caveat is that populations where allelic dosage is unknown will have an inflated estimate of allelic diversity due to the inflation of rare alleles. \section{Probability of Genotypes: $p_{gen}$} The metric $p_{gen}$ represents the probability that a given multilocus genotype will occur in a population \citep{parks1993study, arnaud2007standardizing}. For $\nu$ loci, it is calculated as: \begin{equation} h = \begin{cases} 1 \text{ if heterozygous},\\ 0 \text{ if homozygous or haploid} \end{cases} \end{equation} \begin{equation} p_{gen} = \prod_{j = 1}^\nu\left(2^{h}\prod_{i = 1}^{m(j)} p_{ij}\right) \\ \end{equation} where $p_{ij}$ represents the allele frequency of the $i^{th}$ allele present in the sample at the $j^{th}$ locus derived using the round-robin method \citep{parks1993study, arnaud2007standardizing}. The round-robin allele frequencies for a given locus are derived by first clone-correcting with the genotypes found at all other loci \citep{parks1993study, arnaud2007standardizing}. Any allele that ends up with a zero frequency are assigned a value of $1/N$, where $N$ is the number of samples. \section{Probability a genotype is derived from sexual reproduction: $p_{sex}$} The probability of a genotype can provide information as to how many times we expect to see that genotype repeated in our population. \citet{parks1993study} defined the probability of encountering a genotype a second time from an independent reproductive event as: \begin{equation} p_{sex} = 1 - (1 - p_{gen})^G \end{equation} Where $G$ is the number of multilocus genotypes in the data. Finding a genotype $n$ times out of $N$ samples is derived from the binomial distribution \citep{parks1993study, arnaud2007standardizing}: \begin{equation} p_{sex} = \sum_{i = n}^N {N \choose i} \left(p_{gen}\right)^i\left(1 - p_{gen}\right)^{N - i} \end{equation} The \poppr{} function \texttt{psex()} can calculate both of these quantities. \bibliographystyle{authordate1} \bibliography{the_bibliography} \end{document}