\documentclass{article} \usepackage{amsmath} \usepackage{enumerate} \usepackage{listings} \usepackage{natbib}[numbers, sort&compress] \setcitestyle{numbers,open={[},close={]}} \usepackage{url} \newcommand{\rcode}[1]{\lstinline[language=R,basicstyle=\normalsize\ttfamily]!#1!} \setcounter{section}{1} %\VignetteEngine{knitr::knitr} %\VignetteIndexEntry{Examples} \title{Fitting a zeta distribution to a {P} survey---number of groups data} \begin{document} <>= options(width=70) knitr::opts_chunk$set(message = FALSE, #background = opcolour, comment = "", highlight = TRUE, prompt = TRUE, tidy = TRUE, tidy.opts = list(arrow = FALSE), warning = FALSE) suppressWarnings(suppressPackageStartupMessages({library(fitPS)})) knitr::knit_hooks$set(document = function(x) { gsub('([\n]+\\\\end\\{knitrout\\}[\n]+)', '\n\\\\end\\{knitrout\\}\\\\noindent\n', paste(x, collapse = '\n')) }) @ We have, as noted throughout this paper, tried to make the statistical methodology as accessible as possible by implementing it in an R package. In this section we will demonstrate the use of the package with data from various surveys. \subsection{Fitting a zeta distribution} We start with the data from \citet{roux2001} who conducted a footwear survey, where shoes from 776 individuals were examined for glass. This data set is built into the package and can be accessed from the \rcode{Psurveys} object. That is, we can type: <>= data("Psurveys") roux = Psurveys$roux @ The package includes a special printing function that summarises the data for reading rather than displaying it in the way it is stored. R prints the values of objects (or variables) simply by typing their name. For example <>= roux @ It is very simple to fit a zeta distribution to this data set. We do this using the \rcode{fitDist} function. <>= fit = fitDist(roux) @ There is nothing special about the variable name \rcode{fit}. It could be anything, for example \rcode{x}, or \rcode{blah}. However we choose the variable name \rcode{fit} because it is easy to remember that it is a \emph{fitted} object. The package includes specialised functions for both printing and plotting the fitted object. The \rcode{print} method displays an estimate of the shape parameter $\alpha$, an estimate of the standard deviation---the standard error---of the estimate of $\alpha$ ($\widehat{\mathrm{sd}}(\hat{\alpha})=\mathrm{se}(\hat{\alpha})$). \textbf{Note:} it is important to understand that the value of the shape parameter that is displayed, and the value that is stored in the fitted object differ by 1. That is, $\alpha$ is shown, and $s = \alpha - 1$ is stored. This difference only has consequences if the fitted value is being used in conjunction with other functions. \subsubsection{Using the fitted distribution to estimate $P$ terms} The \rcode{print} method also displays the first 10 fitted probabilities from the model by default. <>= fit @ This information is probably sufficient for most casework. However, the package has a function, \rcode{probfun}, that returns a bespoke function that can calculate any probability term. This function is applied a fitted object. For example <>= P = probfun(fit) @ Again, \rcode{P} is just a variable name and we could have used anything. We have chosen \rcode{P} because this probability function returns $P$ terms. To use it, we type only need to provide the value of $k$, and the function will return $P_k$. For example <>= P(5) @ \subsection{Fitting a zero-inflated zeta distribution} We can also easily fit a zero-inflated zeta model using the \rcode{fitZIDist} function\footnote{Functions with mixed case names are often annoying. For that reason, the package also allows \rcode{fitZIdist} and \rcode{fitzidist}.}. As before, we can choose a variable name to store the results in. <>= fit.zi = fitZIDist(roux) fit.zi @ In the example above we fit a zero-inflated model to Roux et al.'s data, and print out the resulting fit. We get, as with the zeta model, estimates of the parameters and a default set of fitted values. The output is interesting in that we can see (from the value of $\hat{\pi}$) that the \emph{zero} part of the zero-inflated model is picking up about \Sexpr{round(100 * fit.zi$pi)}\% of the zeros. It is interesting to compare the estimates from the raw frequencies, the zeta model, and those of the ZIZ model. The estimates are shown in Table \ref{tab:ex1}. <>= library(xtable) raw = c(roux$data$rn/sum(roux$data$rn), 0) tbl = cbind(0:5, raw, fitted(fit, 6), fitted(fit.zi, 6)) colnames(tbl) = c("$k$", "$P_k^{raw}$", "$P_k^{zeta}$", "$P_k^{ZIZ}$") tbl = xtable(tbl, align = "ccccc", digits = c(0, 0, 4, 4, 4)) caption(tbl) = "Estimated probability that $k$ groups of glass would be found in shoes of a random member of the population based on the data of \\citep{roux2001}, the raw frequencies, and those produced from the zeta and ZIZ models respectively." label(tbl) = "tab:ex1" print(tbl, type="latex", include.rownames = FALSE, sanitize.text.function = function(x){x}) @ We can see from Table \ref{tab:ex1} that we now have a non-zero estimate for $P_5$, but this comes at a small cost, namely smaller probabilities for the preceding terms $P_0$--$P_4$. This is not necessarily a negative. The survey data is dominated by zeros. However, we think it likely that the raw sample estimates (for $P_0$--$P_4$) are over-estimates. The model reduces the estimated value, which is in line with our thinking. Interestingly, the effect of including the zero-inflation factor is to increase nearly all of the probabilities, with the exception of $P_1$. A natural question to ask is ``Which model is correct?'' The answer, unhelpfully, is ``Neither.'' This is because these are simply models. They can still help us without us having to believe that they are true. \subsection{Confidence intervals for the parameter estimates} The \texttt{fitPS} package provides a \rcode{confint} method for the fitted value. The method returns both a Wald confidence interval and profile likelihood interval. The two intervals are returned as elements of a \rcode{list} named \rcode{wald} and \rcode{prof} respectively. <>= ci = confint(fit) ci$wald ci$prof @ Readers may notice neither of these intervals contain the estimated value shown in the previous output. However, this simply because they are confidence intervals on $s^\prime$ and not $s$. This can be remedied by adding one to each interval: <>= ci$wald + 1 ci$prof + 1 @ The reason for not \emph{correcting} these intervals is that the method mostly exists to feed into other parts of the package, especially the \rcode{plot} method. \subsubsection{Bootstrapped and profile likelihood confidence regions for the zero-inflated zeta} The package includes the facility to compute both bootstrapped and profile likelihood confidence regions for the parameters of the zero-inflated zeta distribtion. It does, also, in fact compute bootstrapped confidence intervals for the zeta distribution, but we do not do demonstrate this functionality here. The \rcode{confint} function returns a confidence region if the fitted object contains information from a zero-inflated zeta fit. As an example, we will first compute profile likelihood confidence regions for the \citet{roux2001} data. To do this we use the fitted object we previously created, \rcode{fit.zi}, and, although not required, we supply a set of two levels so that we can compute both an 80\% and a 95\% confidence region. \rcode{confit} returns a list of confidence regions---one for each level---each of which are simply a set of $x$ and $y$ coordinates corresponding to the appropriate contour line. We can use this information for plotting. The code to produce Figure \ref{fig:confregion} is given below. <>= cr = confint(fit.zi, level = c(0.80, 0.95)) plot(cr[["0.95"]], type = "l") polygon(cr[["0.8"]], border = "red") legend("topright", lty = 1, lwd = 2, col = c("red", "black"), legend = c("80%", "95%"), bty = "n") @ <>= pdf(file = "confregion.pdf", height = 3.937008) cr = confint(fit.zi, level = c(0.80, 0.95)) plot(cr[["0.95"]], type = "l") polygon(cr[["0.8"]], border = "red") legend("topright", lty = 1, lwd = 2, col = c("red", "black"), legend = c("80%", "95%"), bty = "n") graphics.off() @ \begin{figure}[ht] \centering \includegraphics[width=0.95\textwidth,keepaspectratio]{confregion} \caption{80\% and 95\% confidence regions for the parameters of a zero-inflated zeta distribution fitted to the \citet{roux2001} data.} \label{fig:confregion} \end{figure} A bootstrapped confidence region can be computed using the \rcode{bootCI} function. The \rcode{bootCI} function includes the facility to plot the resulting confidence region(s) as and hide or display the function's progress. The latter is important because this procedure is numerically intensive, and even with utilising parallel processing, can be quite slow. The code below produces Figure \ref{fig:bootconfregion} <>= bcr = bootCI(roux, model = "ziz", plot = TRUE, silent = TRUE) @ <>= pdf("bootconfregion.pdf", height = 3.937008) bcr = bootCI(roux, model = "ziz", plot = TRUE, silent = TRUE) graphics.off() @ \begin{figure}[ht] \centering \includegraphics[width=0.95\textwidth,keepaspectratio]{bootconfregion} \caption{A 95\% bootstrapped confidence region for the parameters of a zero-inflated zeta distribution fitted to the \citet{roux2001} data.} \label{fig:bootconfregion} \end{figure} \subsection{Comparing two surveys} We can use the methodology that has been demonstrated so far to compare surveys. One reason for comparing surveys is to explore the hypothesis that there is no difference in the underlying ``true\footnote{Readers may be aware that one of the authors is fundementally Bayesian at heart, and the so the concept of the true value of a population parameter is antithetical to the school of thought. We, however, proceed on the basis that the Frequenist school of thought is not incorrect, but just differs in interpretation.}'' value of $\alpha$. If there is insufficient evidence to reject this hypothesis, then one may feel justified in combining data from two surveys. In the first instance we will take an ad-hoc approach, and then treat this problem more formally. In our ad-hoc approach we will compare confidence intervals for two-surveys. If these confidence intervals overlap, then we might conclude that there is insufficient evidence in the data to suggest that the estimates of $\alpha$ are different. We will illustrate this with the surveys conducted by \citet{lau1997} and \citet{jackson2013}. \citet{lau1997} surveyed the clothing of 213 Canadian high school students and observed two sets of clothing with one fragment on each. Similarly, \citet{jackson2013} surveyed 232 ``randomly'' selected members of the population of New South Wales in Australia. We place ``randomly'' in quotes because this was not a true random sample, but rather a convenience sample. That being said, it is unlikely that using a truly random mechanism would have significantly changed the results. \citet{jackson2013} found a single fragment of glass on each of six people. The data from each of these two surveys is summarised in Table \ref{tab:lau_and_jackson}. <>= lau = Psurveys$lau jackson = Psurveys$jackson tbl = cbind(0:1, lau$data$rn, jackson$data$rn) library(xtable) tbl = xtable(tbl, digits = 0) align(tbl) = "cr|r|r" label(tbl) = "tab:lau_and_jackson" caption(tbl) = "Survey results from \\citet{lau1997} and \\citet{jackson2013}." print(tbl, include.rownames = FALSE, include.colnames = FALSE, contents.only = TRUE, add.to.row = list(pos = list(-1, 0), command = c("\\multicolumn{1}{c}{} & \\multicolumn{2}{c}{$r_n$} \\\\", "$n$ & Lau et al. & Jackson et al. \\\\")), hline.after = 0, sanitize.colnames.function = function(x){x}) @ Visual inspection of these surveys would suggest that they are fairly similar. We can fit a zeta distribution to each survey, and then compute a confidence interval for each survey. Again, these data sets are included in the \texttt{fitPS} package. <>= lau = Psurveys$lau jackson = Psurveys$jackson fit.lau = fitDist(lau) fit.jackson = fitDist(jackson) confint(fit.lau)$wald confint(fit.jackson)$wald @ We can see from the output that there is substantial overlap between these two (Wald) confidence intervals. The results using profile likelihood intervals lead to the same conclusion but are not shown. We can test this more formally. Specifically, we wish to test the (null) hypothesis that \[ H_0:\alpha_{1} = \alpha_{2}\mbox{ or equivalently }H_0:\alpha_{1} - \alpha_{2} = 0, \] where $\alpha_1$ is the true value of $\alpha$ for the the Lau et al. data, and $\alpha_2$ is the true value of $\alpha$ for the the Jackson et al. data. We choose a two-tailed alternative, meaning we are not concerned about the sign of any difference, but simply the magnitude of the difference. That is, \[ H_1:\alpha_{1} \neq \alpha_{2}\mbox{ or equivalently }H_1:\alpha_{1} - \alpha_{2} \neq 0. \] We test this hypothesis by constructing a test statistic, then computing a \emph{P}-value under the assumption that the null hypothesis is true. We are interested in the difference between the two population values of $\alpha$. We estimate this by computing the difference in the sample estimates. That is, our estimate of $\alpha_{1}-\alpha_{2}$, is given by $\hat{\alpha}_{1}-\hat{\alpha}_{2}$, where $\hat{\alpha}_{1}$ and $\hat{\alpha}_{2}$ are the maximum likelihood estimates based on the survey data. We scale this difference by the estimated standard deviation in the difference, that is, by the standard error of the difference, $se(\hat{\alpha}_{1}-\hat{\alpha}_{2})$. We estimate this---to keep the statistical theory to a minimum---as the sum of the square root of the two estimated variances, i.e. \[ se(\hat{\alpha}_{1}-\hat{\alpha}_{2}) = \sqrt{\hat{V}(\hat{\alpha}_1)+\hat{V}(\hat{\alpha}_2)}. \] Our test statistic is then \[ Z_0 = \frac{\hat{\alpha}_{1}-\hat{\alpha}_{2}}{se(\hat{\alpha}_{1}-\hat{\alpha}_{2})}. \] It can be shown that this test statistic follows an approximate normal distribution under the null hypothesis. That means we can compute our \emph{P}-value by evaluating \begin{align*} P&=\Pr(Z > |Z_0|) \\ &=2(1 - \Pr(Z < |Z_0|)). \end{align*} All this theory has been put in a function called \rcode{compareSurveys} <>= compareSurveys(lau, jackson) @ We can see that the \emph{P}-value is \Sexpr{round(compareSurveys(lau, jackson)$p.value,2)} (2 d.p.). This is significantly larger than either 0.05, or 0.01. Based on this we would conclude that there is insufficient evidence to reject the null hypothesis of a common value of $\alpha$, and therefore it may be sensible to combine data from these two surveys. We could have also used the theory of likelihood ratio tests \citep{wiki:lrt} to test this hypothesis, but that is beyond the scope of this article. We note, however, that the \rcode{fitPS} package contains a function called \rcode{compareSurveysLRT} which can compare two \emph{or more} surveys simultaneously using a likelihood ratio test. \begin{thebibliography}{4} \providecommand{\natexlab}[1]{#1} \providecommand{\url}[1]{\texttt{#1}} \expandafter\ifx\csname urlstyle\endcsname\relax \providecommand{\doi}[1]{doi: #1}\else \providecommand{\doi}{doi: \begingroup \urlstyle{rm}\Url}\fi \bibitem[Roux et~al.(2001)Roux, Kirk, Benson, {Van Haren}, and Petterd]{roux2001} C.~Roux, R.~Kirk, S.~Benson, T.~{Van Haren}, and C.~I. Petterd. \newblock Glass particles in footwear of members of the public in south-eastern australia---a survey. \newblock \emph{Forensic Science International}, 116\penalty0 (2):\penalty0 149--156, 2001. \newblock \doi{https://doi.org/10.1016/S0379-0738(00)00355-8}. \bibitem[Lau et~al.(1997)Lau, Beveridge, Callowhill, Conners, Foster, Groves, Ohashi, Sumner, and Wong]{lau1997} L.~Lau, A.~D. Beveridge, B.~C. Callowhill, N.~Conners, K.~Foster, R.~J. Groves, K.~N. Ohashi, A.~M. Sumner, and H.~Wong. \newblock The frequency of occurrence of paint and glass on the clothing of high school students. \newblock \emph{Canadian Society of Forensic Science Journal}, 30\penalty0 (4):\penalty0 233--240, 1997. \newblock \doi{10.1080/00085030.1997.10757103}. \bibitem[Jackson et~al.(2013)Jackson, { Maynard}, Cavanagh-Steer, Dusting, and Roux]{jackson2013} F.~Jackson, P.~{ Maynard}, K.~Cavanagh-Steer, T.~Dusting, and C.~Roux. \newblock A survey of glass found on the headwear and head hair of a random population vs. people working with glass. \newblock \emph{Forensic Science International}, 226\penalty0 (1):\penalty0 125--131, 2013. \newblock \doi{https://doi.org/10.1016/j.forsciint.2012.12.017}. \bibitem[contributors(2024)]{wiki:lrt} Wikipedia contributors. \newblock Likelihood-ratio test, 2024. \newblock URL \url{https://en.wikipedia.org/wiki/Likelihood-ratio_test}. \newblock [Online; accessed 8-January-2024]. \end{thebibliography} \end{document}