% \VignetteIndexEntry{BGmix Tutorial} % \VignetteKeywords{Expression Analysis} % \VignettePackage{BGmix} \documentclass[11pt]{article} \usepackage{amsmath,fullpage} %\usepackage{graphicx} \usepackage{hyperref} %% bibliography \usepackage{natbib} \newcommand{\nc}{\newcommand} \nc{\be}{\begin{eqnarray}} \nc{\ee}{\end{eqnarray}} \nc{\pr}{\ensuremath{\mathbb{P}}} \parindent 0in \begin{document} \title{\bf BGmix} \author{Alex Lewin$^*$, Natalia Bochkina$^\dag$} \maketitle \begin{center} $^*$Centre for Biostatistics, Department of Epidemiology and Public Health,\\ Imperial College London\\ $^\dag$School of Mathematics, University of Edinburgh\\ \url{http://www.bgx.org.uk/alex/}\\ {\tt a.m.lewin@imperial.ac.uk} \end{center} \tableofcontents %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Overview} The {\tt BGmix} package implements the models used in \cite{lewin07} for finding differential expression for genes in two or more samples. When there are two samples, a 3-component mixture is used to classify genes as over-expressed, under-expressed and non-differentially expressed, and gene variances are modelled exchangeably to allow for variability between genes. The model is fully Bayesian, estimating the proportion of differentially expressed genes and the mixture parameters simultaneously. The model can also be run with unstructured priors, for use with multi-class data.\\ Several different parametric models are possible. An important part of the analysis is to check if the model is a reasonable fit to the data, and we do this via predictive checks.\\ The analysis is carried out using Markov Chain Monte Carlo. Convergence of the output can be checked using the {\tt coda} package available from CRAN. We also provide a function to plot the trace of the parameters as part of the {\tt BGmix} package.\\ Two alternatives are provided for assessing error rates. With the mixture model, an estimate of the false discovery rate (FDR) based on posterior probabilities can be calculated. For unstructured priors, a tail posterior probability method (\cite{bochkina07}) can be used.\\ The input to the model can be expression data processed by any algorithm. We provide a function {\tt readBGX} to read in the output from the package {\tt BGX}, which is a fully Bayesian hierarchical model for obtaining gene expression measures (\cite{bgx}). %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %\section{Installation}\label{sec:install} %Installation is as standard for linux, eg. {\tt install.packages("BGmix_**.tar.gz"}, %or if you download the zip file first you can use {\tt R CMD install BGmix_**.tar.gz} %or {\tt install.packages("BGmix_**.tar.gz",repos=NULL}. %Currently the package relies on the GNU libraries GSL and GSL CBLAS %(these are fairly standard libraries). %They must be installed on your system for BGmix to work. BGmix should find the %libraries if they are installed. If they are not, you can download them from here: %http://www.gnu.org/software/gsl/. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Data format}\label{sec:data} Data input to {\tt BGmix} consists of sample mean and sample variance for each gene, under each experimental condition. Three R objects are required as arguments to the {\tt BGmix} function: \begin{itemize} \item {\tt ybar}: a matrix, whose columns correspond to experimental conditions and rows correspond to genes. Each column contains sample means for all genes under one condition. \item {\tt ss}: a matrix, whose columns correspond to experimental conditions and rows correspond to genes. Each column contains sample variances for all genes under one condition. Sample variances must be the unbiased estimates, i.e. divide by no. replicates - 1 (this is the default for the standard R {\tt var} function). \item {\tt nreps}: a vector containing the number of replicates in each condition. \end{itemize} Note that for a paired design, the data is treated as having only one `condition', and {\tt ybar} is then the mean \emph{difference} between the two experimental conditions.\\ The data must be transformed so that Normal sampling errors are a reasonable assumption (eg. with a log or shifted-log transform), and normalised if necessary. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Experimental design}\label{sec:design} The first level of the model can be written as a regression for each gene: \be \bar{y}_{g} = X^T.\beta_g + \epsilon_g \ee where $\bar{y}_{g}$ is the vector of sample means for different conditions and $\beta_g$ is a vector of effect parameters. Different parametrisations can be achieved using the `design' matrix $X$. At most 1 effect parameter can have a mixture prior. (This will generally be the differential expression parameter.) \\ By default, genes have a separate variance parameter for each condition ($\sigma^2_{gc}$). However, a more general variance structure can be used, for instance each gene can have one variance across all conditions ($\sigma^2_{g}$). \\ The {\tt BGmix} function takes four arguments relating to the parametrisation: \begin{itemize} \item {\tt xx}: design matrix X. The dimensions of $X$ must be no. effects x no. conditions. \item {\tt jstar}: label of the effect which has the mixture prior. Labels start at 0, since this is passed to C++. If all parameters are fixed effects, set {\tt jstar} = -1. \item {\tt ntau}: the number of variance parameters for each gene. \item {\tt indtau}: label for each condition indicating which variance grouping that condition belongs to. The length of {\tt indtau} must be the same as the number of conditions. \end{itemize} The defaults for these parameters are those for the \textbf{differential expression, unpaired data} case (see below). %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Differential expression, unpaired data} For unpaired data $\beta_{g1}$ is the overall mean for gene $g$ and $\beta_{g2}$ is the differential expression parameter. Here $X =\left( \begin{array}{cc} 1&1 \\ -1/2 & 1/2 \end{array}\right)$, {\tt jstar} = 1. Two variance structures are commonly used: for gene variances per condition ($\sigma^2_{gc}$), use {\tt ntau} = 2, {\tt indtau} = 0:1. For one variance across all conditions ($\sigma^2_{g}$), use {\tt ntau} = 1, {\tt indtau} = 0. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Differential expression, paired data} For paired data there is only one condition and one effect, which is the differential expression parameter. Here $X = 1$, {\tt jstar} = 0, {\tt ntau} = 1, {\tt indtau} = 0. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Multi-class data} If one fixed effect per condition is required, set $X$ to be the identity matrix and {\tt jstar} = -1. For gene variances per condition ($\sigma^2_{gc}$), use {\tt ntau} = no. conditions, {\tt indtau = 0:(nconds-1)}. For one variance across all conditions ($\sigma^2_{g}$), use {\tt ntau} = 1, {\tt indtau} = 0. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Example: How to run the model} At a minimum, you must consider the data set and experimental design parameters in order to run the model (see Sections \ref{sec:data} and \ref{sec:design}). \\ We demonstrate BGmix on a small simulated data set. This consists of 8 replicates of 1000 genes in 2 experimental conditions. We look for differential expression between the two conditions, with an unpaired design. \\ First read in the data: <>= library(BGmix) data(ybar,ss) @ The default experimental design parameters are those for unpaired differential expression, so these can be left out here. The following command fits BGmix using a mixture of a point mass at zero for the null distribution and a Gamma and a reflected Gamma for the alternatives. <<>>= outdir <- BGmix(ybar, ss, nreps=c(8,8),niter=1000,nburn=1000) @ The function {\tt BGmix} returns the output directory name. The output directory contains several types of file: \begin{itemize} \item summary of model options (summary.txt) \item posterior means (mean*.txt) \item probability of being classified in the null component (prob-class.txt) \item trace of posterior distribution (trace*.txt) \item predictive p-values (pval*.txt) \end{itemize} Output data can be read into R with the functions {\tt ccSummary} , {\tt ccParams} (this reads in posterior means and classification probabilities), {\tt ccTrace} and {\tt ccPred}. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Plotting the results} First read in posterior means: <<>>= params <- ccParams(file=outdir) @ The output of {\tt ccParams} is a list of vectors and matrices corresponding to the different model parameters. These are easily plotted using standard R functions. For an unpaired differential expression design some standard plots are included in the package. These show smoothing of parameters and classification of genes into different mixture components: <>= plotBasic(params,ybar,ss) @ The estimated FDR (false discovery rate) can also be plotted: <>= par(mfrow=c(1,2)) fdr <- calcFDR(params) plotFDR(fdr) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Predictive model checking} First read in the predictive p-values: <<>>= pred <- ccPred(file=outdir) @ It is a good idea to look at histograms of the predictive p-values for the gene variances: <>= par(mfrow=c(1,2)) hist(pred$pval.ss.mix[,1]) hist(pred$pval.ss.mix[,2]) @ For mixture models, there is a specific function to plot histograms of predictive p-values corresponding to each of the mixture components. <>= par(mfrow=c(2,3)) plotPredChecks(pred$pval.ybar.mix2,params$pc,probz=0.8) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Tail posterior probability} Tail posterior probability is used to find differentially expressed genes with unstructured prior for the difference (fixed effects). It needs trace and parameters output from BGmix with {\tt jstar = -1} (all effects are fixed): <>= nreps <- c(8,8) outdir2 <- BGmix(ybar, ss, nreps=nreps, jstar=-1, niter=1000,nburn=1000) params2 <- ccParams(outdir2) res2 <- ccTrace(outdir2) @ and the tail posterior probability is calculated by <<>>= tpp.res <- TailPP(res2, nreps=nreps, params2, p.cut = 0.7, plots = F) @ Note that in this function the tail posterior probability is calculated only for the second contrast, assuming that it is the difference between condition 2 and condition 1 for the default contrast matrix $X$ (see Section 3.1), or for the first contrast in paired data.\\ The returned values are the tail posterior probabilities {\tt tpp}, estimated False Discovery Rate {\tt FDR} and estimated proportion of non-differentially expressed genes {\tt pi0}. FDR and {\tt pi0} can also be estimated separately: <<>>= FDR.res <- FDRforTailPP(tpp.res$tpp, a1 = params2$maa[1], a2 = params2$maa[2], n.rep1=nreps[1], n.rep2=nreps[2], p.cut = 0.8) pi0 <- EstimatePi0(tpp.res$tpp, tpp.res$pp0) @ The histogram of the tail posterior probabilities with its density under the null (no differentially expressed genes) and a graph of FDR can be plotted. (These plots can be done in function {\tt TailPP} by setting arguments {\tt plots = TRUE}.) <>= par(mfrow=c(1,2)) histTailPP(tpp.res) FDRplotTailPP(tpp.res) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Model options} To run the model with a flat prior on all effects (no mixture prior), set {\tt jstar} = -1. There are three main choices for the mixture prior, as presented in \cite{lewin07}. These are controlled by the option {\tt move.choice.bz} in {\tt BGmix}: \begin{itemize} \item Null point mass, alternatives Uniform ({\tt move.choice.bz = 1}) \item Null point mass, alternatives Gamma ({\tt move.choice.bz = 4}) \item Null small Normal, alternatives Gamma ({\tt move.choice.bz = 5}) \end{itemize} There are two alternatives for the prior on the gene variances. These are controlled by the option {\tt move.choice.tau} in {\tt BGmix}: \begin{itemize} \item Inverse Gamma ({\tt move.choice.tau = 1}) \item log Normal ({\tt move.choice.tar = 2}) \end{itemize} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Other functions} {\tt plotTrace} plots trace plots of model parameters (useful for assessing convergence of the MCMC) \\ {\tt plotCompare} produces scatter plot of two variables using the same scale for x and y axes\\ {\tt plotMixDensity} plots predictive density for mixture model (Note: you must save the trace of the predicted data for this: option {\tt trace.pred=1} in {\tt BGmix} and option {\tt q.trace=T} in {\tt ccPred}) \\ {\tt TailPP} plots the tail posterior probability (for use with unstructured priors on the effect parameters) \\ {\tt readBGX} reads in results from the {\tt BGX} package. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%% this bit removes the output directory so not left in package <>= unlink("run.1", recursive=TRUE) unlink("run.2", recursive=TRUE) @ \section{Acknowledgements} Thanks to Ernest Turro for invaluable help getting the package to work. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \begin{thebibliography}{99} \bibitem[{Bochkina \& Richardson(2007)}]{bochkina07} Bochkina, N. and Richardson, S. (2007). \newblock Tail posterior probability for inference in pairwise and multiclass gene expression data. \newblock {\em Biometrics} in press. \bibitem[{Hein {et~al.}(2005)Hein, Richardson, Causton, Ambler, \& Green}]{bgx} Hein, A.-M.~K., Richardson, S., Causton, H.~C., Ambler, G.~K., and Green, P.~J. (2005). \newblock BGX: a fully Bayesian gene expression index for Affymetrix GeneChip data. \newblock {\em Biostatistics} {\bf 6(3)}, 349--373. \bibitem[{Lewin {et~al.}(2007)Lewin, Bochkina, \& Richardson}]{lewin07} Lewin, A.~M., Bochkina, N. and Richardson, S. (2007). \newblock Fully Bayesian mixture model for differential gene expression: simulations and model checks. \newblock {\em submitted}. \end{thebibliography} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \end{document}