% \VignetteIndexEntry{A Short Introduction to the OmicsMarkeR Package}
% \VignettePackage{OmicsMarkeR}
%\VignetteEngine{knitr::knitr}

% To compile this document
% library('knitr'); rm(list=ls()); knit('OmicsMarkeR.Rnw')

\documentclass[12pt]{article}

<<echo=FALSE, results="asis">>=
BiocStyle::latex()
@

\bioctitle{A Short Introduction to the \Biocpkg{OmicsMarkeR} Package}
\author{Charles Determan Jr.\footnote{cdetermanjr@gmail.com}}

\begin{document}

\maketitle
\thispagestyle{empty}

\section{Introduction}

The \Biocpkg{OmicsMarkeR} package contains functions to streamline the
analysis of 'omics' level datasets with the objective of classifying groups
and determining the most important features. \Biocpkg{OmicsMarkeR} loads
packages as needed and assumes that they are installed. I will provide a
short tutorial using both synthetic datasets created by internal functions
and the 'Sonar' dataset.

Install \Biocpkg{OmicsMarkeR} using

<<eval=FALSE>>=
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("OmicsMarkeR")
@

to ensure that all the needed packages are installed.

\newpage

\section{Basic Classification Example}

\Biocpkg{OmicsMarkeR} has a few simplified functions that attempt to
streamline the classification and feature selection process, including the
addition of stability metrics.

We will first generate a synthetic dataset. The package includes three
functions that can be used to create multivariate datasets mimicking specific
omics examples. A null dataset can be created with
\Rfunction{create.random.matrix}, which generates a random multivariate
dataset with \Rcode{nvar = 50} and \Rcode{nsamp = 100}. The
\Rfunction{create.corr.matrix} function induces correlations in the dataset.
The \Rfunction{create.discr.matrix} function induces variables that
discriminate between groups. The number of groups can be specified with
\Rcode{num.groups}.

<<>>=
library("OmicsMarkeR")
set.seed(123)

dat.discr <- create.discr.matrix(
    create.corr.matrix(
        create.random.matrix(nvar = 50,
                             nsamp = 100,
                             st.dev = 1,
                             perturb = 0.2)),
    D = 10
)
@

To avoid confusion in the code, one can isolate the variables and classes
from the newly created synthetic dataset. These two objects are then passed
to the \Rfunction{fs.stability} function. I can then choose which
algorithm(s) to apply, e.g. \Rcode{method = c("plsda", "rf")}, the number of
top important features \Rcode{f = 10}, the number of bootstrap repetitions
for stability analysis \Rcode{k = 3}, the number of k-fold cross-validations
\Rcode{k.folds = 10}, and how much progress output to display, e.g.
\Rcode{verbose = 'none'}.

\warning{You will receive warnings if you run the code exactly as shown in
this vignette. This is intentional: the PLSDA runs on this simple dataset
often need only a single component, but at least two components must be
specified to fit the model. The warnings are provided for the user's
information.}

<<>>=
vars <- dat.discr$discr.mat
groups <- dat.discr$classes

fits <- fs.stability(vars,
                     groups,
                     method = c("plsda", "rf"),
                     f = 10,
                     k = 3,
                     k.folds = 10,
                     verbose = 'none')
@

If I would like to see the performance metrics, I can simply use the
\Rfunction{performance.metrics} function. This provides a concise data.frame
of confusion matrix and ROC statistics. Additionally, the
Robustness-Performance Trade-off value (RPT) is provided with the results.

<<>>=
performance.metrics(fits)

fits$RPT
@

If I want to see how often the features identified as most important
occurred, this is accomplished with \Rfunction{feature.table}.
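To make the idea of consistency and frequency concrete, counting how often
each feature turns up across a handful of selection runs can be sketched with
base R alone. The feature names and selections below are hypothetical and
purely for illustration; they are not \Biocpkg{OmicsMarkeR} output.

<<>>=
# hypothetical selections from three runs (illustration only)
selected <- list(c("V1", "V2", "V3"),
                 c("V1", "V3", "V4"),
                 c("V1", "V2", "V5"))

# consistency: number of runs in which each feature was identified
consistency <- table(unlist(selected))

# frequency: percentage of runs in which each feature was identified
frequency <- 100 * consistency / length(selected)

consistency
frequency
@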
\Rfunction{feature.table} returns a simple table reporting each feature's
consistency (i.e. how many times it was identified) and frequency (the
percentage of runs in which it was identified).

<<>>=
feature.table(fits, "plsda")
@

If the user is interested in applying the fitted model (determined by
\Rfunction{fs.stability}) to new data, this can be accomplished with
\Rfunction{predictNewClasses}. This could serve either as yet another level
of evaluating the tuned model's performance or as the prediction step in a
production-type setting where the model is systematically applied to new
data as it comes in.

<<>>=
# create some 'new' data
newdata <- create.discr.matrix(
    create.corr.matrix(
        create.random.matrix(nvar = 50,
                             nsamp = 100,
                             st.dev = 1,
                             perturb = 0.2)),
    D = 10
)$discr.mat

# original data combined to a data.frame
orig.df <- data.frame(vars, groups)

# see what the PLSDA predicts for the new data
# NOTE, newdata does not require a .classes column
predictNewClasses(fits, "plsda", orig.df, newdata)
@

\textbf{Note} - This function is not as efficient as I would like at present.
Currently it requires the original dataset to refit the model from the
parameters retained by \Rfunction{fs.stability}. I intend to provide the
option to retain fitted models so a user can simply pull them without a need
to refit. However, I am concerned about the potential size of such objects
(e.g. random forest, gbm, etc.). Thoughts and contributions are welcome.

\newpage

\section{Ensemble Methods}

In machine learning, ensembles of models can be used to obtain better
predictive performance than any individual model. There are multiple types
of ensembles, including the Bayes optimal classifier, Bayesian model
averaging (BMA), Bayesian model combination (BMC), bootstrap aggregation
(bagging), and boosting. Although the intention is to include all of these
methods, the only one currently implemented is bagging.

Bagging, in the simplest terms, is defined as giving a set of trained models
equal weight when 'voting' on an optimal solution. The most familiar
application of this concept is in the random forest algorithm. This technique
requires a defined aggregation method to combine the set of models.
Implemented methods include Complete Linear Aggregation (\Rfunction{CLA}),
Ensemble Mean (\Rfunction{EM}), Ensemble Stability (\Rfunction{ES}), and
Ensemble Exponential (\Rfunction{EE}).

To conduct an ensemble analysis, the code is nearly identical to the first
example in Section 2. The \Rfunction{fs.ensembl.stability} function contains
the same arguments as \Rfunction{fs.stability} in addition to a few more for
the ensemble components. Please see \Rcode{?fs.ensembl.stability} for
complete details on each. The two major additional parameters are
\Robject{bags} and \Robject{aggregation.metric}. The \Robject{bags} parameter
naturally defines the number of bagging iterations and
\Robject{aggregation.metric} is a string defining the aggregation method.
These arguments have common defaults predefined, so the call can be:

<<>>=
fits <- fs.ensembl.stability(vars,
                             groups,
                             method = c("plsda", "rf"),
                             f = 10,
                             k = 3,
                             k.folds = 10,
                             verbose = 'none')
@

As in Section 2, the \Rfunction{performance.metrics} function can be applied
for summary statistics.

\newpage

\subsection{Aggregation Methods}

If the user wishes to apply an aggregation method manually, utilizing results
from an alternative analysis, this can also be done. This package provides
the wrapper \Rfunction{aggregation} for exactly this purpose.
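Before looking at the package function, the general idea of rank aggregation
can be illustrated with a simple mean-rank scheme in base R, loosely in the
spirit of the Ensemble Mean method. This is only a conceptual sketch under
simplified assumptions, not the package's implementation, and the variable
names are made up for the example.

<<>>=
# illustration only: average the rank each variable receives from three
# hypothetical models, then re-rank the averages (1 = most important)
set.seed(42)
demo.ranks <- replicate(3, sample(seq(10)))
rownames(demo.ranks) <- paste0("Var", seq(10))

agg.demo <- rank(rowMeans(demo.ranks))
sort(agg.demo)
@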
For the package's aggregation methods, the variables must have been ranked
and the row names assigned as the variable names. The function will then
return the aggregated list of variables with their new respective ranks.

<<>>=
# test data
ranks <- replicate(5, sample(seq(50), 50))
row.names(ranks) <- paste0("V", seq(50))

head(aggregation(ranks, "CLA"))
@

This is used internally in \Rfunction{fs.ensembl.stability} to optimize which
variables are selected for inclusion in the final optimized model. The only
exception to the format above is the \Rfunction{EE} function, where the
number of variables must be defined with \Robject{f}.

\newpage

\section{Custom Tuning}

The default implementation assumes that each model will be tuned with the
same resolution of tuning parameters. For example, a default call with PLSDA
and Random Forest will result in tuning 3 components and 3 \Rcode{mtry}
values, respectively. However, let's say I want a finer tuning grid for my
random forest model. A customized grid can be created using
\Rfunction{denovo.grid}.

<<>>=
# requires data.frame of variables and classes
plsda <- denovo.grid(orig.df, "plsda", 3)
rf <- denovo.grid(orig.df, "rf", 5)

# create grid list
# Make sure to assign appropriate model names
grid <- list(plsda=plsda, rf=rf)

# pass to fs.stability or fs.ensembl.stability
fits <- fs.stability(vars,
                     groups,
                     method = c("plsda", "rf"),
                     f = 10,
                     k = 3,
                     k.folds = 10,
                     verbose = 'none',
                     grid = grid)
@

The user can also create a grid completely manually but must use the
appropriate parameter names as defined by the functions. These can be checked
with \Rfunction{params}, for example \Rcode{params(method="plsda")}.
\textbf{Note} - the argument names must be preceded by a period. This is to
prevent any unforeseen conflicts in the code.

\newpage

\section{Stability Metrics}

\subsection{Stability Metric Basics}

It is quite possible that a user has already fitted a model previously, or is
using a model that has not yet been implemented in this package, but is still
interested in applying one or more of the stability metrics defined within
this package. These functions are very simple to use. To demonstrate, let's
create some sample data consisting of our \textbf{Metabolite Population}.

<<>>=
metabs <- paste("Metabolite", seq(20), sep="_")
@

Now, let's say you have run your own model twice on different samples of a
dataset. You complete your feature selection and get two lists of
metabolites. Here I am just sampling randomly.

<<>>=
set.seed(13)

run1 <- sample(metabs, 10)
run2 <- sample(metabs, 10)
@

The user can now evaluate how similar the selected sets of metabolites are
via several possible stability metrics. These include the Jaccard Index
(\Rfunction{jaccard}), Dice-Sorensen (\Rfunction{sorensen}), Ochiai's Index
(\Rfunction{ochiai}), Percent of Overlapping Features (\Rfunction{pof}),
Kuncheva's Index (\Rfunction{kuncheva}), Spearman (\Rfunction{spearman}), and
Canberra (\Rfunction{canberra}). The latter two methods are not \emph{Set
Methods} and do require the same number of features in each vector. The
relevant citations are provided in each function's documentation. The general
use for most of the functions is:

<<>>=
jaccard(run1, run2)
@

The exception to this is Kuncheva's Index. This requires one additional
parameter, the number of features (e.g. metabolites) in the original dataset.
This metric is designed to account for the fact that smaller numbers of
variables increase the likelihood of matching sets by chance.
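A quick simulation, provided purely for illustration, makes this point about
chance overlap explicit: draw two random 10-feature selections and count how
many features they share, for a small and for a large feature pool. The
helper function below is hypothetical and not part of \Biocpkg{OmicsMarkeR}.

<<>>=
# illustration only: expected overlap, by chance, between two random
# 10-feature selections drawn from pools of different sizes
set.seed(7)
chance.overlap <- function(total, selected = 10, reps = 1000) {
    mean(replicate(reps,
        length(intersect(sample(total, selected),
                         sample(total, selected)))))
}

chance.overlap(20)    # small pool: substantial overlap expected by chance
chance.overlap(1000)  # large pool: overlap by chance is rare
@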
Naturally, when many more variables are present, seeing the same small subset
selected again and again is far more indicative of a real effect than it is
when only a small pool of variables exists.

<<>>=
# In this case, 20 original variables
kuncheva(run1, run2, 20)
@

\newpage

\subsection{Pairwise Stability}

\subsubsection{Pairwise Feature Stability}

The above examples immediately lead to the question: what if I have more than
two runs? What if I have 3, 5, 10, or more bootstrap iterations? This is also
known as a data perturbation ensemble approach. It would be very tedious to
call the same function for every single comparison, so a pairwise function
exists to allow a rapid comparison between all sets. In practice,
\Rfunction{pairwise.stability} is very similar to the individual stability
functions. Let's take an example consisting of 5 runs.

<<>>=
set.seed(21)

# matrix of Metabolites identified (e.g. 5 trials)
features <- replicate(5, sample(metabs, 10))
@

Please note that currently only \Rclass{matrix} objects are accepted by
\Rfunction{pairwise.stability}. To use the function, you simply pass your
matrix of variables and the stability metric (e.g. sorensen). The only
exception is when applying Kuncheva's Index, where the \Robject{nc} parameter
must again be set (it can be ignored otherwise). This returns a list
containing the upper triangular matrix of pairwise stability values and an
overall average.

<<>>=
pairwise.stability(features, "sorensen")
@

\newpage

\subsubsection{Pairwise Model Stability}

Now, in the spirit of this package, you may take the alternative approach of
creating several bootstrapped data sets and running more than one statistical
model on each data set. You could compare each one manually but, again, to
avoid such tedious work another function is provided specifically for this
purpose. Let's take a theoretical example where I have bootstrapped 5
different data sets and applied two models to each data set (PLSDA and Random
Forest). Please note that currently only \Rclass{list} objects are accepted
by \Rfunction{pairwise.model.stability}. \textbf{Note} - here I am only
sampling randomly, but in practice both models would have been trained on the
same bootstrapped data sets.

<<>>=
set.seed(999)

plsda <- replicate(5, sample(metabs, 10))
rf <- replicate(5, sample(metabs, 10))

features <- list(plsda=plsda, rf=rf)

# nc may be omitted unless using kuncheva
pairwise.model.stability(features, "kuncheva", nc=20)
@

\newpage

\section{Permutation Analysis}

One additional level of analysis often applied to these datasets is Monte
Carlo permutation. For example, I may want to check how likely it is that my
model distinguishes between the groups merely by chance. This can be done by
permuting the groups in the dataset and applying the model to each
permutation, which is accomplished with \Rfunction{perm.class}. This function
also provides a plot of the classification distribution.

Additionally, one may be interested in another way to evaluate the importance
of variables in distinguishing the groups. Once again, the groups can be
permuted, the model refit to the data, and the importance of the variables
evaluated. This is accomplished with \Rfunction{perm.features}.
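The underlying idea can be sketched with base R for a single, simulated
variable: compare an observed group difference to the differences obtained
after repeatedly shuffling the class labels. This is only a conceptual
illustration; the package functions shown next do the analogous thing with
the fitted classification models.

<<>>=
# illustration only: Monte Carlo permutation p-value for one simulated
# variable measured in two groups
set.seed(5)
x <- c(rnorm(20, mean = 0), rnorm(20, mean = 1))
cl <- rep(c("A", "B"), each = 20)

# observed absolute difference in group means
obs <- abs(diff(tapply(x, cl, mean)))

# the same statistic after repeatedly permuting the class labels
perm <- replicate(1000, {
    shuffled <- sample(cl)
    abs(diff(tapply(x, shuffled, mean)))
})

# proportion of permutations at least as extreme as the observed difference
mean(perm >= obs)
@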
<<>>=
# permute classes
perm.class(fits, vars, groups, "rf",
           k.folds=5,
           metric="Accuracy",
           nperm=10)

# permute variables/features
perm.features(fits, vars, groups, "rf",
              sig.level = .05,
              nperm = 10)
@

\newpage

\section{Parallel Analysis}

Given the repetitive nature of these analyses, there are ample opportunities
to leverage the power of parallel computing. The functions
\Rfunction{fs.stability}, \Rfunction{fs.ensembl.stability},
\Rfunction{perm.class}, and \Rfunction{perm.features} can all run in parallel
simply by specifying the parameter \Rcode{allowParallel = TRUE} in the
respective function. However, the parallel backend must be registered in
order for this to work. There are slight differences between operating
systems, so here are two examples.

On a Unix-like OS, you will probably use \CRANpkg{doMC}:

<<>>=
library(parallel)
library(doMC)

# get number of cores and register the backend
n <- detectCores()
registerDoMC(n)
@

On a Windows OS, you will likely use \CRANpkg{doSNOW}:

<<>>=
library(parallel)
library(doSNOW)

# get number of cores
n <- detectCores()

# make clusters
cl <- makeCluster(n)

# register backend
registerDoSNOW(cl)
@

\textbf{NOTE} - remember to stop your cluster on Windows when you are
finished, with \Rcode{stopCluster(cl)}.

\newpage

<<>>=
sessionInfo()
@

\end{document}