%\VignetteEngine{knitr::knitr}

\documentclass[a4paper, 9pt]{article}


<<style-knitr, eval=TRUE, echo=FALSE, results="asis">>=
BiocStyle::latex()
@

%\VignetteIndexEntry{An R Package for TRanslational ONCOlogy}

\usepackage[utf8]{inputenc}
\usepackage{graphicx}
\usepackage{placeins}
\usepackage{url}
\usepackage{tcolorbox}
\usepackage{authblk}

\begin{document}

\title{Using the \Biocpkg{TRONCO} package}

\author[1]{Marco Antoniotti}
\author[2]{Giulio Caravagna}
\author[1]{Luca De Sano}
\author[1]{Alex Graudenzi}
\author[1]{Giancarlo Mauri}
\author[3]{Bud Mishra}
\author[4]{Daniele Ramazzotti}

\affil[1]{Dipartimento di Informatica Sistemistica e Comunicazione, Università degli Studi Milano Bicocca Milano, Italy}
\affil[2]{School of Informatics, Edinburgh University, Edinburgh, UK}
\affil[3]{Courant Institute of Mathematical Sciences, New York University, New York, USA}
\affil[4]{Department of Pathology, Stanford University, Stanford, CA, USA}

\date{\today}
\maketitle

\begin{tcolorbox}{\bf Overview.} The \Biocpkg{TRONCO} ({\sc TR}{\em anslational} {\sc ONCO}{\em logy})  \R{} package  collects algorithms to infer  progression models via the approach of Suppes-Bayes Causal Network, both from an ensemble of tumors (cross-sectional samples) and within an individual patient (multi-region or single-cell samples). The package provides parallel implementation of algorithms that process binary matrices where each row represents a tumor sample and each column a single-nucleotide or a structural variant driving the  progression; a 0/1 value models the absence/presence of that alteration in the sample. The tool can import data from plain,  MAF or GISTIC format files, and can fetch it from the cBioPortal for cancer genomics. Functions for  data manipulation and visualization are provided, as well as functions to import/export such data to other bioinformatics  tools for, e.g,  clustering or detection of mutually exclusive alterations. Inferred models can be visualized and tested for their confidence via bootstrap and cross-validation. \Biocpkg{TRONCO} is used for  the implementation of the Pipeline for Cancer Inference. \\

\vspace{1.0cm}

{\em In this vignette, we will  give an overview of the package by presenting some of the functions that could be most commonly used to arrange a data-analysis pipeline, along with their parameters to customize \Biocpkg{TRONCO}'s functioning. Advanced example case studies  are available at the tool webpage}

\vspace{1.0cm}

\renewcommand{\arraystretch}{1.5}

\begin{tabular}{ll}
{\bf Contact.}& \email{tronco@disco.unimib.it} \\
{\bf Bugs report.} & \url{https://github.com/BIMIB-DISCo/TRONCO} \\
{\bf Website.} & \url{https://sites.google.com/site/troncopackage}
\end{tabular}
\end{tcolorbox}


<<include=FALSE>>=
library(knitr)
opts_chunk$set(
concordance = TRUE,
background = "#f3f3ff"
)
@


\newpage

\tableofcontents


\section{Changelog} 

\begin{itemize}
\item[2.8.1] Minor fix on documentation.
\item[2.7.7] RNA Seq validation. Random restart on Hill Climbing added to CAPRI algorithm. Minor fixes to algorithms and error model.
\item[2.7.3] Development version. Assignment to .GlobalEnv removed.
\item[2.6.1] Current stable version.
\item[2.5.3] New algorithms: Edmonds, Gabow, Chow-Liu and Prim. New scores: PMI, CPMI, MI. 
\item[2.4.3] Bugfix. 
\item[2.4.2] Implements a noise model and finalizes a series of algorithms reconstructing Suppes-Bayes Causal Network as maximum spanning trees.
\item[2.4] New statistics available for model confidence via cross-validation routines. New algorithms based on Minimum Spanning Tree extraction.
\item[2.0] Released in summer 2015 on our GitHUB, replaced the \Bioconductor{} version in autumn 2015. This version is parallel, includes also the CAPRI algorithm, supports common GISTIC and  MAF input formats, supports TCGA samples editing and queries to the cBio portal. This version has new plotting capabilities, and a general from-scratch design. It is not compatible with previous releases.
\item[1.0] released in mid 2014, includes CAPRESE algorithm. It is now outdated and no more maintained;
\end{itemize}


\section{Algorithms and useful links} \label{sec:stuff}

\renewcommand{\arraystretch}{2}

\begin{center}
\begin{tabular}{l | p{5.0cm} | l  | p{6.0cm}}
{\bf Acronym} & {\bf Extended name} & {\bf App.} & {\bf Reference}\\ \hline

CAPRESE &  Cancer Progression Extraction with Single Edges &  Ind &  \href{http://www.ncbi.nlm.nih.gov/pubmed/25299648}{PLoS ONE, 9(10):e108358, 2014.}
\\ \hline

CAPRI &  Cancer Progression Inference &  Ens & \href{http://www.ncbi.nlm.nih.gov/pubmed/25971740}{Bioinformatics 31(18), 3016-3016, 2015.}\\ \hline

Edmond & Directed Minimum Spanning Tree with Mutual Information & Ind & \href{https://www.biorxiv.org/content/early/2017/09/04/132183}{Publication.}\\ \hline
Gabow & Partially Directed Minimum Spanning Tree with Mutual Information & Ind & \href{https://www.biorxiv.org/content/early/2017/09/04/132183}{Publication.}\\ \hline
Chow Liu & Undirected Minimum Spanning Tree with Likelihood-Fit & Ind & \href{https://www.biorxiv.org/content/early/2017/09/04/132183}{Publication.}\\ \hline
Prim & Undirected Minimum Spanning Tree with Mutual Information & Ind & \href{https://www.biorxiv.org/content/early/2017/09/04/132183}{Publication.}\\ \hline
 \end{tabular}
\end{center}
{\small {\bf Legend.} Ens.: ensemble-level with cross-sectional data; Ind.: individual-level with single-cell or multi-region data.}


\paragraph{External links to resources related to TRONCO.}

\begin{itemize}
\item  TRONCO was introduced in \href{https://academic.oup.com/bioinformatics/article/32/12/1911/1743307}{De Sano, Luca, et al. "TRONCO: an R package for the inference of cancer progression models from heterogeneous genomic data." Bioinformatics 32.12 (2016): 1911-1913}. 
\item  TRONCO since version {\bf 2.3} is used to implement the {\bf Pipeline For Cancer Inference (PiCnIc)} described in \href{http://www.pnas.org/content/113/28/E4025}{Caravagna, Giulio, et al. "Algorithmic methods to infer the evolutionary trajectories in cancer progression." Proceedings of the National Academy of Sciences 113.28 (2016): E4025-E4034}. 
\item Case studies featuring Atypical Chronic Myeloid Leukemia,  Colorectal Cancer, Clear Cell Renal Cell Carcinoma  and others are available at the tool \href{https://sites.google.com/site/troncopackage/}{webpage}. Code for replication of each of those study is made available through  \href{https://github.com/BIMIB-DISCo}{Bioinformatics Milano-Bicocca's Github}.


\end{itemize}


\section{Loading data}

\paragraph{Preliminaries.} 


<<req>>=
library(TRONCO)
data(aCML)
data(crc_maf)
data(crc_gistic)
data(crc_plain)
@


TRONCO transforms input data in a sort of database-alike format, where three main fields are presente: {\tt genotypes} which contains the genomic signatures of the input samples, {\tt annotations} which provides an index to the events present in the data and {\tt types}, a field mapping type of events (e.g., mutations, CNAs, etc.) to colors for display visualization. Other annotations are generated when a dataset is augmented with some metadata.  A TRONCO object shall be edited by using TRONCO functions,  to avoid to create inconsistencies in its internal representation. Function \Rfunction{is.compliant} can be used to test if a TRONCO object is consistent; the function is called by any TRONCO function before returning  a modified object, so to ensure that consistency is preserved -- \Rfunction{is.compliant}  will raise an error if this is not the case.

TRONCO supports the import of data from 3 formats. The Mutation Annotation Format (\textit{MAF}) is a tab-delimited file containing somatic and/or germline mutation annotations; the \textit{GISTIC} format for copy number alterations as defined by TCGA and a custom boolean matrix format where the user can directly specify the mutational profiles to be importend.   Through some data included in the package we will show how to load your datasets in TRONCO.  
 \begin{itemize}
\item[{\tt aCML}]  a TRONCO object that represents the {\em atypical Chronic Myeloid Leukemia} dataset by Piazza {\em et al.} (Nat. Gen. 2013 45(1):18-24).
\item[{\tt crc\_maf}]  a shortened version of the {\em colorectal cancer mutation data}   made available by the TCGA consortium within the COADREAD project\footnote{See \url{https://tcga-data.nci.nih.gov/docs/publications/coadread\_2012/} and our PicNiC case study (\S \ref{sec:stuff}) for the real analysis of such data.}
 \item[{\tt crc\_gistic}] from the same   TCGA  project, we also provide a shortened version of the focal CNAs in the GISTIC format where 1 represents a low level gain, 2 a high level gain, -1 a heterozygous loss of a gene and -2 its homozygous loss.
\item[{\tt crc\_plain}]  a custom boolean matrix where rows are samples, and columns represent events -- in this case alterations  in a certain gene. Notice with this format one could also custom types of alterations, for instance wider chromosomal aberrations or, in principle, epigenetic states (over-expression, methylated regions, etc.) that are persistent across tumor evolution. 


\end{itemize}

 
Whatever is dataset created as explained in the next sections, it can be annotated by adding a mnemonic description of the data, which will be used as plot titles when possible.
Function \Rfunction{annotate.description} raises a warning if the dataset was previously annotated.


aCML = annotate.description(aCML, 'aCML data (Bioinf.)')


\subsection{Mutations annotated in a MAF format}
We use the function \texttt{import.MAF} to import a dataset in MAF format, in this case the following  TCGA dataset

<<>>=
head(crc_maf[, 1:10])
@


A default importation is done  without adding parameters to \texttt{import.MAF}. In this case, all mutations per gene will be considered equivalent, regardless of the type that is annotated in the MAF.  Also, all genes will be imported, and all samples.


<<import.MAF>>=
dataset_maf = import.MAF(crc_maf)
@


See \S \ref{sec:view} to understand how to visualize a TRONCO dataset. In the above case -- where we see that mutations are annotated as {\tt Missense\_Mutation} or {\tt Nonsense\_Mutation}, if a gene in a sample has both, these will be merged to a unique {\tt Mutation} type. In this case a pair gene name with {\tt Mutation} will be what we call an ``event'' in our dataset -- e.g.,  APC {\tt Mutation}.

If one would like  to have two distinct events in the dataset, i.e., APC {\tt Missense\_Mutation} and APC {\tt Nonsense\_Mutation}, parameter {\tt merge.mutation.types} should be set to false in the call to \Rfunction{import.MAF}. 


<<import.MAF.real>>=
dataset_maf = import.MAF(crc_maf, merge.mutation.types = FALSE)
@


Sometimes, we might want to filter out some of the entries in a MAF -- maybe restricting the type of genes, mutations or sample that we want to process. If one defines
{\tt filter.fun} as a function that returns {\tt TRUE} only for those entries which shall be considered, he gets a filter process which is applied to each row of the MAF file prior to transforming that into a TRONCO dataset. In this example we select only mutations annotated to APC -- we access that through the {\tt Hugo\_Symbol} flag of a MAF. 

<<import.MAF.box>>=
dataset_maf = import.MAF(crc_maf, filter.fun = function(x){ x['Hugo_Symbol'] == 'APC'} )
@


It is also sometimes convenient -- especially when working with data collected from a single individual patient -- to distinguish the type of mutations and their position in a gene, or if they are somehow annotated to COSMIC or other databases. For instance, we might want to want to use the {\tt MA.protein.change} annotation in the MAF file to get composite names such as TP53.R175H, TP53.R213, TP53.R267W etc. This can be done by setting {\tt paste.to.Hugo\_Symbol} to have the relevant name of the MAF annotation


<<import.MAF.crc_maf>>=
dataset_maf = import.MAF(crc_maf, 
    merge.mutation.types = FALSE, 
    paste.to.Hugo_Symbol = c('MA.protein.change'))
@


TRONCO supports custom  MAF files, where possibly not all the standard annotations are present, via {\tt irregular = TRUE}.


\subsection{Copy Number Variants annotated in the GISTIC format}
We use the function \Rfunction{import.GISTIC} to import a dataset in GISTIC format, in this case from

<<>>=
crc_gistic
@


In its default execution all the data annotated in the file is imported. But in principle  it is possible to avoid to import some   genes or samples; in this case it is sufficient to use parameters {\tt filter.genes} and {\tt filter.samples} for this function.


<<import.GISTIC>>=
dataset_gistic = import.GISTIC(crc_gistic)
@


\subsection{Custom alterations annotated in a boolean matrix}
One can annotate its custom type of alterations in a boolean matrix such as {\tt crc\_plain}

<<>>=
crc_plain
@

In this case,  function \Rfunction{import.genotypes} will convert the matrix to a TRONCO  object where events' names and samples codes will be set from column and row names of the matrix. If this is not possible, these will be generated from templates. By default, the {\tt event.type} is set to {\tt variant} but one can specify a custom name for the alteration that is reported in the matrix

<<import.genotypes>>=
dataset_plain = import.genotypes(crc_plain, event.type='myVariant')
@


\subsection{Downloading data from the cBio portal for cancer genomics}

TRONCO uses the R interface to cBio to query data from the portal. All type of data can be downloaded from the portal, which includes MAF/GISTIC data for a lot of different cancer studies. An example of interaction with the portal is archived at the tool's webpage. 

Here, we  show how to download lung cancer data somatic mutations for genes TP53, KRAS and PIK3CA,  from the lung cancer project run by TCGA, which is archived as  {\tt luad\_tcga\_pub} at cBio. If some of the parameters to \Rfunction{cbio.query} are missing the function will become interactive by showing a list of possible data available at the portal.

<<results='hide', eval=FALSE>>=
data = cbio.query(
    genes=c('TP53', 'KRAS', 'PIK3CA'), 
    cbio.study = 'luad_tcga_pub',
    cbio.dataset = 'luad_tcga_pub_cnaseq', 
    cbio.profile = 'luad_tcga_pub_mutations')
@


\section{Data visualisation}

All examples in this section will be done with the the aCML dataset as reference.

\subsection{Summary report for a dataset and boolean queries}\label{sec:view}

We use the function \texttt{view} to get a short summary of a dataset that we   loaded in TRONCO; this function reports on the number of samples and  events, plus some meta information that could be displayed graphically.  


<<view>>=
view(aCML)
@
%\tcblower
%ss


\subsection{Creating views with the ``as'' functions}

Several functions are available to create views over a dataset, with a set of parameter which can constraint the view -- as in the SELECT/JOIN approaches in databases. In the following examples we  show their execution with the default parameters, but  shorten their output to make this document readable.

The main ``as'' functions  are here documented.  \texttt{as.genotypes}, that we can use to get the matrix of ``genotypes'' that we imported.


<<asgenotypes>>=
as.genotypes(aCML)[1:10,5:10]
@


Differently,  \texttt{as.events} and \texttt{as.events.in.samples}, that show tables with the events that we are processing in  all dataset or in a specific sample that we want to examine.


<<asevents>>=
as.events(aCML)[1:5, ]
as.events.in.sample(aCML, sample = 'patient 2')
@


Concerning genes,  \texttt{as.genes} shows the mnemonic names of the genes (or chromosomes, cytobands, etc.) that we included in our dataset.


<<asgenes>>=
as.genes(aCML)[1:8]
@


And \texttt{as.types} shows the types of alterations (e.g., mutations, amplifications, etc.) that we have  find in our dataset, and  function \texttt{as.colors} shows the list of the  colors which are associated to each type.


<<astypes>>=
as.types(aCML)
as.colors(aCML)
@


A function \texttt{as.gene} can be used to display the alterations of a specific gene  across the samples


<<asgene>>=
head(as.gene(aCML, genes='SETBP1'))
@


Views over samples can be created as well.  \texttt{as.samples} and \texttt{which.samples} list all the samples in the data, or return a list of samples that harbour a certain alteration. The former is 


<<assamples-acml>>=
as.samples(aCML)[1:10]
@


and the latter is
 

<<assamples-acml-tet>>=
which.samples(aCML, gene='TET2', type='Nonsense point')
@


A slightly different function, which manipulates the data,  is \texttt{as.alterations}, which transforms a dataset with events of different type  to events of a unique type, labeled ``Alteration''.


<<asalterations>>=
dataset = as.alterations(aCML)
@
<<asalterations.view>>=
view(dataset)
@


When samples are enriched with stage information function  \texttt{as.stages} can be used to create a view over such table. Views over patterns can be created as well -- see Model Inference with CAPRI.

\subsection{Dataset size}

A set of functions allow to get the number of genes, events, samples, types and patterns in a dataset.


<<number>>=
ngenes(aCML)
nevents(aCML)
nsamples(aCML)
ntypes(aCML)
npatterns(aCML)
@


\subsection{Oncoprints}

Oncoprints are the most effective  data-visualization functions in TRONCO. These are heatmaps where rows represent variants, and columns samples ({\em the reverse} of the input format required by TRONCO), and are annotated and displayed/sorted to enhance which samples have which mutations etc. 

By default \Rfunction{oncoprint}  will try to  sort samples and events to enhance exclusivity patterns among the events.


<<onco, fig.show='hide', fig.width=6, fig.height=5>>=
oncoprint(aCML)
@


But the sorting mechanism is bypassed if one wants to cluster samples or events, or if one wants to split samples by cluster (not shown). In the clustering case, the ordering is given by the dendrograms. In this case we also show the annotation of some groups of events via parameter {\tt gene.annot}.


<<oncocl, fig.show='hide', fig.width=5, fig.height=5,results='hide'>>=
oncoprint(aCML, 
    legend = FALSE, 
    samples.cluster = TRUE, 
    gene.annot = list(one = list('NRAS', 'SETBP1'), two = list('EZH2', 'TET2')),
    gene.annot.color = 'Set2',
    genes.cluster = TRUE)
@


\begin{figure*}[ht]
\begin{center}
\includegraphics[width=0.5\textwidth]{figure/onco-1}
\includegraphics[width=0.45\textwidth]{figure/oncocl-1}
\end{center}
\caption{Two different calls to \texttt{oncoprint} with  aCML data. This plot gives a graphical visualization of the events that are in the dataset -- with a color per event type -- but in left it sorts samples to enhance exclusivity patterns among the events, while in right it clusters samples/events.}
\end{figure*}


Oncoprints can be annotated; a special type of annotation is given by stage data. As this is not available for the aCML dataset, we create it randomly, just for the sake of showing how the oncoprint is enriched with this information. This is the random stage map that we create -- if some samples had no stage a NA would be added automatically. 


<<stages>>=
stages = c(rep('stage 1', 32), rep('stage 2', 32))
stages = as.matrix(stages)
rownames(stages) = as.samples(aCML)
dataset = annotate.stages(aCML, stages = stages)
has.stages(aCML)
head(as.stages(dataset))
@


The \Rfunction{as.stages} function can now be used to create a view over stages.

<<>>=
head(as.stages(dataset))
@


After that the data is annotated via \Rfunction{annotate.stages} function, we can again plot an oncoprint -- which this time will detect that the dataset has also stages associated, and will diplay those 


<<onco-stages, fig.show='hide', fig.width=6, fig.height=5, results='hide'>>=
oncoprint(dataset, legend = FALSE)
@


If one is willing to display samples grouped according to some variable, for instance after a sample clustering task, he can use {\tt group.samples} parameter of \Rfunction{oncoprint} and that will override the mutual exclusivity ordering. Here, we make the trick of using the stages as if they were such clustering result.


<<onco-clus2, fig.show='hide', fig.width=6, fig.height=5, results='hide'>>=
oncoprint(dataset, group.samples = as.stages(dataset))
@


\begin{figure*}[ht]
\begin{center}
\includegraphics[width=0.45\textwidth]{figure/onco-stages-1}
\includegraphics[width=0.45\textwidth]{figure/onco-clus2-1}
\end{center}
\vspace*{.05in}
\caption{Example \texttt{oncoprint} output for aCML data with randomly annotated stages, in left, and samples clustered by group assignment in right -- for simplicity the group variable is again the stage annotation.}
\end{figure*}


\FloatBarrier


\subsection{Groups visualization (e.g., pathways)}

TRONCO provides functions to visualize groups of events, which in this case are called pathways -- though this could be any group that one would like to define. Aggregation happens with the same rational as the \Rfunction{as.alterations} function, namely by merging the events in the group.

We make an example of a pathway called {\tt MyPATHWAY} involving genes SETBP1, EZH2 and WT1; we want it to be colored in red, and we want to have the genotype of each event to be maintened in the dataset. We proceed as follows  (R's output is omitted).

<<pathway>>=
pathway = as.pathway(aCML,
    pathway.genes = c('SETBP1', 'EZH2', 'WT1'),
    pathway.name = 'MyPATHWAY',
    pathway.color = 'red',
    aggregate.pathway = FALSE)
@

Which we then visualize with an oncoprint

<<onco-pathway, fig.show='hide', fig.width=6.5, fig.height=2, results='hide'>>=
oncoprint(pathway, title = 'Custom pathway',  font.row = 8, cellheight = 15, cellwidth = 4)
@


\begin{figure*}[ht]
\begin{center}
\includegraphics[width=0.9\textwidth]{figure/onco-pathway-1}
\end{center}
\vspace*{.05in}
\caption{\texttt{oncoprint} output of a custom pathway called {\tt MyPATHWAY} involving genes SETBP1, EZH2 and WT1; the genotype of each event is shown.}
\end{figure*}

In TRONCO there is also a function which creates the pathway view and the corresponding oncoprint to  multiple pathways, when these are given as a list. We make here a simple example of two custom pathways.


<<onco-pathway-viz, fig.show='hide', fig.width=6.5, fig.height=1.8>>=
pathway.visualization(aCML, 
    pathways=list(P1 = c('TET2', 'IRAK4'),  P2=c('SETBP1', 'KIT')),        
    aggregate.pathways=FALSE,
    font.row = 8)
@

If we had to visualize just the signature of the pathway, we could set  {\tt aggregate.pathways=T}.

<<onco-pathway-viz2, fig.show='hide', fig.width=6.5, fig.height=1, results='hide'>>=
pathway.visualization(aCML, 
    pathways=list(P1 = c('TET2', 'IRAK4'),  P2=c('SETBP1', 'KIT')),
    aggregate.pathways = TRUE,
    font.row = 8)
@

\begin{figure*}[ht]
\begin{center}
\includegraphics[width=0.9\textwidth]{figure/onco-pathway-viz-1}
\includegraphics[width=0.9\textwidth]{figure/onco-pathway-viz2-1}
\end{center}
\vspace*{.05in}
\caption{\texttt{oncoprint} output of a custom pair of pathways, with events shown in top and hidden in bottom.}
\end{figure*}


The same operation could have been done using \href{https://wikipathways.org}{WikiPathways}. We can query WikiPathways and collect HGNC gene symbols and titles for pathways of interest as follows. (R's output is omitted).

<<pathway-alt, eval=FALSE>>=
library(rWikiPathways)
# quotes inside query to require both terms
my.pathways <- findPathwaysByText('SETBP1 EZH2 TET2 IRAK4 SETBP1 KIT')
human.filter <- lapply(my.pathways, function(x) x$species == "Homo sapiens")
my.hs.pathways <- my.pathways[unlist(human.filter)] 
# collect pathways idenifiers
my.wpids <- sapply(my.hs.pathways, function(x) x$id)

pw.title<-my.hs.pathways[[1]]$name
pw.genes<-getXrefList(my.wpids[1],"H") 
@

Now {\tt pw.genes} and {\tt pw.title} can be used as input for the function {\tt as.pathway}.
It is also possible to view and edit these pathways at WikiPathways using the following commands to open tabs in your default browser.

<<wikipathways, eval=FALSE>>=
browseURL(getPathwayInfo(my.wpids[1])[2])
browseURL(getPathwayInfo(my.wpids[2])[2])
browseURL(getPathwayInfo(my.wpids[3])[2])
@
    

\FloatBarrier

\section{Data manipulation} \label{sec:datamanip}

All examples in this section will be done with the the aCML dataset as reference.

\subsection{Modifying events and samples}

TRONCO  provides functions for renaming the events that were included in a dataset, or the type associated to a set of events (e.g., a ``Mutation'' could be renamed to a ``Missense Mutation'').


<<rename>>=
dataset = rename.gene(aCML, 'TET2', 'new name')
dataset = rename.type(dataset, 'Ins/Del', 'new type')
as.events(dataset, type = 'new type')
@


and return a modified TRONCO object. More complex operations are also possible. For instance, two events with the same signature -- i.e., appearing in the same samples -- can be joined to a new event (see also Data Consolidation in Model Inference) with the same signature and a new name.


<<join1>>=
dataset = join.events(aCML, 
    'gene 4',
    'gene 88',
    new.event='test',
    new.type='banana',
    event.color='yellow')
@

    
where in this case we also created a new event type, with its own color.

In a similar way we can decide to join all the events of two distinct types, in this case if a gene $x$ has signatures for both type of events, he will get a unique signature with an alteration present if it is either of the second {\em or} the second type 

<<join2>>=
dataset = join.types(dataset, 'Nonsense point', 'Nonsense Ins/Del')
as.types(dataset)
@


TRONCO also provides  functions for deleting specific events, samples or types.


<<delete>>=
dataset = delete.gene(aCML, gene = 'TET2')
dataset = delete.event(dataset, gene = 'ASXL1', type = 'Ins/Del')
dataset = delete.samples(dataset, samples = c('patient 5', 'patient 6'))
dataset = delete.type(dataset, type = 'Missense point')
view(dataset)
@

    
\subsection{Modifying patterns}

TRONCO  provides functions to edit patterns,  pretty much as for any other type of events. Patterns however have a special denotation and are supported only by CAPRI algorithm -- see Model Reconstruction with CAPRI to see a practical application of that.


\subsection{Subsetting a dataset}


It is very often the case that we want to subset a dataset by either selecting only some of its samples, or some of its events. Function \texttt{samples.selection} returns a dataset with only some selected samples.


<<assamples-assamples>>=
dataset = samples.selection(aCML, samples = as.samples(aCML)[1:3])
view(dataset)
@


Function \texttt{events.selection}, instead, performs selection according to a filter of events. With this function, we can subset data according to a frequency, and we can force inclusion/exclusion of certain events by specifying their name. For instance, here we pick all events with a minimum frequency of 5\%, force exclusion of SETBP1 (all events associated), and inclusion of EZH1 and EZH2.

<<eventsselection>>=
dataset = events.selection(aCML,  filter.freq = .05, 
    filter.in.names = c('EZH1','EZH2'), 
    filter.out.names = 'SETBP1')
@
<<eventsselection2>>=
as.events(dataset)
@


An example visualization of the data before and after the selection process can be obtained by combining the {\tt gtable} objects returned by \Rfunction{oncoprint}. We here use 
{\tt gtable = T} to get access to have a  GROB table returned, and {\tt silent = T} to avoid that the calls to the function display on the device; the call to \Rfunction{grid.arrange} displays the captured {\tt gtable} objects.


<<onco-ex-sel, fig.show='hide', fig.width=7, fig.height=5.5, results='hide'>>=
library(gridExtra)
grid.arrange(
    oncoprint(as.alterations(aCML, new.color = 'brown3'), 
        cellheight = 6, cellwidth = 4, gtable = TRUE,
        silent = TRUE, font.row = 6)$gtable,
    oncoprint(dataset, cellheight = 6, cellwidth = 4,
        gtable = TRUE, silent = TRUE, font.row = 6)$gtable, 
    ncol = 1)
@

\begin{figure*}[ht]
\begin{center}
\includegraphics[width=0.9\textwidth]{figure/onco-ex-sel-1}
\end{center}
\caption{Multiple output from \Rfunction{oncoprint} can be captured as a {\tt gtable} and composed via {\tt grid.arrange} (package \CRANpkg{ gridExtra}). In this case we show  aCML data on top -- displayed after the \Rfunction{as.alterations} transformation -- versus a selected subdataset of events with a minimum frequency of 5\%, force exclusion of SETBP1 (all events associated), and inclusion of EZH1 and EZH2.}
\end{figure*}

\FloatBarrier

\section{Model inference}

We make use of the most of the  functions described above to show how to perform  inference with various algorithms; the reader should read first those sections  of the vignette to have an explanation of how those functions work.  The aCML dataset is used as a test-case for all algorithms, regardless it should be precessed by algorithms to infer ensemble-level progression models.

To replicate the plots of the original  paper were the aCML dataset was first analyzed with  {CAPRI}, we can change the colors assigned to each type of event with the function \texttt{change.color}. 


<<>>=
dataset = change.color(aCML, 'Ins/Del', 'dodgerblue4')
dataset = change.color(dataset, 'Missense point', '#7FC97F')
as.colors(dataset)
@


\paragraph{Data consolidation.} All TRONCO algorithms require an input dataset were events have non-zero/non-one probability, and are all distinguishable. The tool provides a function to  return lists of events which do not satisfy these constraint.

<<>>=
consolidate.data(dataset)
@


The aCML data has none of the above issues (the call returns empty lists); if this were not the case data manipulation functions can be used to edit a TRONCO object (\S \ref{sec:datamanip}).


\subsection{CAPRI}

 In what follows, we show CAPRI's functioning by replicating the aCML case study presented in CAPRI's original paper. Regardless from which types of mutations we include, we select only the genes  mutated at least in the $5\%$ of the patients -- thus   we first use \Rfunction{as.alterations} to have gene-level frequencies, and then we apply there a frequency filter (R's output is omitted). 


<<other-alterations>>=
alterations = events.selection(as.alterations(aCML), filter.freq = .05)
@


To proceed further with the example we  create the \texttt{dataset} to be used for the inference of the  model. From the original dataset  we select all the genes whose mutations are occurring at least the $5\%$ of the times, and we get that by the alterations profiles; also we force inclusion of all the events for the  genes involved in an hypothesis (those included in variable {\tt gene.hypotheses}, this list is based on the support  found in the literature of potential aCML patterns).


<<>>=
gene.hypotheses = c('KRAS', 'NRAS', 'IDH1', 'IDH2', 'TET2', 'SF3B1', 'ASXL1')
aCML.clean = events.selection(aCML,
    filter.in.names=c(as.genes(alterations), gene.hypotheses))
aCML.clean = annotate.description(aCML.clean, 
    'CAPRI - Bionformatics aCML data (selected events)')
@


We show a new oncoprint of this latest dataset where we annotate the genes in \texttt{gene.hypotheses} in order to identify them. The sample names are also shown. 


<<onco-edited, fig.show='hide', fig.width=8, fig.height=5.5, results='hide'>>=
oncoprint(aCML.clean, gene.annot = list(priors = gene.hypotheses), sample.id = TRUE)
@


\begin{figure*}[ht]
\begin{center}
\includegraphics[width=0.8\textwidth]{figure/onco-edited-1}
\end{center}
\vspace*{.05in}
\caption{Data selected for aCML reconstruction annotated with the events which are part of a pattern that we will input to CAPRI.}
\end{figure*}

\FloatBarrier

\subsubsection{Testable hypotheses via logical formulas (i.e., patterns)}

CAPRI is the only algorithm in TRONCO that supports hypotheses-testing of  causal structures expressed as logical formulas with AND, OR and XOR operators.  An example invented formula could be
\begin{center}
(APC:Mutation XOR APC:Deletion) OR CTNNB1:Mutation
\end{center}
where APC mutations and deletions are in disjunctive relation with CTNNB1 mutations; this is done to  test if those events could confer equivalent fitness in terms of ensemble-level progression -- see the original CAPRI paper and the PiCnIc pipeline for detailed explanations.

Every formula is transformed into a CAPRI  ``pattern''. For every hypothesis it is possible  to specify against which possible target event it should be tested, e.g., one might test the above formula against PIK3CA mutations, but not ATM ones. If this is not done, a pattern is tested against all other events in the dataset but those which constitute itself. A pattern tested against one other event is called an hypothesis.

\paragraph{Adding custom hypotheses.}

We   add the hypotheses that are described in CAPRI's manuscript; we start with  hard exclusivity (XOR) for NRAS/KRAS mutation, 
\begin{center}
NRAS:Missense point XOR KRAS:Missense point
\end{center}
tested against all the events in the dataset (default {\tt pattern.effect = *})
 

<<>>=
aCML.hypo = hypothesis.add(aCML.clean, 'NRAS xor KRAS', XOR('NRAS', 'KRAS'))
@


When a pattern is included, a new column in the dataset is created -- whose signature is given by the evaluation of the formula constituting the pattern. We call this operation {\bf lifting of a pattern}, and this shall create not inconsistency in the data -- i.e., it shall not duplicate any of the other columns. TRONCO check this; for instance when we try to include  a soft exclusivity (OR) pattern for the above genes we get an error (not shown).

<<eval=FALSE>>=
aCML.hypo = hypothesis.add(aCML.hypo, 'NRAS or KRAS',  OR('NRAS', 'KRAS'))
@


Notice that TRONCO functions can be used to look at their alterations and understand why the OR signature is equivalent to the XOR one -- this happens as no samples harbour both mutations.


<<onco-kras-nras, fig.show='hide', fig.width=6, fig.height=1, results='hide'>>=
oncoprint(events.selection(aCML.hypo,
    filter.in.names = c('KRAS', 'NRAS')),
    font.row = 8,
    ann.hits = FALSE)
@


We repeated the same analysis as before for other hypotheses and for the same reasons, we will include only the hard exclusivity pattern. In this case we add a two-levels pattern
\begin{center}
SF3B1:Missense point XOR (ASXL1:Ins/Del XOR ASXL1:Nonsense point)
\end{center}
since ASXL1 is mutated in two different ways, and no samples harbour both mutation types. 


<<>>=
aCML.hypo = hypothesis.add(aCML.hypo, 'SF3B1 xor ASXL1', XOR('SF3B1', XOR('ASXL1')),
    '*')
@


Finally, we now do the same for genes TET2 and IDH2. In this case $3$ events for the gene TET2 are present, that is ``Ins/Del'', ``Missense point'' and ``Nonsense point''. For this reason, since we are not specifying any subset of such events to be considered, all TET2 alterations are used. Since the events present a perfect hard exclusivity, their patters will be included as a XOR. 


<<>>=
as.events(aCML.hypo, genes = 'TET2') 
aCML.hypo = hypothesis.add(aCML.hypo,
    'TET2 xor IDH2',
    XOR('TET2', 'IDH2'),
    '*')
aCML.hypo = hypothesis.add(aCML.hypo,
    'TET2 or IDH2',
    OR('TET2', 'IDH2'),
    '*')
@


Which is the following pattern
\begin{center}
(TET2:Ins/Del) XOR (TET2:Missense point) XOR (TET2:Nonsense point) XOR (IDH2:Missense point)
\end{center}
which we can visualize via an{\tt  oncoprint}.


<<onco-tet2-idh2, fig.show='hide', fig.width=7, fig.height=2, results='hide'>>=
oncoprint(events.selection(aCML.hypo,
    filter.in.names = c('TET2', 'IDH2')),
    font.row = 8,
    ann.hits = FALSE)
@


\begin{figure*}[ht]
\begin{center}
\includegraphics[width=0.8\textwidth]{figure/onco-kras-nras-1}
\includegraphics[width=0.8\textwidth]{figure/onco-tet2-idh2-1}
\end{center}
\vspace*{.05in}
\caption{\texttt{oncoprint} output to show the perfect (hard) exclusivity among NRAS/KRAS mutations in aCML on top, and the sofy-one among  TET2 and IDH2 alterations.}
\end{figure*}

\FloatBarrier

\paragraph{Adding (automatically) hypotheses for homologous events.} We consider homologous events those having the same mnemonic name -- as of function \Rfunction{as.genes} -- but events of different type. For instance, mutations and deletions of the same gene would be considered such (e.g., in the aCML dataset ASXL1 Ins/Del and Nonsense point).
It could be a good idea to test such events, in terms of progression fitness, to test is they might be equivalent; we can do that by building a pattern of exclusivity among them. TRONCO has a function to make this automatically which, by default, adds a soft exclusivity OR pattern among them. 


<<>>=
aCML.hypo = hypothesis.add.homologous(aCML.hypo)
@


This function added one  pattern for each of TET2, EZH2, CBL, ASXL1, CSF3R (unless they created duplicated columns in the dataset), with a connective OR/XOR which is appropriate for the events considered; for instance the TET2 homologous pattern 
\begin{center}
(TET2:Ins/Del) XOR (TET2:Missense point) XOR (TET2:Nonsense point) 
\end{center}
was created with a XOR function, as TET2 appears in perfect exclusivity.


\paragraph{Adding (automatically) hypotheses for a group of genes.}

The idea behind the previous function is generalized by \Rfunction{hypothesis.add.group}, that add a set of hypotheses that can be combinatorially created out of a  group of genes. As such, this function can create an exponential number of hypotheses and should be used with caution as too many hypotheses, with respect to sample size, should not be included. 

This function takes, among its inputs, the top-level logical connective, AND/OR/XOR, a minimum/maximum pattern size -- to restrict the combinatorial sampling of subgroups --, plus a parameter that can be used to constrain the minimum event frequency. If, among the events included some of them have homologous, these are put automatically nested with the same logic of the \Rfunction{hypothesis.add.group} function.


<<hypo-add-hom>>=
dataset = hypothesis.add.group(aCML.clean, OR, group = c('SETBP1', 'ASXL1', 'CBL'))
@


\FloatBarrier


The final dataset that will be given as input to CAPRI is  finally shown. Notice the signatures of  all the lifted patterns.

<<onco-priors, fig.show='hide', fig.width=8, fig.height=6.5,results='hide'>>=
oncoprint(aCML.hypo, gene.annot = list(priors = gene.hypotheses), sample.id = TRUE, 
    font.row=10, font.column=5, cellheight=15, cellwidth=4)
@


\begin{figure*}[ht]
\begin{center}
\includegraphics[width=0.9\textwidth]{figure/onco-priors-1}
\end{center}
\vspace*{.05in}
\caption{\texttt{oncoprint} output of the a dataset that has patterns that could be given as input to CAPRI to retrieve a progression model.}
\end{figure*}

\FloatBarrier

\paragraph{Querying, visualizing and manipulating CAPRI's patterns.}


We also provide functions to get the number of hypotheses and patterns present in the data.

<<n-hypo-pat>>=
npatterns(dataset)
nhypotheses(dataset)
@


We can visualize any pattern or the elements involved in them with the following functions.

<<as-patterns>>=
as.patterns(dataset)
as.events.in.patterns(dataset)
as.genes.in.patterns(dataset)
as.types.in.patterns(dataset)
@


Similarily, we can enumerate the  hypotheses with the function \texttt{as.hypotheses}, and delete certain patterns and hypotheses. Deleting a pattern consists in deleting all of its hypotheses.


<<as-hypotheses>>=
head(as.hypotheses(dataset))
dataset = delete.hypothesis(dataset, event = 'TET2')
dataset = delete.pattern(dataset, pattern = 'OR_ASXL1_CBL')
@


\paragraph{How to build a pattern.}

It is sometimes of help to plot some information about a certain combination of events, and a target -- especially to disentangle the proper logical connectives to use when building a pattern. Here, we test genes SETBP1 and ASXL1 versus Missense point mutations of  CSF3R, and observe that the majority of observations are mutually exclusive, but almost half of the  CSF3R mutated samples with Missense point mutations do not harbout any mutation in SETBP1 and ASXL1.


<<pattern-plot,fig.show='hide', fig.width=4, fig.height=2.2>>=
tronco.pattern.plot(aCML,
    group = as.events(aCML, genes=c('SETBP1', 'ASXL1')),
    to = c('CSF3R', 'Missense point'),
    legend.cex=0.8,
    label.cex=1.0)
@


\begin{figure*}[ht]
\begin{center}
\includegraphics[width=0.8\textwidth]{figure/pattern-plot-1}
\end{center}
\caption{Barplot to show an hypothesis: here we test genes SETBP1 and ASXL1 versus Missense point mutations of  CSF3R, which suggests that  that pattern does not ``capture'' all the samples with  CSF3R mutations. }
\end{figure*}

\FloatBarrier

 It is also possible to create a circle plot where we can observe the contribution of genes SETBP1 and ASXL1 in every match with a Missense point mutations of  CSF3R.


<<pattern-plot-circos,fig.show='hide',results='hide',  fig.width=6, fig.height=6>>=
tronco.pattern.plot(aCML,
    group = as.events(aCML, genes=c('TET2', 'ASXL1')),
    to = c('CSF3R', 'Missense point'),
    legend = 1.0,
    label.cex = 0.8,
    mode='circos')
@


\begin{figure*}[ht]
\begin{center}
\includegraphics[width=0.8\textwidth]{figure/pattern-plot-circos-1}
\end{center}
\caption{Circos to show an hypothesis: here we test genes SETBP1 and ASXL1 versus Missense point mutations of  CSF3R. The combination of this and the previous  plots should allow to understand which pattern we shall write in an attempt to capture a potential causality relation between the pattern and the event.
}
\end{figure*}

\FloatBarrier


\subsubsection{Model reconstruction}

We run the inference of the model by CAPRI algorithm with its default parameter: we use both AIC and BIC as regularizators, Hill-climbing as heuristic search of the solutions and exhaustive bootstrap ({\tt nboot} replicates or more for Wilcoxon testing, i.e., more iterations can be performed if samples are rejected), p-value are set at $0.05$. We set the seed for the sake of reproducibility. 

<<delete-description, results='hide', include=FALSE>>==
aCML.hypo = annotate.description(aCML.hypo, '')
aCML.clean = annotate.description(aCML.clean, '')
@

<<model-capri>>=
model.capri = tronco.capri(aCML.hypo, boot.seed = 12345, nboot = 5)
model.capri = annotate.description(model.capri, 'CAPRI - aCML')
@


\subsection{CAPRESE}

The CAPRESE algorithm is one of a set of algorithms to reconstruct progression models from data of an individual patient. This algorithm uses a shrinkage-alike estimator combining correlation and probability raising among pair of events. This algorithm shall return a forest of trees, a special case of a Suppes-Bayes Causal Network.

Despite this is derived to infer progression models from individual level data, we use it here to process aCML data (without patterns and with its default parameters). This algorithm has no bootstrap and, as such, is the quickest available in TRONCO.


<<caprese-plot>>=
model.caprese = tronco.caprese(aCML.clean)
model.caprese = annotate.description(model.caprese, 'CAPRESE - aCML')
@


\FloatBarrier

\subsection{Directed Minimum Spanning Tree with Mutual Information}

This algorithm is meant to  extract a forest of trees of progression from data of an individual patient. This algorithm is based on a 
formulation of the  problem in terms of minimum spamming trees and exploits results from Edmonds. We test it to infer a model from aCML data as we did with CAPRESE.


<<edmonds-plot>>=
model.edmonds = tronco.edmonds(aCML.clean, nboot = 5, boot.seed = 12345)
model.edmonds = annotate.description(model.edmonds, 'MST Edmonds - aCML')
@


\FloatBarrier

\subsection{Partially Directed Minimum Spanning Tree with Mutual Information}

This algorithm extends the previous one in situations where it is not possible to fully assess a confident time ordering among the nodes, hence leading to a partially directed input. This algorithm adopts Gabow search strategy to evaluate the best directed minimum spanning tree among such undirected components. We test it to infer a model from aCML data as all the other algorithms.


<<gabow-plot>>=
model.gabow = tronco.gabow(aCML.clean, nboot = 5, boot.seed = 12345)
model.gabow = annotate.description(model.gabow, 'MST Gabow - aCML')
@


\FloatBarrier


\subsection{Undirected Minimum Spanning Tree with Likelihood-Fit }

This algorithm is meant   to extract a progression  from data of an individual patient, but it is not constrained to retrieve a tree/forest -- i.e., it could retrieve a direct acyclic graph -- according to the level of noise and heterogeneity of the input data. This algorithm is based on a 
formulation of the  problem in terms of minimum spamming trees and exploits results from Chow Liu and other variants for likelihood-fit. Thus, this algorithm is executed with potentially multiple regularizator as CAPRI -- here we use BIC/AIC.

 We test it to aCML data as all the other algorithms.


<<chow-liu-plot>>=
model.chowliu = tronco.chowliu(aCML.clean, nboot = 5, boot.seed = 12345)
model.chowliu = annotate.description(model.chowliu, 'MST Chow Liu - aCML')
@


\FloatBarrier

\subsection{Undirected Minimum Spanning Tree with Mutual Information}

This algorithm is meant   to extract a progression  from data of an individual patient. As the Chow Liu algorithm, this  could retrieve a direct acyclic graph  according to the level of noise and heterogeneity of the input data. This algorithm formulatesf the  problem in terms of undirected minimum spamming trees and exploits results from Prim, which are a generalization of Edomonds' ones.  We test it to aCML data as all the other algorithms.


<<prim-plot>>=
model.prim = tronco.prim(aCML.clean, nboot = 5, boot.seed = 12345)
model.prim = annotate.description(model.prim, 'MST Prim - aCML data')
@


\FloatBarrier


\section{Post-reconstruction}

TRONCO provides functions to plot a model, access information about the probabilities used to extract it from data, and two types of confidence measures: those used to infer the model, and those computed a posteriori from it.

Function \Rfunction{view} provides updated information about a model if this is available. 

<<>>=
view(model.capri)
@


\subsection{Visualizing a reconstructed model}

We can plot a model by using function \Rfunction{tronco.plot}. Here, we plot the aCML model inferred by CAPRI with BIC and AIC as a regolarizator. We set some parameters to get a nice plot (scaling etc.), and distinguish the edges detected by the two regularization techniques. The confidence of each edge is shown  in terms of temporal priority and probability raising (selective advantage scores) and hypergeometric testing (statistical relevance of the dataset of input). Events are annotated as in the oncoprint, edge p-values above a minium threshold (default 0.05) are red.


<<capri-plot,fig.show='hide',fig.width=4,fig.height=4,warning=FALSE>>=
tronco.plot(model.capri, 
    fontsize = 12, 
    scale.nodes = 0.6, 
    confidence = c('tp', 'pr', 'hg'), 
    height.logic = 0.25, 
    legend.cex = 0.35, 
    pathways = list(priors = gene.hypotheses), 
    label.edge.size = 10)
@


\begin{figure*}[ht]
\begin{center}
\includegraphics[width=0.8\textwidth]{figure/capri-plot-1}
\end{center}
\caption{aCML model reconstructed by CAPRI with  AIC/BIC as  regolarizators; the confidence of each edge is shown both in terms of temporal priority and probability raising (selective advantage scores) and hypergeometric testing (statistical relevance of the dataset of input).}
\end{figure*}

\FloatBarrier

We can also make a multiplot with this function,  which in this case we do by showing the models inferred by the other algorithms based on Minimum Spanning Trees.


<<mst-plot,fig.show='hide',results='hide',fig.width=7,fig.height=7,warning=FALSE>>=
par(mfrow = c(2,2))
tronco.plot(model.caprese, fontsize = 22, scale.nodes = 0.6, legend = FALSE)
tronco.plot(model.edmonds, fontsize = 22, scale.nodes = 0.6, legend = FALSE)
tronco.plot(model.chowliu, fontsize = 22, scale.nodes = 0.6, legend.cex = .7)
tronco.plot(model.prim, fontsize = 22, scale.nodes = 0.6, legend = FALSE)
@


\begin{figure*}[ht]
\begin{center}
\includegraphics[width=1\textwidth]{figure/mst-plot-1}
\end{center}
\caption{aCML data processed model by algorithms to extract models from individual patients, we show the otput of  CAPRESE,  and all algorithms based on Minimum Spanning Trees (Edmonds, Chow Liu and Prim). Only the  model retrieved by Chow Liu has two different edge colors as it was regularized with two different strategies: AIC and BIC.
}
\end{figure*}

\subsection{Accessing information within a model (e.g., confidence)}

We can visualize a summary of the parameters   used for the reconstruction, test if an object has a model or delete it (which shall be done to retrieve the original dataset).

<<>>=
 as.data.frame(as.parameters(model.capri))
has.model(model.capri)
dataset = delete.model(model.capri)
@


\paragraph{Model structure.}
A set of functions can be used to visualize the content of object which contains the reconstructed model. For instance, we can access the adjacency matrix of a model by using \Rfunction{as.adj.matrix} which will return a matrix for each one of the regularizators used -- in this case because CAPRI was run with both BIC/AIC.


<<>>=
str(as.adj.matrix(model.capri))
@


\paragraph{Empirical probabilities.}

Every model is inferred by estimating the empirical marginal, joint and conditional probabilities for all the events, from input data. These in some cases are estimated by a bootstrap procedure (see the algorithms implemented). TRONCO has functions to extract such table, that could be in turn printed by using external  functions for, e.g., heatmap visualization (see below for an example via the \CRANpkg{pheatmap} package). We show these functions working with the CAPRI model; in this case the tables are the same for both BIC/AIC structures as they are computed before performing penalized likelihood-fit. The marginal  $P(x)$ for $x$ an event in the dataset are obtained by \Rfunction{as.marginal.probs}. 

<<>>=
marginal.prob = as.marginal.probs(model.capri)
head(marginal.prob$capri_bic)
@


Similarly,   the joint $P(x,y)$ for every pair of events in the dataset is given by \Rfunction{as.joint.probs}.

<<>>=
joint.prob = as.joint.probs(model.capri, models='capri_bic')
joint.prob$capri_bic[1:3, 1:3]
@


And \Rfunction{as.conditional.probs} finally gives the conditional $P(x\mid y)$ for every edge in the dataset.

<<>>=
conditional.prob = as.conditional.probs(model.capri, models='capri_bic')
head(conditional.prob$capri_bic)
@


\paragraph{Confidence measures.}

Confidence scores  can be accessed by function \Rfunction{as.confidence}, which takes as parameter the type of confidence measure that one wants to access to. This will work for either confidence measures assessed before reconstructing the model -- if available --, or afterwards.

<<>>=
str(as.confidence(model.capri, conf = c('tp', 'pr', 'hg')))
@


Other functions  visualize  tables summarizing the  statistics for each edge in the model, For instance, if one uses function \Rfunction{as.selective.advantage.relations}  the  p-values for temporal priority, probability raising and hypergeometric testing, as well as other information about each edge  can be accessed, e.g., the number of observations for the upstream and the downstream events.


<<selective-advantage>>=
as.selective.advantage.relations(model.capri)
@


\subsection{Confidence via non-parametric and statistical bootstrap}

TRONCO provides three different strategies to perform  bootstrap and assess confidence of each edge in terms of a score in the range [0, 100] (100 is the highest confidence). Non-parametric (default) and statistical bootstrap strategies are available, and can be executed by calling function \Rfunction{tronco.bootstrap} with {\tt type} parameter set appropriately. This function is parallel, and parameter {\tt cores.ratio} (default 1) can be used to percentage of available cores that shall be used to compute the scores.
Parameter {\tt nboot} controls the number of bootstrap iterations.


<<>>=
model.boot = tronco.bootstrap(model.capri, nboot = 3)
model.boot = tronco.bootstrap(model.boot, nboot = 3, type = 'statistical')
@


Bootstrap scores can be annotated to the \Rfunction{tronco.plot} output by setting them via the confidence parameter {\tt confidence=c('npb', 'sb')}. In this case edge thickness will be proportional to the non-parametric ({\tt npb}) scores -- the last to appear in the {\tt confidence} parameter.

<<figplotboot,fig.show='hide',fig.width=4,fig.height=4,warning=FALSE>>=
tronco.plot(model.boot, 
    fontsize = 12, 
    scale.nodes = .6,   
    confidence=c('sb', 'npb'), 
    height.logic = 0.25, 
    legend.cex = .35, 
    pathways = list(priors= gene.hypotheses), 
    label.edge.size=10)
@


\begin{figure*}[ht]
\begin{center}
\includegraphics[width=0.8\textwidth]{figure/figplotboot-1}
\end{center}
\caption{aCML model reconstructed by CAPRI with  AIC/BIC as  regolarizators and annotated with both non-parametric and statistical bootstrap scores. 
Edge thickness is proportional to   the non-parametric  scores.}
\end{figure*}

\FloatBarrier


Bootstrap scores can extracted or visualized even with other TRONCO functions. For instance, we can accessall scores via \Rfunction{as.bootstrap.scores}, which resembles  function \Rfunction{as.selective.advantage.relations} and will display the scores per edge. Notice that even function \Rfunction{view} gives an update output by mentioning the available bootstrap scores.


<<bootstrap-table>>=
as.bootstrap.scores(model.boot)
view(model.boot)
@


If we want to access a matrix with the scores and visualize that in a heatmap we can use for instance the \Rfunction{pheatmap} function of TRONCO. In this case we need to use also function \Rfunction{keysToNames} to translate internal TRONCO keys to mnemonic names in the plot 


<<hboot,fig.show='hide',fig.width=7,fig.height=7>>=
pheatmap(keysToNames(model.boot, as.confidence(model.boot, conf = 'sb')$sb$capri_aic) * 100, 
           main =  'Statistical bootstrap scores for AIC model',
           fontsize_row = 6,
           fontsize_col = 6,
           display_numbers = TRUE,
           number_format = "%d"
           )
@


\begin{figure*}[ht]
\begin{center}
\includegraphics[width=0.7\textwidth]{figure/hboot-1} 
\end{center}
\vspace*{.05in}
\caption{Heatmap of the bootstrap scores for the CAPRI aCML model (via AIC regularization).}
\end{figure*}

\FloatBarrier

\subsection{Confidence via cross-validation (entropy loss, prediction and posterior classification errors)}

TRONCO implements $k$-fold cross-validation routines (from the \CRANpkg{bnlearn} package) to provide estimates of the following statistics:
\begin{itemize}
\item  the {\em negative entropy} (via \Rfunction{tronco.kfold.eloss}) of a whole model ? i.e., the negated expected log-likelihood of the test set for the Bayesian network fitted from the training set.
\item the {\em prediction error} (via \Rfunction{tronco.kfold.prederr}) for a single node $x$ and its parents set $X$ -- i.e.,  how precisely we can predict the values of $x$ by using only the information present in its local distribution, via $X$.
\item the {\em posterior classification error} (via \Rfunction{tronco.kfold.posterr})  for a single node $x$ and one of its parent node $y \in X$ -- i.e., the values of $x$ are predicted using only the information present in $y$ by likelihood weighting and Bayesian posterior estimates.
\end{itemize}

By default, a 10 repetitions from  10-fold cross-validation experiment are perfomed, for all the models which are found inside a TRONCO object -- in this case 2, one for CAPRI with BIC and one for CAPRI with AIC.
 

<<kfold>>=
model.boot = tronco.kfold.eloss(model.boot)
model.boot = tronco.kfold.prederr(model.boot, runs = 2)
model.boot = tronco.kfold.posterr(model.boot, runs = 2)
@


These results can be visualized in terms of summary tables, as for the other confidence scores.


<<as-kfold-ex>>=
as.kfold.eloss(model.boot)
as.kfold.prederr(model.boot)
as.kfold.posterr(model.boot)
@


Notice that these can be combined to create a nice table with all these statistics  -- we make here the example of a table with all the BIC statistics. This format can be readily exported to external spreadsheets for further visualization.

<<as-kfold-tab>>=
tabular = function(obj, M){
    tab = Reduce(
        function(...) merge(..., all = TRUE), 
            list(as.selective.advantage.relations(obj, models = M),
                as.bootstrap.scores(obj, models = M),
                as.kfold.prederr(obj, models = M),
                as.kfold.posterr(obj,models = M)))
  
    # merge reverses first with second column
    tab = tab[, c(2,1,3:ncol(tab))]
    tab = tab[order(tab[, paste(M, '.NONPAR.BOOT', sep='')], na.last = TRUE, decreasing = TRUE), ]
    return(tab)
}

head(tabular(model.boot, 'capri_bic'))
@


We finally show the plot of the model with the confidences by cross-validation.


<<plot-conf,fig.show='hide',fig.width=4,fig.height=4,warning=FALSE>>=
tronco.plot(model.boot, 
    fontsize = 12, 
    scale.nodes = .6, 
    confidence=c('npb', 'eloss', 'prederr', 'posterr'), 
    height.logic = 0.25, 
    legend.cex = .35, 
    pathways = list(priors= gene.hypotheses), 
    label.edge.size=10)
@


\begin{figure*}[ht]
\begin{center}
\includegraphics[width=0.8\textwidth]{figure/plot-conf-1}
\end{center}
\caption{aCML model reconstructed by CAPRI with  AIC/BIC as regolarizators and annotated with  non-parametric, as well as with entropy loss, prediction and posterior classification errors computed via cross-validation. Edge thickness is proportional to   the non-parametric  scores.}
\end{figure*}

\FloatBarrier


\section{Import/export to other tools}

We implemented the interface of TRONCO with other tools to support the {\bf Pipeline for Cancer Inference PiCnIc}, our attempt at devise an effective pipeline to extract ensemble-level cancer progression models from cross-sectional data (see \ref{sec:stuff}). 


PiCnIc is versatile, modular and customizable and exploits state-of-the-art data processing and machine learning tools to: 
\begin{enumerate}
\item  identify tumor subtypes and then in each subtype;
\item select (epi)genomic events driving the progression;
\item identify groups of events that are likely to be observed as mutually exclusive;
\item infer progression models from groups and such data, and annotate them with associated statistical confidence.
\end{enumerate}

The algorithms for cancer progression inference exploited by PicNiC are implemented within TRONCO, the other steps of the pipeline rely on dedicated tools (for clustering, drivers selection and exclusiviyt groups detection). The tools that PicNiC can exploit are of different nature, and we plan to interface them with TRONCO as far as our case studies are developed.

The current version of TRONCO supports input/output towards these tools:
\begin{enumerate}
\item \href{http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3866081/}{Network Based Stratification (NBS)},  a method for stratification (clustering) of patients in a cancer cohort based on genome scale somatic mutations measurements and a gene interaction network. You can export a TRONCO object in the NBS input format with  function \Rfunction{export.nbs.input}, clustering outputs can be handled with standard TRONCO functions.
\item \href{http://www.ncbi.nlm.nih.gov/pubmed/25887147}{MUTEX}, a method for the identification of sets of mutually exclusive gene alterations in a given set of genomic profiles by scanning the groups of genes with a common downstream effect on the signaling network. You can export a TRONCO object in the MUTEX input format with  function \Rfunction{export.mutex}, and output results can be imported with function \Rfunction{import.mutex.groups}.
\end{enumerate}

Finally we also privede the possibility of exporting the inferred model to {\tt graphML} format, which can be subsequently imported to \software{Cytoscape}. We now provide an example of such function.

<<graphml, fig.show='hide',warning=FALSE>>=
export.graphml(model.boot,
    file = 'graph.gml',
    fontsize = 12, 
    scale.nodes = .6, 
    height.logic = 0.25)
@

Follows an example of file generated by such function. Futhermore on the TRONCO website a \software{Cytoscape} style to visualize the progressions is available.

\begin{verbatim}
<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns
         http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
  <!-- Created by igraph -->
  <key id="g_name" for="graph" attr.name="name" attr.type="string"/>
  <key id="g_models" for="graph" attr.name="models" attr.type="string"/>
  <key id="g_informations" for="graph" attr.name="informations" attr.type="string"/>
  <key id="v_name" for="node" attr.name="name" attr.type="string"/>
  <key id="v_label" for="node" attr.name="label" attr.type="string"/>
  <key id="v_type" for="node" attr.name="type" attr.type="string"/>
  <key id="v_fillcolor" for="node" attr.name="fillcolor" attr.type="string"/>
  <key id="v_fontcolor" for="node" attr.name="fontcolor" attr.type="string"/>
  <key id="v_bordercolor" for="node" attr.name="bordercolor" attr.type="string"/>
  <key id="v_shape" for="node" attr.name="shape" attr.type="string"/>
  <key id="v_width" for="node" attr.name="width" attr.type="double"/>
  <key id="v_height" for="node" attr.name="height" attr.type="double"/>
  <key id="v_fontsize" for="node" attr.name="fontsize" attr.type="double"/>
  <key id="v_borderwidth" for="node" attr.name="borderwidth" attr.type="double"/>
  <key id="e_weight" for="edge" attr.name="weight" attr.type="double"/>
  <key id="e_line" for="edge" attr.name="line" attr.type="string"/>
  <key id="e_arrow" for="edge" attr.name="arrow" attr.type="string"/>
  <key id="e_color" for="edge" attr.name="color" attr.type="string"/>
  <graph id="G" edgedefault="directed">
    <data key="g_name">CAPRI - aCML</data>
    <data key="g_models">CAPRI capri_bic - CAPRI capri_aic</data>
    <data key="g_informations">Generated with TRONCO v2.3.0</data>
    <node id="n0">
      <data key="v_name">CSF3R</data>
      <data key="v_label">CSF3R 4% (3)</data>
      <data key="v_type">Nonsense point</data>
      <data key="v_fillcolor">#FAB3D8</data>
      <data key="v_fontcolor">#000000</data>
      <data key="v_bordercolor">#000000</data>
      <data key="v_shape">ellipse</data>
      <data key="v_width">109.2</data>
      <data key="v_height">72.8</data>
      <data key="v_fontsize">18</data>
      <data key="v_borderwidth">1</data>
    </node>
    [...]
    <edge source="n1" target="n0">
      <data key="e_weight">1</data>
      <data key="e_line">Solid</data>
      <data key="e_arrow">True</data>
      <data key="e_color">#A9A9A9</data>
    </edge>
    [...]
  </graph>
</graphml>
\end{verbatim}

\section{\Rcode{sessionInfo()}}

<<sessioninfo, results='asis', eval=TRUE, echo=FALSE>>=
toLatex(sessionInfo())
@

\end{document}