---
title: "MultiAssayExperiment: The Integrative Bioconductor Container"
author: "MultiAssay Special Interest Group"
date: "`r format(Sys.time(), '%B %d, %Y')`"
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Coordinating Analysis of Multi-Assay Experiments}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document:
    number_sections: yes
    toc: true
---

# Installation

```{r, eval = FALSE}
if (!require("BiocManager"))
    install.packages("BiocManager")
BiocManager::install("MultiAssayExperiment")
```

Loading the packages:

```{r,include=TRUE,results="hide",message=FALSE,warning=FALSE}
library(MultiAssayExperiment)
library(GenomicRanges)
library(SummarizedExperiment)
library(RaggedExperiment)
```

# A Brief Description

`MultiAssayExperiment` offers a data structure for representing and
analyzing multi-omics experiments: a biological analysis approach utilizing
multiple types of observations, such as DNA mutations and abundance of RNA
and proteins, in the same biological specimens.

## Choosing the appropriate data structure

For assays with different numbers of rows and even columns,
`MultiAssayExperiment` is recommended. For sets of assays with the same
information across all rows (e.g., genes or genomic ranges),
`SummarizedExperiment` is the recommended data structure.

# Overview of the `MultiAssayExperiment` class

Here is an overview of the class and its constructors and extractors:

```{r}
empty <- MultiAssayExperiment()
empty
slotNames(empty)
```

A visual representation of the MultiAssayExperiment class and its accessor
functions can be seen below. There are three main components:

1) `ExperimentList`
2) `colData`
3) `sampleMap`

```{r, echo = FALSE, fig.cap = "MultiAssayExperiment object schematic shows the design of the infrastructure class. The colData provides data about the patients, cell lines, or other biological units, with one row per unit and one column per variable. The experiments are a list of assay datasets of arbitrary class, with one column per observation. The sampleMap links a single table of patient data (colData) to a list of experiments via a simple but powerful table of experiment:patient edges (relationships), that can be created automatically in simple cases or in a spreadsheet if assay-specific sample identifiers are used. sampleMap relates each column (observation) in the assays (experiments) to exactly one row (biological unit) in colData; however, one row of colData may map to zero, one, or more columns per assay, allowing for missing and replicate assays. Green stripes indicate a mapping of one subject to multiple observations across experiments.", out.width = "\\maxwidth"}
knitr::include_graphics("MultiAssayExperiment.png")
```

## Components of the `MultiAssayExperiment`

### `ExperimentList`: experimental data

The `ExperimentList` slot and class is the container workhorse for the
`MultiAssayExperiment` class. It contains all the experimental data. It inherits
from class `S4Vectors::SimpleList` with one element/component per data type.

```{r}
class(experiments(empty)) # ExperimentList
```

The elements of the `ExperimentList` can contain **ID-based** and
**range-based** data. Requirements for all classes in the `ExperimentList`
are listed in the API.

The following base and Bioconductor classes are known to work as elements of
the ExperimentList:

- `base::matrix`: the base class, can be used for ID-based datasets such as
gene expression summarized per-gene, microRNA, metabolomics, or microbiome
data.

- `SummarizedExperiment::SummarizedExperiment`: A richer representation compared
to a ordinary matrix of ID-based datasets capable of storing additional assay-
level metadata.

- `Biobase::ExpressionSet`: A *legacy* representation of ID-based datasets,
supported for convenience and supplanted by `SummarizedExperiment`.

- `SummarizedExperiment::RangedSummarizedExperiment`: For rectangular
range-based datasets, one set of genomic ranges are assayed for multiple
samples. It can be used for gene expression, methylation, or other data
types that refer to genomic positions.

- `RaggedExperiment::RaggedExperiment`: For range-based datasets, such as
copy number and mutation data, the `RaggedExperiment` class can be used
to represent measurements by genomic positions.

#### Class requirements within `ExperimentList` container

See the [API section](#application-programming-interface-api) for details on
requirements for using other data classes. In general, data classes meeting
minimum requirements, including support for square bracket `[` subsetting and
`dimnames()` will work by default.

The datasets contained in elements of the `ExperimentList` can have:

* column names (required)
* row names (optional)

The column names correspond to samples, and are used to match assay data to
specimen metadata stored in `colData`.

The row names can correspond to a variety of features in the data including
but not limited to gene names, probe IDs, proteins, and named ranges.
Note that the existence of "row" names does *not* mean the data must be
rectangular or matrix-like.

Classes contained in the `ExperimentList` must support the following list of
methods:

- `[`: single square bracket subsetting, with a single comma. It is assumed
that values before the comma subset rows, and values after the comma subset
columns.
- `dimnames()` : corresponding to features (such as genes, proteins, etc.)
and experimental samples
- `dim()`: returns a vector of the number of rows and number of columns

### `colData`: primary data

The `MultiAssayExperiment` keeps one set of "primary" metadata that describes
the 'biological unit' which can refer to specimens, experimental subjects,
patients, etc. In this vignette, we will refer to each experimental subject as
a *patient*.

#### `colData` slot requirements

The `colData` dataset should be of class `DataFrame` but can accept a
`data.frame` class object that will be coerced.

In order to relate metadata of the biological unit, the row names of the
`colData` dataset must contain patient identifiers.

```{r}
patient.data <- data.frame(sex=c("M", "F", "M", "F"),
    age=38:41,
    row.names=c("Jack", "Jill", "Bob", "Barbara"))
patient.data
```

Key points:

- one row of `colData` *can* map to zero, one, or more columns in any `ExperimentList` element
- each row of `colData` *must* map to at least one column in at least one `ExperimentList` element.
- each column of each `ExperimentList` element *must* map to *exactly* one row of `colData`.

These relationships are defined by the [sampleMap](#sampleMap).

#### Note on the flexibility of the `DataFrame`

For many typical purposes the `DataFrame` and `data.frame` behave equivalently;
but the `Dataframe` is more flexible as it allows any vector-like data type
to be stored in its columns. The flexibility of the `DataFrame` permits, for
example, storing multiple dose-response values for a single cell line, even
if the number of doses and responses is not consistent across all cell lines.
Doses could be stored in one column of `colData` as a `SimpleList`, and
responses in another column, also as a `SimpleList`. Or, dose-response values
could be stored in a single column of `colData` as a two-column matrix for
each cell line.

### `sampleMap`: relating `colData` to multiple assays {#sampleMap}

The `sampleMap` is a `DataFrame` that relates the "primary" data
(`colData`) to the experimental assays:

```{r}
is(sampleMap(empty), "DataFrame") # TRUE
```

The `sampleMap` provides an unambiguous map from every experimental
observation to *one and only one* row in `colData`. It is, however, permissible
for a row of `colData` to be associated with multiple experimental observations
or no observations at all.  In other words, there is a "many-to-one" mapping
from experimental observations to rows of `colData`, and a "one-to-any-number"
mapping from rows of `colData` to experimental observations.

#### `sampleMap` structure

The `sampleMap` has three columns, with the following column names:

1. **assay** provides the names of the different experiments / assays
performed. These are user-defined, with the only requirement that the names
of the `ExperimentList`, where the experimental assays are stored, must be
contained in this column.

2. **primary** provides the "primary" sample names. All values in this column
must also be present in the rownames of `colData(MultiAssayExperiment)`.
In this example, allowable values in this column are "Jack", "Jill",
"Barbara", and "Bob".

3. **colname** provides the sample names used by experimental datasets, which
in practice are often different than the primary sample names. For each assay,
all column names must be found in this column. Otherwise, those assays would
be orphaned: it would be impossible to match them up to samples in the overall
experiment. As mentioned above, duplicate values are allowed, to represent
replicates with the same overall experiment-level annotation.

This design is motivated by the following situations:

1. It allows flexibility for any amount of technical replication and biological
replication (such as tumor and matched normal for a single patient) of
individual assays.
2. It allows missing observations (such as RNA-seq performed only for some of
the patients).
3. It allows the use of different identifiers to be used for patients /
specimens and for each assay. These different identifiers are matched
unambiguously, and consistency between them is maintained during subsetting
and re-ordering.

##### Instances where `sampleMap` isn't provided

If each assay uses the same colnames (i.e., if the same sample identifiers are
used for each experiment), a simple list of these datasets is sufficient for
the `MultiAssayExperiment` constructor function. It is not necessary for
them to have the same rownames or colnames:

```{r}
exprss1 <- matrix(rnorm(16), ncol = 4,
        dimnames = list(sprintf("ENST00000%i", sample(288754:290000, 4)),
                c("Jack", "Jill", "Bob", "Bobby")))
exprss2 <- matrix(rnorm(12), ncol = 3,
        dimnames = list(sprintf("ENST00000%i", sample(288754:290000, 4)),
                c("Jack", "Jane", "Bob")))
doubleExp <- list("methyl 2k"  = exprss1, "methyl 3k" = exprss2)
simpleMultiAssay <- MultiAssayExperiment(experiments=doubleExp)
simpleMultiAssay
```

In the above example, the user did not provide the `colData` argument so the
constructor function filled it with an empty `DataFrame`:

```{r}
colData(simpleMultiAssay)
```

But the `colData` can be provided. Here, note that any assay sample (column)
that cannot be mapped to a corresponding row in the provided `colData` gets
dropped. This is part of ensuring internal validity of the
`MultiAssayExperiment`.

```{r}
simpleMultiAssay2 <- MultiAssayExperiment(experiments=doubleExp,
                                          colData=patient.data)
simpleMultiAssay2
colData(simpleMultiAssay2)
```

### metadata

Metadata can be added at different levels of the `MultiAssayExperiment`.

Can be of *ANY* class, for storing study-wide metadata, such as citation
information. For an empty `MultiAssayExperiment` object, it is NULL.

```{r}
class(metadata(empty)) # NULL (class "ANY")
```

At the `ExperimentList` level, the `metadata` function would allow the user to
enter metadata as a `list`.

```{r}
metadata(experiments(empty))
```

At the individual assay level, certain classes _may_ support metadata, for
example, `metadata` and `mcols` for a `SummarizedExperiment`. It is recommended
to use `metadata` at the `ExperimentList` level.

<p style="text-align: right;"> <a href="#header">back to top</a> </p>

# Creating a `MultiAssayExperiment` object: a rich example

In this section we demonstrate all core supported data classes, using different
sample ID conventions for each assay, with primary `colData`. The some
supported classes such as, `matrix`, `SummarizedExperiment`, and `RangedSummarizedExperiment`.

## Create toy datasets demonstrating all supported data types

We have three matrix-like datasets. First, let's represent expression data as
a `SummarizedExperiment`:

```{r, message=FALSE}
(arraydat <- matrix(seq(101, 108), ncol=4,
    dimnames=list(c("ENST00000294241", "ENST00000355076"),
    c("array1", "array2", "array3", "array4"))))

coldat <- data.frame(slope53=rnorm(4),
    row.names=c("array1", "array2", "array3", "array4"))

exprdat <- SummarizedExperiment(arraydat, colData=coldat)
exprdat
```

The following map matches `colData` sample names to `exprdata` sample
names. Note that row orders aren't initially matched up, and this is OK.

```{r}
(exprmap <- data.frame(primary=rownames(patient.data)[c(1, 2, 4, 3)],
                       colname=c("array1", "array2", "array3", "array4"),
                       stringsAsFactors = FALSE))
```

Now methylation data, which we will represent as a `matrix`. It uses
gene identifiers also, but measures a partially overlapping set of genes.
Now, let's store this as a simple matrix which can contains a replicate
for one of the patients.

```{r}
(methyldat <-
   matrix(1:10, ncol=5,
          dimnames=list(c("ENST00000355076", "ENST00000383706"),
                        c("methyl1", "methyl2", "methyl3",
                          "methyl4", "methyl5"))))
```

The following map matches `colData` sample names to `methyldat` sample
names.

```{r}
(methylmap <- data.frame(primary = c("Jack", "Jack", "Jill", "Barbara", "Bob"),
    colname = c("methyl1", "methyl2", "methyl3", "methyl4", "methyl5"),
    stringsAsFactors = FALSE))
```

Now we have a microRNA platform, which has no common identifiers with the
other datasets, and which we also represent as a `matrix`. It
is also missing data for "Jill". We will use the same sample naming
convention as we did for arrays.

```{r}
(microdat <- matrix(201:212, ncol=3,
                    dimnames=list(c("hsa-miR-21", "hsa-miR-191",
                                    "hsa-miR-148a", "hsa-miR148b"),
                                  c("micro1", "micro2", "micro3"))))
```

And the following map matches `colData` sample names to `microdat` sample names.

```{r}
(micromap <- data.frame(primary = c("Jack", "Barbara", "Bob"),
    colname = c("micro1", "micro2", "micro3"), stringsAsFactors = FALSE))
```

Finally, we create a dataset of class `RangedSummarizedExperiment`:

```{r}
nrows <- 5; ncols <- 4
counts <- matrix(runif(nrows * ncols, 1, 1e4), nrows)
rowRanges <- GRanges(rep(c("chr1", "chr2"), c(2, nrows - 2)),
    IRanges(floor(runif(nrows, 1e5, 1e6)), width=100),
    strand=sample(c("+", "-"), nrows, TRUE),
    feature_id=sprintf("ID\\%03d", 1:nrows))
names(rowRanges) <- letters[1:5]
colData <- DataFrame(Treatment=rep(c("ChIP", "Input"), 2),
    row.names= c("mysnparray1", "mysnparray2", "mysnparray3", "mysnparray4"))
rse <- SummarizedExperiment(assays=SimpleList(counts=counts),
    rowRanges=rowRanges, colData=colData)
```

And we map the `colData` samples to the `RangedSummarizedExperiment`:

```{r}
(rangemap <-
    data.frame(primary = c("Jack", "Jill", "Bob", "Barbara"),
    colname = c("mysnparray1", "mysnparray2", "mysnparray3", "mysnparray4"),
        stringsAsFactors = FALSE))
```

## `sampleMap` creation

The `MultiAssayExperiment` constructor function can create the `sampleMap`
automatically if a single naming convention is used, but in this example
it cannot because we used platform-specific sample identifiers
(e.g. mysnparray1, etc). So we must provide an ID map that matches the
samples of each experiment back to the `colData`, as a three-column
`data.frame` or `DataFrame` with three columns named "assay", primary", and
"colname". Here we start with a list:

```{r}
listmap <- list(exprmap, methylmap, micromap, rangemap)
names(listmap) <- c("Affy", "Methyl 450k", "Mirna", "CNV gistic")
listmap
```

and use the convenience function `listToMap` to convert the list of
`data.frame` objects to a valid object for the `sampleMap`:

```{r}
dfmap <- listToMap(listmap)
dfmap
```

Note, `dfmap` can be reverted to a list with another provided function:

```{r, eval=FALSE}
mapToList(dfmap, "assay")
```

## Experimental data as a `list()`

Create an named list of experiments for the `MultiAssayExperiment` function.
All of these names must be found within in the third column of `dfmap`:

```{r}
objlist <- list("Affy" = exprdat, "Methyl 450k" = methyldat,
    "Mirna" = microdat, "CNV gistic" = rse)
```

## Creation of the `MultiAssayExperiment` class object

We recommend using the `MultiAssayExperiment` constructor function:

```{r}
myMultiAssay <- MultiAssayExperiment(objlist, patient.data, dfmap)
myMultiAssay
```

The following extractor functions can be used to get extract data from
the object:

```{r}
experiments(myMultiAssay)
colData(myMultiAssay)
sampleMap(myMultiAssay)
metadata(myMultiAssay)
```

Note that the `ExperimentList` class extends the `SimpleList` class to add some
validity checks specific to `MultiAssayExperiment`.  It can be used like
a list.

## Helper function to create a `MultiAssayExperiment` object

The `prepMultiAssay` function helps diagnose common problems when creating a
`MultiAssayExperiment` object. It provides error messages and/or warnings in
instances where names (either `colnames` or `ExperimentList` element names) are
inconsistent with those found in the sampleMap. Input arguments are the same
as those in the `MultiAssayExperiment` (i.e., `ExperimentList`, `colData`,
`sampleMap`). The resulting output of the `prepMultiAssay` function is a list
of inputs including a "metadata$drops" element for names that were not able to be
matched.

Instances where `ExperimentList` is created without names will prompt an error
from `prepMultiAssay`. Named `ExperimentList` elements are essential for checks
in `MultiAssayExperiment`.

```{r}
objlist3 <- objlist
(names(objlist3) <- NULL)

try(prepMultiAssay(objlist3, patient.data, dfmap)$experiments,
    outFile = stdout())
```

Non-matching names may also be present in the `ExperimentList` elements and the
"assay" column of the `sampleMap`. If names only differ by case and are
identical and unique, names will be standardized to lower case and replaced.

```{r}
names(objlist3) <- toupper(names(objlist))
names(objlist3)
unique(dfmap[, "assay"])
prepMultiAssay(objlist3, patient.data, dfmap)$experiments
```

When `colnames` in the `ExperimentList` cannot be matched back to the primary
data (`colData`), these will be dropped and added to the drops element.

```{r}
exampleMap <- sampleMap(simpleMultiAssay2)
sapply(doubleExp, colnames)
exampleMap
prepMultiAssay(doubleExp, patient.data, exampleMap)$metadata$drops
```

A similar operation is performed for checking "primary" `sampleMap` names and
`colData` rownames. In this example, we add a row corresponding to "Joe" that
does not have a match in the experimental data.

```{r}
exMap <- rbind(dfmap,
    DataFrame(assay = "New methyl", primary = "Joe",
        colname = "Joe"))
invisible(prepMultiAssay(objlist, patient.data, exMap))
```

To create a `MultiAssayExperiment` from the results of the `prepMultiAssay`
function, take each corresponding element from the resulting list and enter
them as arguments to the `MultiAssayExperiment` constructor function.

```{r}
prepped <- prepMultiAssay(objlist, patient.data, exMap)
preppedMulti <- MultiAssayExperiment(prepped$experiments, prepped$colData,
    prepped$sampleMap, prepped$metadata)
preppedMulti
```

Alternatively, use the `do.call` function to easily create a `MultiAssayExperiment`
from the output of `prepMultiAssay` function:

```{r}
do.call(MultiAssayExperiment, prepped)
```

## Helper functions to create `Bioconductor` classes from raw data

Recent updates to the `GenomicRanges` and `SummarizedExperiment` packages
allow the user to create standard _Bioconductor_ classes from raw data. Raw
data read in as either `data.frame` or `DataFrame` can be converted to
`GRangesList` or `SummarizedExperiment` classes depending on the type of data.

The function to create a `GRangesList` from a `data.frame`, called
`makeGRangesListFromDataFrame` can be found in the `GenomicRanges` package.
`makeSummarizedExperimentFromDataFrame` is available in the
`SummarizedExperiment` package. It is also possible to create a
`RangedSummarizedExperiment` class object from raw data when ranged data is
available.

A simple example can be obtained from the function documentation in
`GenomicRanges`:

```{r}
grlls <- list(chr = rep("chr1", nrows), start = seq(11, 15),
    end = seq(12, 16), strand = c("+", "-", "+", "*", "*"),
    score = seq(1, 5), specimen = c("a", "a", "b", "b", "c"),
    gene_symbols = paste0("GENE", letters[seq_len(nrows)]))

grldf <- as.data.frame(grlls, stringsAsFactors = FALSE)

GRL <- makeGRangesListFromDataFrame(grldf, split.field = "specimen",
    names.field = "gene_symbols")
```

This can then be converted to a `RaggedExperiment` object for a
rectangular representation that will conform more easily to the
`MultiAssayExperiment` API requirements.

```{r}
RaggedExperiment(GRL)
```

_Note_. See the `RaggedExperiment` vignette for more details.

In the `SummarizedExperiment` package:

```{r}
sels <- list(chr = rep("chr2", nrows), start = seq(11, 15),
    end = seq(12, 16), strand = c("+", "-", "+", "*", "*"),
    expr0 = seq(3, 7), expr1 = seq(8, 12), expr2 = seq(12, 16))
sedf <- as.data.frame(sels,
    row.names = paste0("GENE", letters[rev(seq_len(nrows))]),
    stringsAsFactors = FALSE)
sedf
makeSummarizedExperimentFromDataFrame(sedf)
```

<p style="text-align: right;"> <a href="#header">back to top</a> </p>

# Integrated subsetting across experiments

`MultiAssayExperiment` allows subsetting by rows, columns, and assays,
rownames, and colnames, across all experiments simultaneously while
guaranteeing continued matching of samples.

Subsetting can be done most compactly by the square bracket method, or more
verbosely and potentially more flexibly by the `subsetBy*()` methods.

## Subsetting by square bracket `[`

The three positions within the bracket operator indicate rows, columns, and
assays, respectively (pseudocode):

```{r, eval=FALSE}
myMultiAssay[rows, columns, assays]
```

For example, to select the gene "ENST00000355076":

```{r}
myMultiAssay["ENST00000355076", , ]
```

The above operation works across all types of assays, whether ID-based
(e.g. `matrix`, `ExpressionSet`, `SummarizedExperiment`) or range-based
(e.g. `RangedSummarizedExperiment`). Note that when using
the bracket method `[`, the drop argument is *TRUE* by default.

You can subset by rows, columns, and assays in a single bracket operation,
and they will be performed in that order (rows, then columns, then assays).
The following selects the `ENST00000355076` gene across all samples, then the
first two samples of each assay, and finally the Affy and Methyl 450k assays:

```{r}
myMultiAssay["ENST00000355076", 1:2, c("Affy", "Methyl 450k")]
```

## Subsetting by character, integer, and logical

By columns - character, integer, and logical are all allowed, for example:
```{r}
myMultiAssay[, "Jack", ]
myMultiAssay[, 1, ]
myMultiAssay[, c(TRUE, FALSE, FALSE, FALSE), ]
```

By assay - character, integer, and logical are allowed:
```{r}
myMultiAssay[, , "Mirna"]
myMultiAssay[, , 3]
myMultiAssay[, , c(FALSE, FALSE, TRUE, FALSE, FALSE)]
```

## the "drop" argument

Specify `drop=FALSE` to keep assays with zero rows or zero columns, e.g.:

```{r}
myMultiAssay["ENST00000355076", , , drop=FALSE]
```

Using the default `drop=TRUE`, assays with no rows or no columns are removed:

```{r}
myMultiAssay["ENST00000355076", , , drop=TRUE]
```

## More on subsetting by columns

Experimental samples are stored in the rows of `colData` but the columns of
elements of `ExperimentList`, so when we refer to subsetting by columns, we
are referring to columns of the experimental assays. Subsetting by samples /
columns will be more obvious after recalling the `colData`:

```{r}
colData(myMultiAssay)
```

Subsetting by samples identifies the selected samples in rows of the colData
DataFrame, then selects all columns of the `ExperimentList` corresponding to
these rows. Here we use an integer to keep the first two rows of colData, and
all experimental assays associated to those two primary samples:

```{r}
myMultiAssay[, 1:2]
```

Note that the above operation keeps different numbers of columns / samples
from each assay, reflecting the reality that some samples may not have been
assayed in all experiments, and may have replicates in some.

Columns can be subset using a logical vector. Here the dollar sign operator
(`$`) accesses one of the columns in `colData`.

```{r}
malesMultiAssay <- myMultiAssay[, myMultiAssay$sex == "M"]
colData(malesMultiAssay)
```

Finally, for special use cases you can exert detailed control of row or column
subsetting, by using a `list` or `CharacterList` to subset. The following
creates a `CharacterList` of the column names of each assay:

```{r}
allsamples <- colnames(myMultiAssay)
allsamples
```

Now let's get rid of three Methyl 450k arrays, those in positions 3, 4, and 5:

```{r}
allsamples[["Methyl 450k"]] <- allsamples[["Methyl 450k"]][-3:-5]
myMultiAssay[, as.list(allsamples), ]
subsetByColumn(myMultiAssay,  as.list(allsamples))  #equivalent
```

## Subsetting assays

You can select certain assays / experiments using subset, by providing a
character, logical, or integer vector.  An example using character:

```{r}
myMultiAssay[, , c("Affy", "CNV gistic")]
```

You can subset assays also using logical or integer vectors:

```{r}
is.cnv <- grepl("CNV", names(experiments(myMultiAssay)))
is.cnv
myMultiAssay[, , is.cnv]  #logical subsetting
myMultiAssay[, , which(is.cnv)] #integer subsetting
```

## Subsetting rows (features) by IDs, integers, or logicals

Rows of the assays correspond to assay features or measurements, such as genes.
Regardless of whether the assay is ID-based (e.g., `matrix`, `ExpressionSet`) or
range-based (e.g., `RangedSummarizedExperiment`), they can be
subset using any of the following:

- a **character vector** of IDs that will be matched to rownames in each assay

- an **integer vector** that will select rows of this position from each assay.
This probably doesn't make sense unless every `ExperimentList` element
represents the same measurements in the same order and will generate an error
if any of the integer elements exceeds the number of rows in any
`ExperimentList` element. The most likely use of integer subsetting would be
as a `head` function, for example to look at the first 6 rows of each assay.

- a **logical vector** that will be passed directly to the row subsetting
operation for each assay.

- a **list** or **List** with element names matching those in the
`ExperimentList`. Each element of the subsetting list will be passed on exactly
to subset rows of the corresponding element of the `ExperimentList`.

Any `list` or `List` input allows for selective subsetting. The subsetting is
applied only to the matching element names in the `ExperimentList`. For
example, to only take the first two rows of the microRNA dataset, we use a
named `list` to indicate what element we want to subset along with the
`drop = FALSE` argument.

```{r}
myMultiAssay[list(Mirna = 1:2), , ]
## equivalently
subsetByRow(myMultiAssay, list(Mirna = 1:2))
```

Again, these operations always return a `MultiAssayExperiment` class, unless
`drop=TRUE` is passed to the `[` backet subset, with any `ExperimentList`
element not containing the feature having zero rows.

For example, return a MultiAssayExperiment where `Affy` and `Methyl 450k`
contain only "ENST0000035076"" row, and "Mirna" and "CNV gistic" have zero
rows (`drop` argument is set to `FALSE` by default in `subsetBy*`):

```{r}
featSub0 <- subsetByRow(myMultiAssay, "ENST00000355076")
featSub1 <- myMultiAssay["ENST00000355076", , drop = FALSE] #equivalent
all.equal(featSub0, featSub1)
class(featSub1)
class(experiments(featSub1))
experiments(featSub1)
```

In the following, `Affy` `SummarizedExperiment` keeps both rows but with their
order reversed, and `Methyl 450k` keeps only its second row.

```{r}
featSubsetted <-
  subsetByRow(myMultiAssay, c("ENST00000355076", "ENST00000294241"))
assay(myMultiAssay, 1L)
assay(featSubsetted, 1L)
```

## Subsetting rows (features) by `GenomicRanges`

For `MultiAssayExperiment` objects containing range-based objects (currently
`RangedSummarizedExperiment`), these can be subset
using a `GRanges` object, for example:

```{r}
gr <- GRanges(seqnames = c("chr1", "chr1", "chr2"), strand = c("-", "+", "+"),
              ranges = IRanges(start = c(230602, 443625, 934533),
                               end = c(330701, 443724, 934632)))
```

Now do the subsetting. The function doing the work here is
`IRanges::subsetByOverlaps` - see its arguments for flexible types of
subsetting by range. The first three arguments here are for `subset`, the
rest passed on to `IRanges::subsetByOverlaps` through "...":

```{r}
subsetted <- subsetByRow(myMultiAssay, gr, maxgap = 2L, type = "within")
experiments(subsetted)
rowRanges(subsetted[[4]])
```

Square bracket subsetting can still be used here, but passing on arguments to
`IRanges::subsetByOverlaps` through "..." is simpler using `subsetByRow()`.

## Subsetting is endomorphic

`subsetByRow`, `subsetByColumn`, `subsetByAssay`, and square bracket subsetting
are all "endomorphic" operations, in that they always return another
`MultiAssayExperiment` object.

## Double-bracket subsetting to select experiments

A double-bracket subset operation refers to an experiment, and will return
the object contained within an `ExperimentList` element. It is **not**
endomorphic. For example, the first `ExperimentList` element is called "Affy"
and contains a `SummarizedExperiment`:

```{r}
names(myMultiAssay)
myMultiAssay[[1]]
myMultiAssay[["Affy"]]
```

<p style="text-align: right;"> <a href="#header">back to top</a> </p>

# Helpers for data clean-up and management

## `complete.cases`

The `complete.cases` function returns a logical vector of `colData` rows
identifying which primary units have data for all experiments. Recall that
`myMultiAssay` provides data for four individuals:

```{r}
colData(myMultiAssay)
```

Of these, only Jack has data for all 5 experiments:

```{r}
complete.cases(myMultiAssay)
```

But all four have complete cases for Affy and Methyl 450k:

```{r}
complete.cases(myMultiAssay[, , 1:2])
```

This output can be used to select individuals with complete data:

```{r}
myMultiAssay[, complete.cases(myMultiAssay), ]
```

## `replicated`

The `replicated` function identifies `primary` column values or biological units
that have multiple observations per `assay`. It returns a `list` of
`LogicalList`s that indicate what biological units have one or more replicate
measurements. This output is used for merging replicates by default.

```{r}
replicated(myMultiAssay)
```

## `intersectRows`

The `intersectRows` function takes all common rownames across all experiments
and returns a `MultiAssayExperiment` with those rows.

```{r}
(ensmblMatches <- intersectRows(myMultiAssay[, , 1:2]))
rownames(ensmblMatches)
```

## `intersectColumns`

A call to `intersectColumns` returns another `MultiAssayExperiment` where the
columns of each element of the `ExperimentList` correspond exactly to the rows
of `colData`. In many cases, this operation returns a 1-to-1 correspondence of samples to
patients for each experiment assay unless replicates are present in the data.

```{r}
intersectColumns(myMultiAssay)
```

## `mergeReplicates`

The `mergeReplicates` function allows the user to specify a function (default:
`mean`) for combining replicate columns in each assay element. This can be
combined with `intersectColumns` to create a `MultiAssayExperiment` object with
one measurement in each experiment per biological unit.

```{r}
mergeReplicates(intersectColumns(myMultiAssay))
```

## combine `c`

The combine `c` function allows the user to append an experiment to the list
of experiments already present in `MultiAssayExperiment`.
In the case that additional observations on the same set of samples were
performed, the `c` function can conveniently be referenced to an existing assay
that contains the same ordering of sample measurements.

The `mapFrom` argument indicates what experiment has the exact same
organization of samples that will be introduced by the new experiment
dataset. If the number of columns in the new experiment do not match
those in the reference experiment, an error will be thrown.

Here we introduce a toy dataset created on the fly:

```{r}
c(myMultiAssay, ExpScores = matrix(1:8, ncol = 4,
dim = list(c("ENSMBL0001", "ENSMBL0002"), paste0("pt", 1:4))),
mapFrom = 1L)
```

_Note_: Alternatively, a `sampleMap` for the additional dataset can be provided.

<p style="text-align: right;"> <a href="#header">back to top</a> </p>

# Extractor functions

Extractor functions convert a `MultiAssayExperiment` into other forms that are
convenient for analyzing. These would normally be called after any desired
subsetting has been performed.

## `getWithColData`

Provides a single assay along with any associated 'colData' columns while
keeping the assay class constant.

```{r}
(affex <- getWithColData(myMultiAssay, 1L))
colData(affex)
class(affex)
```

It will error when the target data class does not support a `colData`
replacement method, meaning that it typically works with `SummarizedExperiment`
and `RaggedExperiment` assays and their extensions.

## `longFormat` & `wideFormat`

Produces *long* (default) or *wide* `DataFrame` objects. The following
produces a long `DataFrame` (the default) for the first two assays:

```{r}
longFormat(myMultiAssay[, , 1:2])
```

This is especially useful for performing regression against patient or sample
data from `colData` using the `pDataCols` argument:

```{r}
longFormat(myMultiAssay[, , 1:2], colDataCols="age")
```

The "wide" format is useful for calculating correlations or performing
regression against different genomic features. Wide format is in general not
possible with replicate measurements, so we demonstrate on the cleaned
`MultiAssayExperiment` for the first 5 columns:

```{r}
maemerge <- mergeReplicates(intersectColumns(myMultiAssay))
wideFormat(maemerge, colDataCols="sex")[, 1:5]
```

## `assay` / `assays`

The `assay` (singular) function takes a particular experiment and returns
a matrix. By default, it will return the _first_ experiment as a matrix.

```{r}
assay(myMultiAssay)
```

The `assays` (plural) function returns a `SimpleList` of data matrices from the
`ExperimentList`:

```{r}
assays(myMultiAssay)
```

# The Cancer Genome Atlas and MultiAssayExperiment

Our most recent efforts include the release of the experiment data package,
`curatedTCGAData`. This package will allow users to selectively download
cancer datasets from The Cancer Genome Atlas (TCGA) and represent the data
as `MultiAssayExperiment` objects. Please see the package vignette for more
details.

```{r, eval = FALSE}
BiocManager::install("curatedTCGAData")
```

# Dimension names: `rownames` and `colnames`

`rownames` and `colnames` return a `CharacterList` of row names and column names
across all the assays.  A `CharacterList` is an efficient alternative to
`list` used when each element contains a character vector. It also provides a
nice show method:

```{r}
rownames(myMultiAssay)
colnames(myMultiAssay)
```

<p style="text-align: right;"> <a href="#header">back to top</a> </p>

# Requirements for support of additional data classes

Any data classes in the `ExperimentList` object must support the following
methods:

* `dimnames`
* `[`
* `dim()`

Here is what happens if one of the methods doesn't:

```{r}
objlist2 <- objlist
objlist2[[2]] <- as.vector(objlist2[[2]])

try(MultiAssayExperiment(objlist2, patient.data, dfmap),
    outFile = stdout())
```

# Application Programming Interface (API)

For more information on the formal API of `MultiAssayExperiment`, please see
the [API wiki][] document on GitHub. An API package is available for
download on GitHub via `install("waldronlab/MultiAssayShiny")`. It provides
visual exploration of available methods in `MultiAssayExperiment`.

<p style="text-align: right;"> <a href="#header">back to top</a> </p>

# Methods for MultiAssayExperiment

The following methods are defined for `MultiAssayExperiment`:

```{r}
methods(class="MultiAssayExperiment")
```
# Citing MultiAssayExperiment

We are excited to announce the official citation for MultiAssayExperiment
in *Cancer Research*.

```{r}
citation("MultiAssayExperiment")
```

# sessionInfo()

```{r}
sessionInfo()
```

<p style="text-align: right;"> <a href="#header">back to top</a> </p>

[API wiki]: https://github.com/waldronlab/MultiAssayExperiment/wiki/MultiAssayExperiment-API