---
title: "flowWorkspace Introduction: A Package to store and maninpulate gated flow data"
output:
  html_document:
    number_sections: yes
    theme: united
    toc: yes
    toc_float: yes
  pdf_document:
    toc: yes
author: Greg Finak <gfinak@fhcrc.org>, Mike Jiang <wjiang2@fhcrc.org>
vignette: >    
  %\VignetteKeywords{flow cytometry, single cell assay, import}
  %\VignettePackage{flowWorkspace}  
  %\VignetteIndexEntry{flowWorkspace Introduction: A Package to store and maninpulate gated flow data}
  %\VignetteEngine{knitr::rmarkdown}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, results = "markup", message = FALSE)
```


## Purpose
The purpose of this package is to provide the infrastructure to store, represent and exchange gated flow data. By this we mean accessing the samples, groups, transformations, compensation matrices, gates, and population statistics in the gating tree, which is represented as a `GatingSet` object in `R`.

The `GatingSet` can be built from scratch within `R` or imported from `flowJo` XML workspaces (i.e. `.xml` or `.wsp` files) or `GatingML` files . 
Note that we cannot import `.jo` files directly. You will have to save them in XML workspace format.

## Import flowJo workspace
The following section walks through opening and importing a flowJo workspace.

### Opening a Workspace
We represent flowJo workspaces using `flowjo_workspace` objects. We only need to know the path to, and filename of the flowJo workspace.

```{r findxml}
library(flowWorkspace)
path <- system.file("extdata",package="flowWorkspaceData")
wsfile <- list.files(path, pattern="A2004Analysis.xml", full = TRUE)
```

In order to open this workspace we need `CytoML` package:
```{r openws,results='markup'}
library(CytoML)
ws <- open_flowjo_xml(wsfile)
ws
```

We see that this a version 2.0 workspace file. It's location and filename are printed. Additionally, you are notified that the workspace file is open. This refers to the fact that the XML document is internally represented using 'C' data structures from the `XML` package. After importing the file, the workspace must be explicitly closed using `flowjo_ws_close()` in order to free up that memory.

For example, the list of samples in a workspace can be accessed by:
```{r getsamples-ws}
fj_ws_get_samples(ws)
```
The `compID` column tells you which compensation matrix to apply to a group of files, and similarly, based on the name of the compensation matrix, which transformations to apply.


And the groups can be accessed by:
```{r getgroups}
fj_ws_get_sample_groups(ws)
```

Keywords stored in an XML workspace can also retrieved by:
```{r}
sn <- "a2004_O1T2pb05i_A1_A01.fcs"
fj_ws_get_keywords(ws, sn)[1:5]
```

### Parsing the Workspace
These are all retrieved by directly querying xml file. In order to get more information about the gating tree, we need to actually parse the XML workspace into R data structures to represent some of the information therein. Specifically, by calling `flowjo_to_gatingset()` the user will be presented with a list of `groups` in the workspace file and need to choose one group to import. Why only one? Because of the way flowJo handles data transformation and compensation. Each group of samples is associated with a compensation matrix and specific data transformation. These are applied to all samples in the group. When a particular group of samples is imported, the package generates a `GatingHierarchy` for each sample, describing the set of gates applied to the data (note: polygons, rectangles, quadrants, and ovals and boolean gates are supported). The set of GatingHierarchies for the group of samples is stored in a `GatingSet` object. Calling `flowjo_to_gatingset()` is quite verbose, informing the user as each gate is created. The parsing can also be done non--interactively by specifying which group to import directly in the function call (either an index or a group name). An additional optional argument `execute=T/F` specifies whether you want to load, compensate, transform the data and compute statistics immediately after parsing the XML tree. Argument `path` can be used to specify where the FCS files are stored.

```{r parsews,message=FALSE, warning=FALSE}
gs <- flowjo_to_gatingset(ws,name = 1); #import the first group
#Lots of output here suppressed for the vignette.
gs
```
We have generated a `GatingSet` with `r length(gs)` samples, each  of which has `r length(gs_get_pop_paths(gs))-1` associated gates.

To list the samples stored in `GatingSet`:
```{r sampleNames}
sampleNames(gs)
```
Note that it is different from the previous call `fj_ws_get_samples` on `workspace` where the latter list all samples stored in `xml` file and here are the ones actually get parsed. Because sometime not all of samples in xml will be imported for various reason. Also we've seen an extra string `_xxx` is attached to the end of each sample name. It is due to the argument `additional.keys` has the default value set to `'$TOT'`. See more details on these parsing options from [How to parse a flowJo workspace](http://bioconductor.org/packages/release/bioc/vignettes/flowWorkspace/inst/doc/HowToParseFlowJo.html).

## Import gatingML
We currently support gatingML2.0 files exported from the `Cytobank` system. Parsing can be done with one convenient function, `cytobank2GatingSet` from the `CytoML` package, that simply takes file paths of `gatingML` and `FCS`.

```{r parseGatingML, eval=FALSE}
xmlfile <- system.file("extdata/cytotrol_tcell_cytobank.xml", package = "CytoML")
fcsFiles <- list.files(pattern = "CytoTrol", 
                       system.file("extdata", package = "flowWorkspaceData"), full = T)
gs1 <- cytobank2GatingSet(xmlfile, fcsFiles)
```
If you want to dive into the details and sub-steps of the parsing process, see the vignette of [CytoML](https://github.com/RGLab/CytoML).
  
## Basics on GatingSet

Subsets of a `GatingSet` can be accessed using the standard R subset syntax `[`.
```{r subset}
gs[1]
```
At this point we have parsed the workspace file and generated the gating hierarchy associated with each sample imported from the file. The data have been loaded, compensated, and transformed in the workspace, and gating has been executed. The resulting `GatingSet` contains a replicated analysis of the original flowJo workspace. It should be noted, however, that because `GatingSet` is a purely reference class, this sort of subsetting does not copy the underlying data but rather utilizes a view of it.

We can plot the gating tree:
```{r plotTree}
plot(gs)
```

We can list the nodes (populations) in the gating hierarchy:
```{r gs_get_pop_paths-path-1}
gs_get_pop_paths(gs, path = 1)
```
Note that the `path` argument specifies the depth of the gating path for each population. 
As shown, `depth` of `1` (i.e. leaf or terminal node name) may not be sufficient to uniquely identify each population. The issue can be resolved by increasing the `path` or simply returning the full path of the node:
```{r gs_get_pop_paths-path-full}
gs_get_pop_paths(gs, path = "full")
```
But `full` path may not be neccessary and could be too long to be visualized. So we provide the `path = 'auto'` option to determine the shortest path that is still unique within the gating tree.
```{r gs_get_pop_paths-path-auto}
nodelist <- gs_get_pop_paths(gs, path = "auto")
nodelist
```

We can get the gate associated with the specific population:

```{r gh_pop_get_gate}
node <- nodelist[3]
g <- gs_pop_get_gate(gs, node)
g
```
We can retrieve the population statistics :
```{r getStats}
gs_pop_get_count_fast(gs)[1:10,]
```
We can plot individual gates. Note the scale of the transformed axes.
The second argument is the node path of any depth as long as it is uniquely identifiable.
```{r autoplot-nodeName}
library(ggcyto)
autoplot(gs, "pDC")
```
More details about gate visualization can be found [here](http://bioconductor.org/packages/release/bioc/html/ggcyto.html).

If we have  metadata associated with the experiment, it can be attached to the `GatingSet`.
```{r annotate}
d <- data.frame(sample=factor(c("sample 1", "sample 2")),treatment=factor(c("sample","control")) )
pd <- pData(gs)
pd <- cbind(pd,d)
pData(gs) <- pd
pData(gs)
```
We can subset the `GatingSet` by its `pData` directly:
```{r}
subset(gs, treatment == "control")
```

The underlying `flow data` can be retrieved by:
```{r}
cs <- gs_pop_get_data(gs)
class(cs)
nrow(cs[[1]])
```
Because `GatingSet` is a purely reference class, the class type returned by `getData` is a `cytoset`, which is the purely reference class analog of a `flowSet` and will be discussed in more detail below. Also note that the data is already compensated and transformed during the parsing.

We can retrieve the subset of data associated with a population node:
```{r getData-gh}
cs <- gs_pop_get_data(gs, node)
nrow(cs[[1]])
```
## GatingHierarchy
We can retrieve a single gating hierarchical tree (corresponding to one sample) by using the `[[` extraction operator
```{r gh}
gh <- gs[[1]]
gh
```
Note that the index can be either numeric or character (the `guid` returned by the `sampleNames` method)

We can do similar operations on this `GatingHierarchy` object and the same methods behave differently from `GatingSet` 
```{r}
head(gh_pop_compare_stats(gh))
```
Here `gh_pop_compare_stats` returns both the stats directly stored in `flowJo` xml workspace and one calcuated by `GatingSet` through the gating. There are could be minor difference between the two due to the numerical errors. However the difference should not be significant. Therore this can be used as the validity check for the parsing accuracy.
```{r gh_plot_pop_count_cv}
gh_plot_pop_count_cv(gh)
```

The `autoplot` method without specifying any node will lay out all the gates in the same plot
```{r}
autoplot(gh)
```

We can retrieve the indices specifying if an event is included inside or outside a gate using:
```{r getInd}
table(gh_pop_get_indices(gh,node))
```
The indices returned are relative to the parent population (member of parent AND member of current gate), so they reflect the true hierarchical gating structure.

We can retrieve all the compensation matrices from the `GatingHierarchy` in case we wish to use the compensation or transformation for the new data, 

```{r getCMAT}
C <- gh_get_compensations(gh);
C
```

Or we can retrieve transformations:

```{r getTrans,results='markup'}
T <- gh_get_transformations(gh)
names(T)
T[[1]]
```

`gh_get_transformations` returns a list of functions to be
applied to different dimensions of the data. 
Above, the transformation is applied to this sample, the appropriate dimension
is transformed using a channel--specific function from the list.


## Build the gating hierarchy from scratch
`GatingSet` provides methods to build a gating tree from raw FCS files and add or remove flowCore gates (or populations) to or from it.

We start from a `flowSet` that contains three ungated flow samples:
```{r create gs}
library(flowCore)
data(GvHD)
#select raw flow data
fs <- GvHD[1:2]
```
Then construct a `GatingSet` from the `flowSet`:
```{r GatingSet constructor}
gs <- GatingSet(fs)
```

Then compensate it:
```{r compensate}
cfile <- system.file("extdata","compdata","compmatrix", package="flowCore")
comp.mat <- read.table(cfile, header=TRUE, skip=2, check.names = FALSE)
## create a compensation object 
comp <- compensation(comp.mat)
#compensate GatingSet
gs <- compensate(gs, comp)
```

<span style = "color:red">**New**: You can now pass a list of `compensation` objects with elements named by `sampleNames(gs)` to achieve sample-specific compensations. e.g. </span>
```{r eval=FALSE}
gs <- compensate(gs, comp.list)
```


Then we can transform it with any transformation defined by the user through `trans_new` function of `scales` package.
```{r user-transformation}
require(scales)
trans.func <- asinh
inv.func <- sinh
trans.obj <- trans_new("myAsinh", trans.func, inv.func)
```
The `inverse` transformation is required so that the gates and data can be visualized in `transformed` scale while the axis label still remains in the raw scale. Optionally, the `breaks` and `format` functions can be supplied to further customize the appearance of axis labels.

Besides doing all these by hand, we also provide some buildin transformations: `asinhtGml2_trans`, `flowjo_biexp_trans`, `flowjo_fasinh_trans` and `logicle_trans`. These are all very commonly used transformations in flow data analysis. User can construct the transform object by simply one-line of code. e.g.
```{r transform-build-in}
trans.obj <- asinhtGml2_trans()
trans.obj
```

Once a `transformer` object is created, we must convert it to `transformerList` for `GatingSet` to use.
```{r transformerList}
chnls <- colnames(fs)[3:6] 
transList <- transformerList(chnls, trans.obj)
```
Alternatively, the overloaded `estimateLogicle` method can be used directly on `GatingHierarchy` to generate a `transformerList` object automatically.
```{r estimateLogicle}
estimateLogicle(gs[[1]], chnls)
```

Now we can transform our `GatingSet` with this `transformerList` object. It will also store the transformation in the `GatingSet` and can be used to inverse-transform the data.
```{r transform-gs}
gs <- transform(gs, transList)
gs_get_pop_paths(gs) 
```
It now only contains the root node. We can add our first `rectangleGate`:
```{r add-rectGate}
rg <- rectangleGate("FSC-H"=c(200,400), "SSC-H"=c(250, 400), filterId="rectangle")
nodeID <- gs_pop_add(gs, rg)
nodeID
gs_get_pop_paths(gs)  
```
Note that the gate is added to the root node by default if the parent is not specified.
Then we add a `quadGate` to the new population generated by the `rectangleGate` which is named after the `filterId` of the gate because the name was not specified when the `add` method was called.
```{r add-quadGate}
qg <- quadGate("FL1-H"= 0.2, "FL2-H"= 0.4)
nodeIDs <- gs_pop_add(gs,qg,parent="rectangle")
nodeIDs 
gs_get_pop_paths(gs)
```
Here `quadGate` produces four population nodes/populations named after the dimensions of the gate if names are not specified.

A Boolean gate can also be defined and added to GatingSet:
```{r add-boolGate}
bg <- booleanFilter(`CD15 FITC-CD45 PE+|CD15 FITC+CD45 PE-`)
bg
nodeID2 <- gs_pop_add(gs,bg,parent="rectangle")
nodeID2
gs_get_pop_paths(gs)
```
The gating hierarchy is plotted by:
```{r plot-gh}
plot(gs, bool=TRUE)
```
Note that Boolean gate is skipped by default and thus needs to be enabled explictily.

Now all the gates are added to the gating tree but the actual data is not gated yet.
This is done by calling the `recompute` method explictily:
```{r recompute}
recompute(gs)
```
After gating is finished, gating results can be visualized by the `autoplot` method:
```{r autoplot-rect}
autoplot(gs,"rectangle") #plot one Gate
```
Multiple gates can be plotted on the same panel:
```{r autoplot-multiple}
autoplot(gs, gs_pop_get_children(gs[[1]], "rectangle")[1:4])
```
We may also want to plot all the gates without specifying the gate index:
```{r autoplot-gh-bool}
autoplot(gs[[1]])
```

If we want to remove one node, simply:
```{r rm}
Rm('rectangle', gs)
gs_get_pop_paths(gs)
```
As we see, removing one node causes all its descendants to be removed as well.

### Archive and Clone

Oftentimes, we need to save a `GatingSet` including the gated flow data, gates, and populations to disk and reload it later on. This can be done by:
```{r archive,eval=FALSE}
tmp <- tempdir()
save_gs(gs,path = file.path(tmp,"my_gs"))
gs <- load_gs(file.path(tmp,"my_gs"))
``` 
We also provide the `clone` method to make a full copy of an existing `GatingSet`:
```{r clone,eval=FALSE}
gs_gs_cloned <- clone(gs)
```

Note that the `GatingSet` is a purely reference class with an external pointer that points to the internal 'C' data structure. So make sure to use these methods in order to save or make a copy of an existing `GatingSet` object. The regular R assignment (<-) or `save` routine doesn't work as expected for `GatingSet` objects.

## The cytoframe and cytoset classes

The `GatingSet` class no longer uses `flowFrame` and `flowSet` objects for containing the underlying flow data, but rather now uses the analogous `cytoframe` and `cytoset` classes. `cytoframe` and `cytoset` are essentially reference classes with pointers to internal 'C' data structures and thus enable `GatingSet` operations to be performed more efficiently.

While working with `GatingSet` objects will often entail working with `cytoframe` and `cytoset` objects implicitly, it is also possible to directly work with objects of both of these classes.

### Reading a `cytoframe`

Instead of `read.FCS()`, `cytoframe` objects can be created from FCS files with the `load_cytoframe_from_fcs()` method. The optional `num_threads` argument allows for parallelization of the read operation.
```{r load_cf}
files <- list.files(path, "Cyto", full.names = TRUE)
cf <- load_cytoframe_from_fcs(files[1], num_threads = 4)
cf
```

Instead of using `read.FCSheader()` to obtain only the header of the file, just use the `text.only` argument to `load_cytoframe_from_fcs()`.
```{r load_cf_header}
cfh <- load_cytoframe_from_fcs(files[1], text.only = TRUE)
cfh
```

### `cytoframe` Accessors 

The accessor methods function the same as they would for a `flowFrame`
```{r dim_cf}
dim(cf)
```
```{r colnames_cf}
colnames(cf)
```
```{r exprs_cf}
head(exprs(cf))
```
```{r spill_cf}
spillover(cf)
```
```{r keys_cf}
head(keyword(cf))
```

### Pass By Reference

As `cytoframe` and `cytoset` are reference classes, copying objects of either class by the assignment operator (`<-`) will simply provide a copy of the external pointer and so changes made to the copy will also affect the original object. 

```{r ref1_cf}
cf1 <- cf # cf is a reference
colnames(cf1)[1]
```
```{r ref2_cf}
colnames(cf1)[1] <- "t"
colnames(cf)[1] # The change affects the original cf object
```

### Views

Extracting a subset of a `cytoframe` is not computationally intensive, as it merely constructs a view of the data of the original `cytoframe`. However, both objects still share the same underlying pointer to all of the data and thus changes to a view will affect the data of the original `cytoframe`.

```{r view_cf}
cf1 <- cf[1:10, 2:3]
dim(cf1)
exprs(cf)[2,3]
exprs(cf1)[2,2] <- 0 # data change affects the orignal cf
exprs(cf)[2,3]
```
To construct a new view of an entire `cytoframe`, use the `[]` method rather than the `<-` operator. This will ensure that a new view is created to the full underlying dataset.
```{r shallow_cf}
cf1 <- cf[]
```

### Deep Copy

It is also possible to perform a deep copy of a `cytoframe` or a view of it, resulting in two objects pointing to distinct C-level representations of the data. This is accomplished with the `realize_view` method.
```{r deep_cf}
cf <- load_cytoframe_from_fcs(files[1], num_threads = 4) # starting fresh
cf1 <- realize_view(cf[1:10, 2:3])
dim(cf1)
exprs(cf)[2,3]
exprs(cf1)[2,2] <- 0 # data change no longer affects the original cf
exprs(cf)[2,3]
exprs(cf1)[2,2] # but does affect the separate data of cf1
```
Similarly, if a deep copy of all of the data is desired (not a subset), simply call `realize_view` on the original `cytoframe`.

### Interconversion between `cytoframe` and `flowFrame`

Conversion of objects between the `cytoframe` and `flowFrame` classes is accomplished with a few coercion methods
```{r coerce_cf}
fr <- cytoframe_to_flowFrame(cf)
class(fr)
cf_back <- flowFrame_to_cytoframe(fr)
class(cf_back)
```
Of course (as a side note), here `flowFrame_to_cytoframe()` had no knowledge of the `cytoframe` origin of `fr`, so `cf_back` points to a new copy of the underlying data.
```{r pnt_cmp}
identical(cf@pointer, cf_back@pointer) # These point to distinct copies of the data
```

### Saving/Loading a `cytoframe` in h5

A couple of methods handle the task of writing or reading a `cytoframe` in the HDF5 format on disk
```{r h5}
tmpfile <- tempfile(fileext = ".h5")
cf_write_h5(cf, tmpfile)
loaded <- load_cytoframe_from_h5(tmpfile)
```

### `cytoset` methods

Most of the above methods for `cytoframe` objects have `cytoset` analogs.

For reading in a `cytoset` from FCS files, use `load_cytoset_from_fcs`
```{r load_cs}
files <- list.files(path, "Cyto", full.names = TRUE)
cs <- load_cytoset_from_fcs(files, num_threads = 4)
cs
```
Once constructed, it can be saved/loaded through more efficient archive format. 
```{r}
tmp <- tempfile()
save_cytoset(cs, tmp)
cs <- load_cytoset(tmp, h5_readonly = FALSE)
```
note that `h5_readonly` is set to `TRUE` by default to protect the data from accidental changes. So it has to be turned off explicitly if your want to modify the loaded `cs` 

The accessor methods function the same as they would for a `flowSet`
```{r colnames_cs}
colnames(cs)
```

Subsetting using `[` will work in a manner similar to that for a `flowSet`, but will result in another `cytoset` that is a view in to the data of the original `cytoset`. The `Subset()` method, when called on a `cytoset`, will also return a `cytoset` that is a view in to the orignal data rather than a deep copy.

```{r subset_cs}
sub_cs <- cs[1]
```

**Important:** Perhaps unexpectedly, extraction using `[[` on a `cytoset` will by default return a `flowFrame` instead of a `cytoframe` and so will represent a copy of the underlying data rather than a view. Thus, altering the result of the extraction **will not** alter the underlying data of the original `cytoset`. This is done for the sake of backwards compatibility with older user scripts.

```{r extract1_cf}
sub_fr <- cs[[1]]
exprs(cs[[1]])[2,2]
exprs(sub_fr)[2,2] <- 0 # This WILL NOT affect the original data
exprs(cs[[1]])[2,2]
```

To circumvent this behavior and instead return a `cytoframe` that represents a view in to the data of the `original` cytoset, you need to use the `returnType` argument.

```{r extract2_cf}
sub_cf <- cs[[1, returnType = "cytoframe"]]
exprs(cs[[1]])[2,2]
exprs(sub_cf)[2,2] <- 0 # This WILL affect the original data
exprs(cs[[1]])[2,2]
```

Alternatively, if it is easier to remember, `get_cytoframe_from_cs` will accomplish the same goal
```{r extract3_cf}
sub_cf <- get_cytoframe_from_cs(cs,1)
```

Finally, the `[]` and `realize_view()` methods work in a similar manner for `cytoset` objects as `cytoframe` objects. `[]` will return a view in to the original data while `realize_view()` will perform a deep copy.

## Troubleshooting and error reporting
If this package is throwing errors when parsing your workspace, contact the package author by emails for post an issue on https://github.com/RGLab/flowWorkspace/issues. If you can send your workspace by email, we can test, debug, and fix the package so that it works for you. Our goal is to provide a tool that works, and that people find useful.