1 Rants
2 Olde school / neue school
- 2.1 Continuous integration
- 2.2 Do I / Will I use…
3 Acknowledgements

1 Rants

1.1 Classic, tidy, rich data representations

Simple example: classic vs. tidy

aggregate(hp ~ cyl, mtcars, mean)                     # classic

##   cyl        hp
## 1   4  82.63636
## 2   6 122.28571
## 3   8 209.21429

library(tidyverse)
mtcars %>% group_by(cyl) %>% summarize(hp = mean(hp)) # tidy

## # A tibble: 3 x 2
##     cyl        hp
##   <dbl>     <dbl>
## 1     4  82.63636
## 2     6 122.28571
## 3     8 209.21429

Biological example

library(airway)
data(airway)
airway                                   # rich

## class: RangedSummarizedExperiment 
## dim: 64102 8 
## metadata(1): ''
## assays(1): counts
## rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
## rowData names(0):
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample

hist(log10(rowMeans(assay(airway))))

library(reshape2)
library(tibble)
tbl <- as_tibble(melt(
    assay(airway), c("Gene", "Sample"), value.name="Count"
))
tbl                                      # tidy (incomplete -- no metadata)

## # A tibble: 512,816 x 3
##               Gene     Sample Count
##             <fctr>     <fctr> <int>
##  1 ENSG00000000003 SRR1039508   679
##  2 ENSG00000000005 SRR1039508     0
##  3 ENSG00000000419 SRR1039508   467
##  4 ENSG00000000457 SRR1039508   260
##  5 ENSG00000000460 SRR1039508    60
##  6 ENSG00000000938 SRR1039508     0
##  7 ENSG00000000971 SRR1039508  3251
##  8 ENSG00000001036 SRR1039508  1433
##  9 ENSG00000001084 SRR1039508   519
## 10 ENSG00000001167 SRR1039508   394
## # ... with 512,806 more rows

tbl %>% group_by(Gene) %>% summarize(mean(Count))

## # A tibble: 64,102 x 2
##               Gene `mean(Count)`
##             <fctr>         <dbl>
##  1 ENSG00000000003       741.875
##  2 ENSG00000000005         0.000
##  3 ENSG00000000419       534.875
##  4 ENSG00000000457       242.000
##  5 ENSG00000000460        58.375
##  6 ENSG00000000938         0.375
##  7 ENSG00000000971      6034.750
##  8 ENSG00000001036      1305.000
##  9 ENSG00000001084       615.125
## 10 ENSG00000001167       392.500
## # ... with 64,092 more rows

Context

Classic: sample x feature data.frame().
Tidy: long-form data.frame(); see the tidyverse
Rich: domain-specific, object oriented.

Vocabulary

Simple: extensive
Tidy: restricted ‘endomorphisms’
Rich: extensive, ‘meaningful’

Constraints (e.g., probes & samples)

Tidy: implicit
Simple, Rich: explicit

Flexibility

Simple, tidy: general-purpose
Rich: specialized

Programming contract

Simple, tidy: limited
Rich: strict

Lessons learned / best practices

Considerable value in semantically rich structures
Current implementations trade-off user and developer convenience
Endomorphism, simple vocabulary, consistent paradigm aid use

1.2 Scaling and performance

library(BiocFileCache)
library(TENxGenomics)              # https://github.com/mtmorgan/TENxGenomics
fl <- bfcrpath(BiocFileCache(), "1M_neurons_filtered_gene_bc_matrices_h5.h5")

## Warning: call dbDisconnect() when finished working with a connection

TENxGenomics(fl)                          # big!

## class: TENxGenomics 
## h5path: /home/mtmorgan/.cache/BiocFileCache/38e45fadc64 
## dim(): 27998 x 1306127

m = as.matrix(TENxMatrix(fl)[, 1:1000])   # first 1000 cells
sum(m == 0) / length(m)                   # % zeros

## [1] 0.9276541

sum(rowSums(m == 0) == ncol(m)) / nrow(m) # % genes with no expression

## [1] 0.4264947

hist(log10(1 + colSums(m)))               # reads per cell

Write efficient R code first (10 - 100x speedup, often cleaner and more robust)
Scale to cores, cluster, cloud when needed
- 10x (cores) - 100x (large clusters / clouds) ‘faster’
- 10x - 100x more complicated and fragile!
Large data requires different approach – illusion of in-memory, iteration

1.3 Visualization, interactivity, …

library(ggplot2)
library(plotly)
static <- mtcars %>% mutate(cyl = factor(cyl)) %>%
    ggplot(aes(x=qsec, y=mpg)) +
        geom_point(aes(color=cyl))
ggplotly(static)

plotly::ggplotly() makes interactive visualization more-or-less a no-brainer.
shiny is also ‘fun’ and ‘easy’ interactivity.
- Especially valuable for throw-away applications, e.g., for use with a local collaborator for data exploration.
- More ambitious problems must be built on top of robust, testable, functions and script-based work flows implemented as fully documented and tested R packages.

2 Olde school / neue school

2.1 Continuous integration

Commits trigger a script to build and check your package
Continuous integration on Linux and Windows platforms via (mostly) GitHub webhooks or integrations
- Travis-CI: Linux setup for continuous integration
- AppVeyor: Windows instance for package testing (available for 32 & 64 bit architectures)
- Wercker: Docker-container-based integration option may work with already available Bioconductor Docker images
- Codeship: Supports bitbucket
- R-Hub: Integrated with R, still in development
The first three work with package-level YAML files to setup an integration environment. Cost-effective option for cross-platform testing.

2.2 Do I / Will I use…

devtools create(), load_all(), check()

Yes, until I run into mysterious S4 / namespace problems. These are often devtools limitations.

roxygen2 (e.g., devtools::document())

Yes, but I pay a lot of attention to consolidating documentation on fewer pages – not ‘one function, one man page’ – and end up writing important parts of the docs in Rd-style.

testthat (e.g., devtools::use_testthat(), devtools::test())

Yes, unit tests are essential, and this is a nice package.

Rcpp

No, but I feel like I should…
STL containers and algorithms can help a lot with formulating basic solutions.
Effective Rcpp use (given my current state of R’s C API) requires a deeper understanding of C++.

Pipes and lazy evaluation (e.g., magrittr, %>%)

No, not for robust code development.
Emphasizes ‘endomorphisms’ (function returns same object as first argument), which is probably good.

S3 classes or S4 ‘reference’ classes

Only under limited circumstances.
S3: internal and lightweight, or for interoperability.
S4 ‘reference’ classes – when reference-based semantics are appropriate – almost never with data objects; sometimes it seems like files and connections have reference-based semnatics, but I often try this and ‘get it wrong’…

Other class systems

No, R is weird enough.

git and GitHub

Yes, git’s local repositories and branching are great.
GitHub’s issue tracker and pull requests are useful for relatively small collections of repositories.

Continuous integration

Yes, but I’ll still rely on local R CMD build / R CMD check before pushing commits.

(R / Bioconductor) Docker and AMI containers

No, not for my daily work; it pays to have a fully function native installation.
AMIs are great for courses.

3 Acknowledgements

Research reported in this course was supported by the National Human Genome Research Institute and the National Cancer Institute of the National Institutes of Health under award numbers U41HG004059 and U24CA180996.

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 633974)

Doing it better

June 19 - 20, 2017

Contents