Brings SummarizedExperiment to the tidyverse!

website: stemangiola.github.io/tidySummarizedExperiment/

Please also have a look at:

tidySCE for tidy manipulation of SingleCellExperiment objects
tidyseurat for tidy manipulation of Seurat objects
tidybulk for tidy high-level data analysis and manipulation
nanny for tidy high-level data analysis and manipulation
tidygate for adding custom gate information to your tibble
tidyHeatmap for heatmaps produced with tidy principles

Introduction

tidySummarizedExperiment provides a bridge between Bioconductor SummarizedExperiment [@morgan2020summarized] and the tidyverse [@wickham2019welcome]. It creates an invisible layer that lets you view a Bioconductor SummarizedExperiment object as a tidyverse tibble, and it provides SummarizedExperiment-compatible dplyr, tidyr, ggplot2 and plotly functions. Users get the best of both the Bioconductor and tidyverse worlds.

Functions/utilities available

SummarizedExperiment-compatible Functions	Description
`all`	After all, a `tidySummarizedExperiment` is a SummarizedExperiment object, just better

tidyverse Packages	Description
`dplyr`	Almost all `dplyr` APIs, as for any tibble
`tidyr`	Almost all `tidyr` APIs, as for any tibble
`ggplot2`	`ggplot`, as for any tibble
`plotly`	`plot_ly`, as for any tibble

Utilities	Description
`tidy`	Add `tidySummarizedExperiment` invisible layer over a SummarizedExperiment object
`as_tibble`	Convert feature- and sample-wise information to a `tbl_df`

Installation

if (!requireNamespace("BiocManager", quietly=TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("tidySummarizedExperiment")

From GitHub (development)

devtools::install_github("stemangiola/tidySummarizedExperiment")

Load libraries used in the examples.

library(ggplot2)
library(tidySummarizedExperiment)

Create `tidySummarizedExperiment`, the best of both worlds!

This is a SummarizedExperiment object, but it is displayed and manipulated as a tibble, so it is fully compatible with both the SummarizedExperiment and tidyverse APIs.

pasilla_tidy <- tidySummarizedExperiment::pasilla

It looks like a tibble

pasilla_tidy

## # A SummarizedExperiment-tibble abstraction: 102,193 x 5
## # Transcripts=14599 | Samples=7 | Assays=counts
##    feature     sample counts condition type      
##    <chr>       <chr>   <int> <chr>     <chr>     
##  1 FBgn0000003 untrt1      0 untreated single_end
##  2 FBgn0000008 untrt1     92 untreated single_end
##  3 FBgn0000014 untrt1      5 untreated single_end
##  4 FBgn0000015 untrt1      0 untreated single_end
##  5 FBgn0000017 untrt1   4664 untreated single_end
##  6 FBgn0000018 untrt1    583 untreated single_end
##  7 FBgn0000022 untrt1      0 untreated single_end
##  8 FBgn0000024 untrt1     10 untreated single_end
##  9 FBgn0000028 untrt1      0 untreated single_end
## 10 FBgn0000032 untrt1   1446 untreated single_end
## # … with 40 more rows
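The row count in the header follows from the long-format view: every transcript is paired with every sample, so 14,599 transcripts × 7 samples gives 102,193 rows, and each row carries its sample annotation. The following is a minimal base-R sketch of that reshaping on a made-up 3 × 2 count matrix. It only illustrates the idea and is not how the package implements the abstraction; the names counts_matrix, sample_info and long_view are hypothetical.

```r
# Toy count matrix: 3 features (rows) x 2 samples (columns); values are made up
counts_matrix <- matrix(
    c(0L, 92L, 5L, 3L, 80L, 7L),
    nrow = 3,
    dimnames = list(c("gene_a", "gene_b", "gene_c"), c("s1", "s2"))
)

# Toy sample annotation, one row per sample
sample_info <- data.frame(
    sample = c("s1", "s2"),
    condition = c("untreated", "treated"),
    stringsAsFactors = FALSE
)

# Long format: one row per feature-sample pair.
# as.vector() unrolls the matrix column by column, so features cycle
# within each sample.
long_view <- data.frame(
    feature = rep(rownames(counts_matrix), times = ncol(counts_matrix)),
    sample = rep(colnames(counts_matrix), each = nrow(counts_matrix)),
    counts = as.vector(counts_matrix),
    stringsAsFactors = FALSE
)

# Attach the sample annotation to every row
long_view <- merge(long_view, sample_info, by = "sample", sort = FALSE)

# Rows = features x samples
nrow(long_view) == nrow(counts_matrix) * ncol(counts_matrix)
```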

But it is a SummarizedExperiment object after all

Assays(pasilla_tidy)

## An object of class "SimpleAssays"
## Slot "data":
## List of length 1
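Other standard SummarizedExperiment accessors should work the same way. A short sketch, assuming the SummarizedExperiment package is attached alongside tidySummarizedExperiment and that the bundled pasilla data matches the output shown in this document:

```r
library(SummarizedExperiment)
library(tidySummarizedExperiment)

pasilla_tidy <- tidySummarizedExperiment::pasilla

# Features x samples, matching the Transcripts/Samples header printed above
dim(pasilla_tidy)

# Names of the stored assays
assayNames(pasilla_tidy)

# Sample annotation as a DataFrame (condition, type)
colData(pasilla_tidy)

# The raw count matrix for the first few transcripts
head(assay(pasilla_tidy, "counts"))
```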

Tidyverse commands

We can use tidyverse commands to explore the tidy SummarizedExperiment object.

We can use slice to choose rows by position, for example to choose the first row.

pasilla_tidy %>%
    slice(1)

## # A SummarizedExperiment-tibble abstraction: 1 x 5
## # Transcripts=1 | Samples=1 | Assays=counts
##   feature     sample counts condition type      
##   <chr>       <chr>   <int> <chr>     <chr>     
## 1 FBgn0000003 untrt1      0 untreated single_end

We can use filter to choose rows by criteria.

pasilla_tidy %>%
    filter(condition == "untreated")

## # A SummarizedExperiment-tibble abstraction: 58,396 x 5
## # Transcripts=14599 | Samples=4 | Assays=counts
##    feature     sample counts condition type      
##    <chr>       <chr>   <int> <chr>     <chr>     
##  1 FBgn0000003 untrt1      0 untreated single_end
##  2 FBgn0000008 untrt1     92 untreated single_end
##  3 FBgn0000014 untrt1      5 untreated single_end
##  4 FBgn0000015 untrt1      0 untreated single_end
##  5 FBgn0000017 untrt1   4664 untreated single_end
##  6 FBgn0000018 untrt1    583 untreated single_end
##  7 FBgn0000022 untrt1      0 untreated single_end
##  8 FBgn0000024 untrt1     10 untreated single_end
##  9 FBgn0000028 untrt1      0 untreated single_end
## 10 FBgn0000032 untrt1   1446 untreated single_end
## # … with 40 more rows
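Several criteria can be passed to filter at once; dplyr combines comma-separated conditions with a logical AND. The sketch below assumes the sample annotation columns condition and type shown above, and loads dplyr explicitly so that filter is unambiguous. The name untreated_paired is only illustrative.

```r
library(dplyr)
library(tidySummarizedExperiment)

pasilla_tidy <- tidySummarizedExperiment::pasilla

# Keep only untreated samples sequenced as paired-end;
# comma-separated conditions in filter() are combined with AND
untreated_paired <- pasilla_tidy %>%
    filter(condition == "untreated", type == "paired_end")

# The result is still a SummarizedExperiment, so ncol() counts samples
ncol(untreated_paired)
```

Because the filter only involves sample-level columns, whole samples are dropped and all transcripts are kept.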

We can use select to choose columns.

pasilla_tidy %>%
    select(sample)

## # A tibble: 102,193 x 1
##    sample
##    <chr> 
##  1 untrt1
##  2 untrt1
##  3 untrt1
##  4 untrt1
##  5 untrt1
##  6 untrt1
##  7 untrt1
##  8 untrt1
##  9 untrt1
## 10 untrt1
## # … with 102,183 more rows
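Note that the header above reads "A tibble": selecting only the sample column returns a plain tibble rather than the SummarizedExperiment abstraction. When a plain tibble is wanted explicitly, as_tibble from the utilities table can be called directly. A minimal sketch, assuming the tibble package is installed (the name pasilla_tbl is illustrative):

```r
library(tibble)
library(tidySummarizedExperiment)

pasilla_tidy <- tidySummarizedExperiment::pasilla

# Explicit conversion: one row per transcript-sample pair,
# without the SummarizedExperiment layer
pasilla_tbl <- as_tibble(pasilla_tidy)

class(pasilla_tbl)
nrow(pasilla_tbl)
```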

We can use count to count how many rows we have for each sample.

pasilla_tidy %>%
    count(sample)

## # A tibble: 7 x 2
##   sample     n
##   <chr>  <int>
## 1 trt1   14599
## 2 trt2   14599
## 3 trt3   14599
## 4 untrt1 14599
## 5 untrt2 14599
## 6 untrt3 14599
## 7 untrt4 14599

We can use distinct to see what distinct sample information we have.

pasilla_tidy %>%
    distinct(sample, condition, type)

## # A tibble: 7 x 3
##   sample condition type      
##   <chr>  <chr>     <chr>     
## 1 untrt1 untreated single_end
## 2 untrt2 untreated single_end
## 3 untrt3 untreated paired_end
## 4 untrt4 untreated paired_end
## 5 trt1   treated   single_end
## 6 trt2   treated   paired_end
## 7 trt3   treated   paired_end

We could use rename to rename a column. For example, to modify the type column name.

pasilla_tidy %>%
    rename(sequencing=type)

## # A SummarizedExperiment-tibble abstraction: 102,193 x 5
## # Transcripts=14599 | Samples=7 | Assays=counts
##    feature     sample counts condition sequencing
##    <chr>       <chr>   <int> <chr>     <chr>     
##  1 FBgn0000003 untrt1      0 untreated single_end
##  2 FBgn0000008 untrt1     92 untreated single_end
##  3 FBgn0000014 untrt1      5 untreated single_end
##  4 FBgn0000015 untrt1      0 untreated single_end
##  5 FBgn0000017 untrt1   4664 untreated single_end
##  6 FBgn0000018 untrt1    583 untreated single_end
##  7 FBgn0000022 untrt1      0 untreated single_end
##  8 FBgn0000024 untrt1     10 untreated single_end
##  9 FBgn0000028 untrt1      0 untreated single_end
## 10 FBgn0000032 untrt1   1446 untreated single_end
## # … with 40 more rows

We could use mutate to create a column. For example, we could create a new type column that contains single and paired instead of single_end and paired_end.

pasilla_tidy %>%
    mutate(type=gsub("_end", "", type))

## # A SummarizedExperiment-tibble abstraction: 102,193 x 5
## # Transcripts=14599 | Samples=7 | Assays=counts
##    feature     sample counts condition type  
##    <chr>       <chr>   <int> <chr>     <chr> 
##  1 FBgn0000003 untrt1      0 untreated single
##  2 FBgn0000008 untrt1     92 untreated single
##  3 FBgn0000014 untrt1      5 untreated single
##  4 FBgn0000015 untrt1      0 untreated single
##  5 FBgn0000017 untrt1   4664 untreated single
##  6 FBgn0000018 untrt1    583 untreated single
##  7 FBgn0000022 untrt1      0 untreated single
##  8 FBgn0000024 untrt1     10 untreated single
##  9 FBgn0000028 untrt1      0 untreated single
## 10 FBgn0000032 untrt1   1446 untreated single
## # … with 40 more rows

We could use unite to combine multiple columns into a single column.

pasilla_tidy %>%
    unite("group", c(condition, type))

## # A SummarizedExperiment-tibble abstraction: 102,193 x 4
## # Transcripts=14599 | Samples=7 | Assays=counts
##    feature     sample counts group               
##    <chr>       <chr>   <int> <chr>               
##  1 FBgn0000003 untrt1      0 untreated_single_end
##  2 FBgn0000008 untrt1     92 untreated_single_end
##  3 FBgn0000014 untrt1      5 untreated_single_end
##  4 FBgn0000015 untrt1      0 untreated_single_end
##  5 FBgn0000017 untrt1   4664 untreated_single_end
##  6 FBgn0000018 untrt1    583 untreated_single_end
##  7 FBgn0000022 untrt1      0 untreated_single_end
##  8 FBgn0000024 untrt1     10 untreated_single_end
##  9 FBgn0000028 untrt1      0 untreated_single_end
## 10 FBgn0000032 untrt1   1446 untreated_single_end
## # … with 40 more rows

We can also combine commands with the tidyverse pipe %>%.

For example, we could combine group_by and summarise to get the total counts for each sample.

pasilla_tidy %>%
    group_by(sample) %>%
    summarise(total_counts=sum(counts))

## # A tibble: 7 x 2
##   sample total_counts
##   <chr>         <int>
## 1 trt1       18670279
## 2 trt2        9571826
## 3 trt3       10343856
## 4 untrt1     13972512
## 5 untrt2     21911438
## 6 untrt3      8358426
## 7 untrt4      9841335
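Because the object is still a SummarizedExperiment, these per-sample totals can be cross-checked with ordinary matrix operations on the counts assay. A sketch, assuming the assay columns are named after the samples (total_counts is just an illustrative variable name):

```r
library(SummarizedExperiment)
library(tidySummarizedExperiment)

pasilla_tidy <- tidySummarizedExperiment::pasilla

# Column sums of the count matrix: one total per sample
total_counts <- colSums(assay(pasilla_tidy, "counts"))

# Sort by sample name to line up with the summarise() output
total_counts[order(names(total_counts))]
```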

We could combine group_by, mutate and filter to get the transcripts with mean count > 0.

pasilla_tidy %>%
    group_by(feature) %>%
    mutate(mean_count=mean(counts)) %>%
    filter(mean_count > 0)

## # A tibble: 86,513 x 6
## # Groups:   feature [12,359]
##    feature     sample counts condition type       mean_count
##    <chr>       <chr>   <int> <chr>     <chr>           <dbl>
##  1 FBgn0000003 untrt1      0 untreated single_end      0.143
##  2 FBgn0000008 untrt1     92 untreated single_end     99.6  
##  3 FBgn0000014 untrt1      5 untreated single_end      1.43 
##  4 FBgn0000015 untrt1      0 untreated single_end      0.857
##  5 FBgn0000017 untrt1   4664 untreated single_end   4672.   
##  6 FBgn0000018 untrt1    583 untreated single_end    461.   
##  7 FBgn0000022 untrt1      0 untreated single_end      0.143
##  8 FBgn0000024 untrt1     10 untreated single_end      7    
##  9 FBgn0000028 untrt1      0 untreated single_end      0.429
## 10 FBgn0000032 untrt1   1446 untreated single_end   1085.   
## # … with 86,503 more rows
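The pipeline above returns a plain grouped tibble, as its header shows. When the filtered data should remain a SummarizedExperiment, one alternative is to compute the per-transcript mean on the count matrix and subset rows with standard Bioconductor indexing. This is a sketch of that alternative, not the tidy approach described here; mean_count and pasilla_expressed are illustrative names.

```r
library(SummarizedExperiment)
library(tidySummarizedExperiment)

pasilla_tidy <- tidySummarizedExperiment::pasilla

# Mean count per transcript across all samples
mean_count <- rowMeans(assay(pasilla_tidy, "counts"))

# Logical row subsetting keeps the SummarizedExperiment class
pasilla_expressed <- pasilla_tidy[mean_count > 0, ]

# Number of transcripts retained
nrow(pasilla_expressed)
```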

Plotting

# Shared plot styling: Brewer "Set1" palettes plus a minimal black-and-white theme
my_theme <-
    list(
        scale_fill_brewer(palette="Set1"),
        scale_color_brewer(palette="Set1"),
        theme_bw() +
            theme(
                panel.border=element_blank(),
                axis.line=element_line(),
                panel.grid.major=element_line(size=0.2),
                panel.grid.minor=element_line(size=0.1),
                text=element_text(size=12),
                legend.position="bottom",
                aspect.ratio=1,
                strip.background=element_blank(),
                axis.title.x=element_text(margin=margin(t=10, r=10, b=10, l=10)),
                axis.title.y=element_text(margin=margin(t=10, r=10, b=10, l=10))
            )
    )

We can treat pasilla_tidy as a normal tibble for plotting.

Here we plot the distribution of counts per sample.

pasilla_tidy %>%
    tidySummarizedExperiment::ggplot(aes(counts + 1, group=sample, color=`type`)) +
    geom_density() +
    scale_x_log10() +
    my_theme

Figure (chunk plot1): density of counts + 1 for each sample on a log10 x-axis, colored by type.

Session Info

sessionInfo()

## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] tidySummarizedExperiment_1.2.0 SummarizedExperiment_1.22.0   
##  [3] Biobase_2.52.0                 GenomicRanges_1.44.0          
##  [5] GenomeInfoDb_1.28.0            IRanges_2.26.0                
##  [7] S4Vectors_0.30.0               BiocGenerics_0.38.0           
##  [9] MatrixGenerics_1.4.0           matrixStats_0.58.0            
## [11] ggplot2_3.3.3                  knitr_1.33                    
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.1       xfun_0.23              purrr_0.3.4           
##  [4] lattice_0.20-44        colorspace_2.0-1       vctrs_0.3.8           
##  [7] generics_0.1.0         viridisLite_0.4.0      htmltools_0.5.1.1     
## [10] utf8_1.2.1             plotly_4.9.3           rlang_0.4.11          
## [13] pillar_1.6.1           glue_1.4.2             withr_2.4.2           
## [16] DBI_1.1.1              RColorBrewer_1.1-2     GenomeInfoDbData_1.2.6
## [19] lifecycle_1.0.0        stringr_1.4.0          zlibbioc_1.38.0       
## [22] munsell_0.5.0          gtable_0.3.0           htmlwidgets_1.5.3     
## [25] evaluate_0.14          labeling_0.4.2         ps_1.6.0              
## [28] fansi_0.4.2            highr_0.9              scales_1.1.1          
## [31] DelayedArray_0.18.0    jsonlite_1.7.2         XVector_0.32.0        
## [34] farver_2.1.0           digest_0.6.27          stringi_1.6.2         
## [37] dplyr_1.0.6            grid_4.1.0             cli_2.5.0             
## [40] tools_4.1.0            bitops_1.0-7           magrittr_2.0.1        
## [43] lazyeval_0.2.2         RCurl_1.98-1.3         tibble_3.1.2          
## [46] crayon_1.4.1           tidyr_1.1.3            pkgconfig_2.0.3       
## [49] ellipsis_0.3.2         Matrix_1.3-3           data.table_1.14.0     
## [52] rstudioapi_0.13        assertthat_0.2.1       httr_1.4.2            
## [55] R6_2.5.0               compiler_4.1.0