--- title: "An introduction to iscream" author: - name: James Eapen affiliation: - Department of Epigenetics, Van Andel Institute, Grand Rapids, MI output: BiocStyle::html_document: toc_float: true vignette: > %\VignetteIndexEntry{An introduction to iscream} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: refs.bib link-citations: yes --- ```{r setup, include=FALSE} options("datatable.print.nrows" = 50) ``` ## Introduction iscream is a BED file querying package that can can retrieve records within regions of interest from multiple [tabixed](https://en.wikipedia.org/wiki/Tabix) BED files. It can make queries like tabix, summarize data within regions and make matrices across input files and regions. All operations can be done in parallel and return R/Bioconductor data structures. iscream was designed to be memory efficient and fast on large datasets. This vignette demonstrates iscream's functions on a small dataset. For more information on performance and considerations on larger real datasets see `vignette("performance")`. ## Installation iscream may be installed from with ```{r install, eval=FALSE} if (!require("BiocManager")) install.packages("BiocManager") BiocManager::install("iscream") ``` Although iscream will compile using `Rhtslib`, it will perform best if compiled against [htslib](https://www.htslib.org/download/) which has itself been compiled with `libdeflate`. See `vignette("htslib")` for more information. ### Loading iscream ```{r load_iscream} library(iscream) set_threads(2) ``` On load, iscream will inform the user about the number if threads it is set to use. This is configurable with either `set_threads()` or by setting `option("iscream.threads")`, which can be done in your `.Rprofile` to set it on startup. Once loaded you use the following querying functions. ## `tabix()` tabix is a command-line tool that queries records from compressed BED files. As input it receives the path to a tabixed BED file and one or more genomic regions of interest. It queries and prints the records from the tabixed BED file that fall within the regions of interest. This output may be redirected to a file and read into R. iscream's `tabix()` function works in a similar way but supports multiple files and multiple input formats all in one function. It receives a vector of paths to tabixed BED files and regions of interest. The input regions may be a vector of `"chr:start-end"` strings, a data frame with `"chr"`, `"start"`, and `"end"` columns or a `GRanges` object. It queries the BED files and returns the records as a parsed `r CRANpkg("data.table")`. To get a `GRanges` object instead, use `tabix_gr`. Using iscream eliminates context switching between the shell and R for tabix queries since the queries are made in R and the result is an R data structure. `r Biocpkg("Rsamtools")` is another R package that can query data from BED files, but it only supports one file at a time, only accepts `GRanges` as the input regions and does not parse the tab-delimited strings. See `vignette("tabix")` for a comparison of iscream and Rsamtools. ### Setup Using `list.files()`, you can get the paths to all compressed bed files in a directory. These files contain small regions from chromosome 1 and Y from the snmC-seq2 methylation data [@luo2018a]. ```{r load_data} data_dir <- system.file("extdata", package = "iscream") (bedfiles <- list.files( data_dir, pattern = "cell[1-5].bed.gz$", full.names = TRUE)) ``` For demonstration, I'm using two regions. ```{r make_regions} regions <- c("chr1:184577-680065", "chrY:56877780-56882524") ``` **Note**: in this dataset the chromosome identifier is in a "chr[number]" format. However, in some datasets, the chromosome ID is just a number or a letter like '1', or 'Y'. To see the chromosome ID format across your files you can use `query_chroms()` to get all unique chromosomes across all input files: ```{r query_chroms} query_chroms(bedfiles) ``` ### Making queries `tabix()` one file: ```{r tabix_one} tabix(bedfiles[1], regions) ``` With multiple files, the output contains a column for the file name. ```{r tabix_all} tabix(bedfiles, regions) ``` ### Using `r Biocpkg("GenomicRanges")` Both `tabix()` and `tabix_gr()` accept `GRanges` objects as input regions but `tabix_gr()` returns `GRanges` objects instead of data frames. `tabix_gr` will also preserve any input `GRanges` metadata. ```{r tabix_gr, message=FALSE} if (!require("GenomicRanges", quietly = TRUE)) { stop("The 'GenomicRanges' package must be installed for this functionality") } gr <- GRanges(regions) tabix_gr(bedfiles, gr) ``` ### Setting column names You can set the result data frame's column names or the `GRanges` `mcols` with `col.names`: ```{r tabix_colnames} tabix_gr(bedfiles, regions, col.names = c("beta", "coverage")) ``` ### Rsamtools-style output `tabix_raw()` will return an unparsed named list like `r Biocpkg("Rsamtools")`, but with multi-file support: ```{r tabix_raw} tabix_raw(bedfiles, regions) ``` ### WGBS BED files Since iscream was originally developed to read Whole Genome Bisulfite Sequencing data, it has the `aligner` argument for automatically setting column names for the BISCUIT [@zhou2024], Bismark [@krueger2011], and BSBolt [@farrell2021]. ```{r tabix_wgbs} tabix_gr(bedfiles, regions, aligner = "biscuit") ``` ## `summarize_regions()` Using the same BED files and regions, iscream can also summarize the data within regions. This requires providing the index of the data columns you want summarized and the column names for the output. All summarizing functions are run on each input column by default - see `?summarize_regions` for supported functions. ```{r summary} summarize_regions( bedfiles, regions, columns = c(4, 5), col_names = c("beta", "coverage") ) ``` The `feature` column here contains the genomic region coordinates, but can be set to something more informational if you have names for the regions. You can also select the functions applied with `fun`: ```{r summary_named} names(regions) <- c("R1", "R2") summarize_regions( bedfiles, regions, fun = c("mean", "sum"), columns = 5, col_names = "coverage" ) ``` ### WGBS BED files For WGBS data specifically, you can use `summarize_meth_regions()`, which takes the `aligner` argument to correctly parse the columns. ```{r summarize_meth_regions} summarize_meth_regions( bedfiles, regions, aligner = "biscuit", fun = c("mean", "sum") ) ``` ## `make_mat()` `make_mat()` makes a matrix where every row is a locus found in at least one input file and the columns are the input files. It returns a named list of the matrix, coordinates, and file names you can use for analysis or other data structures. `make_mat_se()` returns a `RangedSummarizedExperiment`, and `make_mat_gr()` returns a `GRanges` object. ```{r make_mat, message=FALSE} if (!require("SummarizedExperiment", quietly = TRUE)) { stop("The 'SummarizedExperiment' package must be installed for this functionality") } (mat <- make_mat_se(bedfiles, regions, column = 4, mat_name = "beta")) head(assay(mat), 10) ``` If you have sparse data, you can save memory with `sparse = TRUE`, but only for `make_mat()` and `make_mat_se()`. ```{r sparse} mat <- make_mat(bedfiles, regions, column = 4, mat_name = "beta", sparse = TRUE) head(mat$beta, 10) ``` ### WGBS BED files For a `r Biocpkg("BSseq")` compatible object, use `make_meth_mat` which returns a list of BSseq inputs. To make the `BSseq` object run ```{r, message=FALSE} if (require("bsseq", quietly = TRUE)) { meth_mat <- make_mat_bsseq(bedfiles, regions) do.call(BSseq, meth_mat) } ``` ## Further reading While this vignette used a toy dataset, `vignette("performance")` has tips on getting the best performance from iscream using a real and larger dataset with 100 files. `vignette("TSS")` is an example of iscream's application to examine methylation around transcription start site on another real dataset. For more information on the supported Bioconductor data structures and conversions see `vignette("data_structures")`. ## Session info ```{r si} sessionInfo() ``` ## References