---
title: "An introduction to iscream"
author:
- name: James Eapen
  affiliation:
  - Department of Epigenetics, Van Andel Institute, Grand Rapids, MI
output:
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{An introduction to iscream}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: refs.bib
link-citations: yes
---

```{r setup, include=FALSE}
options("datatable.print.nrows" = 50)
```

## Introduction

iscream is a BED file querying package that can can retrieve records within
regions of interest from multiple [tabixed](https://en.wikipedia.org/wiki/Tabix)
BED files. It can make queries like tabix, summarize data within regions and
make matrices across input files and regions. All operations can be done in
parallel and return R/Bioconductor data structures.

iscream was designed to be memory efficient and fast on large datasets. This
vignette demonstrates iscream's functions on a small dataset. For more
information on performance and considerations on larger real datasets see
`vignette("performance")`.

## Installation

iscream may be installed from <https://bioconductor.org> with

```{r install, eval=FALSE}
if (!require("BiocManager"))
    install.packages("BiocManager")
BiocManager::install("iscream")
```

Although iscream will compile using `Rhtslib`, it will perform best if compiled
against [htslib](https://www.htslib.org/download/) which has itself been
compiled with `libdeflate`. See `vignette("htslib")` for more information.

### Loading iscream

```{r load_iscream}
library(iscream)
set_threads(2)
```

On load, iscream will inform the user about the number if threads it is set to
use. This is configurable with either `set_threads()` or by setting
`option("iscream.threads")`, which can be done in your `.Rprofile` to set it on
startup. Once loaded you use the following querying functions.

## `tabix()`

tabix is a command-line tool that queries records from compressed BED files. As
input it receives the path to a tabixed BED file and one or more genomic regions
of interest. It queries and prints the records from the tabixed BED file that
fall within the regions of interest. This output may be redirected to a file and
read into R.

iscream's `tabix()` function works in a similar way but supports multiple files
and multiple input formats all in one function. It receives a vector of paths to
tabixed BED files and regions of interest. The input regions may be a vector of
`"chr:start-end"` strings, a data frame with `"chr"`, `"start"`, and `"end"`
columns or a `GRanges` object. It queries the BED files and returns the records
as a parsed `r CRANpkg("data.table")`. To get a `GRanges` object instead, use
`tabix_gr`.

Using iscream eliminates context switching between the shell and R for tabix
queries since the queries are made in R and the result is an R data structure.
`r Biocpkg("Rsamtools")` is another R package that can query data from BED
files, but it only supports one file at a time, only accepts `GRanges` as the
input regions and does not parse the tab-delimited strings. See
`vignette("tabix")` for a comparison of iscream and Rsamtools.

### Setup

Using `list.files()`, you can get the paths to all compressed bed files in a
directory. These files contain small regions from chromosome 1 and Y from the
snmC-seq2 methylation data [@luo2018a].

```{r load_data}
data_dir <- system.file("extdata", package = "iscream")
(bedfiles <- list.files(
  data_dir,
  pattern = "cell[1-5].bed.gz$",
  full.names = TRUE))
```

For demonstration, I'm using two regions.

```{r make_regions}
regions <- c("chr1:184577-680065", "chrY:56877780-56882524")
```

**Note**: in this dataset the chromosome identifier is in a "chr[number]"
format. However, in some datasets, the chromosome ID is just a number or a
letter like '1', or 'Y'. To see the chromosome ID format across your files you
can use `query_chroms()` to get all unique chromosomes across all input files:

```{r query_chroms}
query_chroms(bedfiles)
```

### Making queries

`tabix()` one file:

```{r tabix_one}
tabix(bedfiles[1], regions)
```

With multiple files, the output contains a column for the file name.

```{r tabix_all}
tabix(bedfiles, regions)
```

### Using `r Biocpkg("GenomicRanges")`

Both `tabix()` and `tabix_gr()` accept `GRanges` objects as input regions but
`tabix_gr()` returns `GRanges` objects instead of data frames. `tabix_gr` will
also preserve any input `GRanges` metadata.

```{r tabix_gr, message=FALSE}
if (!require("GenomicRanges", quietly = TRUE)) {
  stop("The 'GenomicRanges' package must be installed for this functionality")
}
gr <- GRanges(regions)
tabix_gr(bedfiles, gr)
```

### Setting column names

You can set the result data frame's column names or the `GRanges` `mcols` with
`col.names`:

```{r tabix_colnames}
tabix_gr(bedfiles, regions, col.names = c("beta", "coverage"))
```

### Rsamtools-style output

`tabix_raw()` will return an unparsed named list like `r Biocpkg("Rsamtools")`,
but with multi-file support:

```{r tabix_raw}
tabix_raw(bedfiles, regions)
```

### WGBS BED files

Since iscream was originally developed to read Whole Genome Bisulfite Sequencing
data, it has the `aligner` argument for automatically setting column names for
the BISCUIT [@zhou2024], Bismark [@krueger2011], and BSBolt [@farrell2021].

```{r tabix_wgbs}
tabix_gr(bedfiles, regions, aligner = "biscuit")
```

## `summarize_regions()`

Using the same BED files and regions, iscream can also summarize the data within
regions. This requires providing the index of the data columns you want
summarized and the column names for the output. All summarizing functions are
run on each input column by default - see `?summarize_regions` for supported
functions.

```{r summary}
summarize_regions(
  bedfiles,
  regions,
  columns = c(4, 5),
  col_names = c("beta", "coverage")
)
```

The `feature` column here contains the genomic region coordinates, but can be
set to something more informational if you have names for the regions. You can
also select the functions applied with `fun`:


```{r summary_named}
names(regions) <- c("R1", "R2")
summarize_regions(
  bedfiles,
  regions,
  fun = c("mean", "sum"),
  columns = 5,
  col_names = "coverage"
)
```

### WGBS BED files

For WGBS data specifically, you can use `summarize_meth_regions()`, which takes
the `aligner` argument to correctly parse the columns.

```{r summarize_meth_regions}
summarize_meth_regions(
  bedfiles,
  regions,
  aligner = "biscuit",
  fun = c("mean", "sum")
)
```


## `make_mat()`

`make_mat()` makes a matrix where every row is a locus found in at least one input
file and the columns are the input files. It returns a named list of the matrix,
coordinates, and file names you can use for analysis or other data structures.
`make_mat_se()` returns a `RangedSummarizedExperiment`, and `make_mat_gr()`
returns a `GRanges` object.

```{r make_mat, message=FALSE}
if (!require("SummarizedExperiment", quietly = TRUE)) {
  stop("The 'SummarizedExperiment' package must be installed for this functionality")
}
(mat <- make_mat_se(bedfiles, regions, column = 4, mat_name = "beta"))
head(assay(mat), 10)
```

If you have sparse data, you can save memory with `sparse = TRUE`, but only for
`make_mat()` and `make_mat_se()`.

```{r sparse}
mat <- make_mat(bedfiles, regions, column = 4, mat_name = "beta", sparse = TRUE)
head(mat$beta, 10)
```

### WGBS BED files

For a `r Biocpkg("BSseq")` compatible object, use `make_meth_mat` which returns
a list of BSseq inputs. To make the `BSseq` object run

```{r, message=FALSE}
if (require("bsseq", quietly = TRUE)) {
  meth_mat <- make_mat_bsseq(bedfiles, regions)
  do.call(BSseq, meth_mat)
}
```

## Further reading

While this vignette used a toy dataset, `vignette("performance")` has tips on
getting the best performance from iscream using a real and larger dataset with
100 files. `vignette("TSS")` is an example of iscream's application to examine
methylation around transcription start site on another real dataset. For more
information on the supported Bioconductor data structures and conversions see
`vignette("data_structures")`.

## Session info

```{r si}
sessionInfo()
```

## References