--- title: "The *exomePeak2* user's guide" author: - name: Zhen Wei affiliation: - Department of Biological Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu, 215123, China - Institute of Integrative Biology, University of Liverpool, L7 8TX, Liverpool, United Kingdom email: ZhenWei01@xjtlu.edu.cn - name: Jia Meng affiliation: - Department of Biological Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu, 215123, China - Institute of Integrative Biology, University of Liverpool, L7 8TX, Liverpool, United Kingdom email: JiaMeng@xjtlu.edu.cn date: "`r Sys.Date()`" output: BiocStyle::html_document: toc: true graphics: yes vignette: > %\VignetteIndexEntry{The exomePeak2 user's guide} %\VignetteEngine{knitr::rmarkdown} %\VignettePackage{exomePeak2} %\VignetteEncoding{UTF-8} --- ```{r para, echo = FALSE, results='hide'} knitr::opts_chunk$set(dev="png",fig.show="hold", fig.width=8,fig.height=4.5,fig.align="center", message=FALSE,collapse=TRUE) ``` # Introduction **exomePeak2** provides bias awared quantification and peak detection on Methylated RNA immunoprecipitation sequencing data (**MeRIP-Seq**). *MeRIP-Seq* is a commonly applied sequencing technology to measure the transcriptome-wide location and abundance of RNA modification sites under a given cellular condition. However, the quantification and peak calling in *MeRIP-Seq* are sensitive to PCR amplification bias which is prevalent in next generation sequencing (NGS) techniques. In addition, the **RNA-Seq** based count data exhibits biological variation in small reads count. *exomePeak2* collectively address these challanges by introducing a rich set of robust data science models tailored for MeRIP-Seq. With *exomePeak2*, users can perform peak calling, modification site quantification, and differential analysis with a straightforward one-step function. Alternatively, users could define personalized methods for their own analysis through multi-step functions and diagnostic plots. # Installation To install exomePeak2 from bioconductor, sart R (version >"3.6") and enter: ```{r, eval=FALSE} if(!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("exomePeak2") ``` For order versions of R, please refer to the appropriate [Bioconductor release](https://www.bioconductor.org/about/release-announcements/}. # Peak Calling For peak calling of *MeRIP-Seq* experiment, exomePeak2 demands the reads alignment results in **BAM** files. Users can specify the biological replicates of the IP and input samples by a character vector of the corresponding **BAM** directories at the arguments `bam_ip` and `bam_input` separately. In the following example, the transcript annotation is provided using GFF files. Transcript annotation can also be provided by the `TxDb` object. exomePeak2 will automatically download the TxDb if the `genome` argument is filled with the corresponding UCSC genome name. The genome sequence is required to conduct GC content bias correction. If the `genome` argument is missing ( `= NULL` ), exomPeak2 will perform peak calling without correcting the GC content bias. ```{r, eval = TRUE} library(exomePeak2) set.seed(1) GENE_ANNO_GTF = system.file("extdata", "example.gtf", package="exomePeak2") f1 = system.file("extdata", "IP1.bam", package="exomePeak2") f2 = system.file("extdata", "IP2.bam", package="exomePeak2") f3 = system.file("extdata", "IP3.bam", package="exomePeak2") f4 = system.file("extdata", "IP4.bam", package="exomePeak2") IP_BAM = c(f1,f2,f3,f4) f1 = system.file("extdata", "Input1.bam", package="exomePeak2") f2 = system.file("extdata", "Input2.bam", package="exomePeak2") f3 = system.file("extdata", "Input3.bam", package="exomePeak2") INPUT_BAM = c(f1,f2,f3) exomePeak2(bam_ip = IP_BAM, bam_input = INPUT_BAM, gff_dir = GENE_ANNO_GTF, genome = "hg19", paired_end = FALSE) ``` exomePeak2 will export the modification peaks in formats of **BED** file and **CSV** table, the data will be saved automatically under a folder named by `exomePeak2_output`. The modification peak statistics are derived from the ${\beta}_{i,1}$ terms in the following linear regression design. $$log2(Q_{i,j}) = {\beta}_{i,0} + {\beta}_{i,1}I(\rho(j)=IP) + t_{i,j}$$ $Q_{i,j}$ is the expected value of reads abundence of modification $i$ under sample $j$. ${\beta}_{i,0}$ is the intercept coefficient, ${\beta}_{i,1}$ is the coefficient for IP/input log2 fold change, $I(\rho(j)=IP)$ is the regression covariate that is the indicator variable for the sample $j$ being IP sample. $t_{i,j}$ is the regression offset that account for the sequencing depth variation and the GC content biases. Under the default settings, the linear models fitted are the regularized **GLM (Generalized Linear Model)** of NB developed by **DESeq2**. If one of the IP and input group has no biological replicates, Poisson GLMs will be fitted to the modification peaks. Explaination over the columns of the exported table: - ***chr***: the chromosomal name of the peak. - ***chromStart***: the start of the peak on the chromosome. - ***chromEnd***: the end of the peak on the chromosome. - ***name***: the unique ID of the modification peak. - ***score***: the -log2 p value of the peak. - ***strand***: the strand of the peak on genome. - ***thickStart***: the start position of the peak. - ***thickEnd***: the end position of the peak. - ***itemRgb***: the column for the RGB encoded color in BED file visualization. - ***blockCount***: the block (exon) number within the peak. - ***blockSizes***: the widths of blocks. - ***blockStarts***: the start positions of blocks. - ***geneID***: the gene ID of the peak. - ***ReadsCount.input***: the reads count of the input sample. - ***ReadsCount.IP***: the reads count of the IP sample. - ***log2FoldChange***: the estimates of IP over input log2 fold enrichment (coefficient estimates of ${\beta}_{i,1}$). - ***pvalue***: the Wald test p value on the modification coefficient. - ***padj***: the adjusted Wald test p value using BH approach. # Differential Modification Analysis The code below could conduct differential modification analysis (Comparison of Two Conditions) on exon regions defined by the transcript annotation. In differential modification mode, exomePeak2 will first perform Peak calling on exon regions using both the control and treated samples. Then, it will conduct the differential modification analysis on peaks reported from peak calling using an interactive GLM. ```{r, eval = TRUE} f1 = system.file("extdata", "treated_IP1.bam", package="exomePeak2") TREATED_IP_BAM = c(f1) f1 = system.file("extdata", "treated_Input1.bam", package="exomePeak2") TREATED_INPUT_BAM = c(f1) exomePeak2(bam_ip = IP_BAM, bam_input = INPUT_BAM, bam_treated_input = TREATED_INPUT_BAM, bam_treated_ip = TREATED_IP_BAM, gff_dir = GENE_ANNO_GTF, genome = "hg19", paired_end = FALSE) ``` In differential modification mode, exomePeak2 will export the differential modification peaks in formats of **BED** file and **CSV** table, the data will also be saved automatically under a folder named by `exomePeak2_output`. The peak statistics in differential modification setting are derived from the interactive coefficient ${\beta}_{i,3}$ in the following regression design of the **NB GLM**: $$log2(Q_{i,j}) = {\beta}_{i,0} + {\beta}_{i,1}I(\rho(j)=IP) + {\beta}_{i,2}I(\rho(j)=Treatment) + {\beta}_{i,3}I(\rho(j)=IP\&Treatment) + t_{i,j}$$ Explaination for the additional table columns: - ***ModLog2FC_control***: the modification log2 fold enrichment in the control condition. - ***ModLog2FC_treated***: the modification log2 fold enrichment in the treatment condition. - ***DiffModLog2FC***: the log2 Fold Change estimates of differential modification (coefficient estimates of ${\beta}_{i,3}$). - ***pvalue***: the Wald test p value on the differential modification coefficient. - ***padj***: the adjusted Wald test p value using BH approach. # Quantification and Statistical Analysis with Single Based Modification Annotation exomePeak2 supports the modification quantification and differential modification analysis on single based modification annotation. The modification sites with single based resolution can provide a more accurate mapping of modification locations compared with the peaks called directly from the MeRIP-seq datasets. Some of the datasets in epitranscriptomics have a single based resolution, e.x. Data generated by the *m6A-CLIP-Seq* or *m6A-miCLIP-Seq* techniques. Reads count on the single based modification sites could provide a more accurate and consistent quantification on *MeRIP-Seq* experiments due to the elimination of the technical variation introduced by the feature lengths. exomePeak2 will automatically initiate the mode of single based modification quantification by providing a sigle based annotation file under the argument `mod_annot`. The single based annotation information should be provided to the exomePeak2 function in the format of a `GRanges` object. ```{r, eval = TRUE} f2 = system.file("extdata", "mod_annot.rds", package="exomePeak2") MOD_ANNO_GRANGE <- readRDS(f2) exomePeak2(bam_ip = IP_BAM, bam_input = INPUT_BAM, gff_dir = GENE_ANNO_GTF, genome = "hg19", paired_end = FALSE, mod_annot = MOD_ANNO_GRANGE) ``` In this mode, exomePeak2 will export the analysis result also in formats of **BED** file and **CSV** table, while each row of the table corresponds to the sites of the annotation `GRanges`. # Peak Calling and Visualization in Multiple Steps The exomePeak2 package can achieve peak calling and peak statistics calulation with multiple functions. **1. Check the bam files of MeRIP-seq data before peak calling.** ```{r, eval = TRUE} MeRIP_Seq_Alignment <- scanMeripBAM( bam_ip = IP_BAM, bam_input = INPUT_BAM, paired_end = FALSE ) ``` For MeRIP-seq experiment with interactive design (contain control and treatment groups), use the following code. ```{r, eval = TRUE} MeRIP_Seq_Alignment <- scanMeripBAM( bam_ip = IP_BAM, bam_input = INPUT_BAM, bam_treated_input = TREATED_INPUT_BAM, bam_treated_ip = TREATED_IP_BAM, paired_end = FALSE ) ``` **2. Conduct peak calling analysis on exons using the provided bam files.** ```{r, eval = TRUE} SummarizedExomePeaks <- exomePeakCalling(merip_bams = MeRIP_Seq_Alignment, gff_dir = GENE_ANNO_GTF, genome = "hg19") ``` Alternatively, use the following code to quantify MeRIP-seq data on single based modification annotation. ```{r, eval = TRUE} SummarizedExomePeaks <- exomePeakCalling(merip_bams = MeRIP_Seq_Alignment, gff_dir = GENE_ANNO_GTF, genome = "hg19", mod_annot = MOD_ANNO_GRANGE) ``` **3. Estimate size factors that are required for GC content bias correction.** ```{r, eval = TRUE} SummarizedExomePeaks <- normalizeGC(SummarizedExomePeaks) ``` **4. Report the statistics of modification peaks using Generalized Linear Model (GLM).** ```{r, eval = FALSE} SummarizedExomePeaks <- glmM(SummarizedExomePeaks) ``` Alternatively, If the treated IP and input bam files are provided, `glmDM` function could be used to conduct differential modification analysis on modification Peaks with interactive GLM. ```{r, eval = TRUE} SummarizedExomePeaks <- glmDM(SummarizedExomePeaks) ``` **5. Generate the scatter plot between GC content and log2 Fold Change (LFC).** ```{r, eval = TRUE, fig.align='center', fig.height = 2.8, fig.width = 5} plotLfcGC(SummarizedExomePeaks) ``` **6. Generate the bar plot for the sequencing depth size factors.** ```{r, eval = TRUE} plotSizeFactors(SummarizedExomePeaks) ``` **7. Export the modification peaks and the peak statistics with user decided format.** ```{r, eval = TRUE} exportResults(SummarizedExomePeaks, format = "BED") ``` # Contact Please contact the maintainer of exomePeak2 if you have encountered any problems: **ZhenWei** : Please visit the github page of exomePeak2: # Session Info ```{r} sessionInfo() ```