--- title: "ChIPpeakAnno FAQs" author: "Jianhong Ou, Jun Yu, Lihua Julie Zhu" output: BiocStyle::html_document date: "`r BiocStyle::doc_date()`" package: "`r BiocStyle::pkg_ver('ChIPpeakAnno')`" vignette: > %\VignetteIndexEntry{ChIPpeakAnno FAQs} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, eval = FALSE, warning = FALSE, error = FALSE) ``` ## Peak Annotation **Q**: How to import peaks? **A**: Function `toGRanges` is designed to import peak files. Users can also import the peaks by `rtracklayer::import` method. **Q**: How to properly annotate the peaks from ChIP-seq data? **A**: Multiple choices are provided in function `annotatePeakInBatch` with the _output_ argument. - If you are interested in nearest features around the peaks, set it to "nearestLocation" or "shortestDistance". - If you are interested in peaks binding to the promoter regions, set it to "nearestBiDirectionalPromoters" to report bidirectional promoters if there are promoters in both directions in the given region (defined by _bindingRegion_). Otherwise, it will report the closest promoter in one direction. - If you are interested in peaks binding to the gene body, set it to "overlapping" will output overlapping features with maximum gap specified as maxgap between peak range and feature range. **Q**: How to prepare the _AnnotationData_ for `annotatePeakInBatch`? **A**: To prepare the lastest released annotations of the assembly will help user getting accurate annotations. ChIPpeakAnno can use multiple sources as _AnnotationData_. `toGRanges` can convert `TxDb` and `EnsDb` to _AnnotationData_. `getAnnotation` will retrieve annotation data from bioMart. _AnnotationData_ could also be a user-defined list in `GenomicRanges::GRanges` format, e.g., another list of peaks. **Q**: If we use `annoPeaks` to annotate the start site of features, why the number of annotated peaks sometimes are smaller than we do `findOverlaps` for the peaks and features? **A**: `annoPeaks` will make sure the downstream annotation range for startSite and upstream annotation range for endSite are within the features. ## Find Overlaps of Peaks **Q**: Why is the sum of peak numbers in the Venn-diagram not equal to the sum of the numbers in the original peaks list? **A**: This question is a very typical one for calculating intersection of peaks. As you know that a peak is a range of continuous points instead of a single point. If we consider intersection of set A (1-2, 4-5, 7-9) with set B (2-8), how many peaks should we output as the overlapping set, ie., 1 or 3? It would be 1 if we uses B as the reference set, and 3 if we use A as the reference set. In ChIPpeakAnno (release version), by default, we are using the minimal number for intersect peaks, ie., 1 in this case. **Q**: Why is the peak number in the Venn-digram not equal to the length of peaklist in the output of `findOverlapsOfPeaks`? **A**: There is an argument called connectedPeaks. In the documentation, we described the argument connectedPeaks as _If multiple peaks involved in overlapping in several groups, set it to "merge" will count it as 1, while set it to "min" will count it as the minimal involved peaks in any group of connected/overlapped peaks_. The default is “min”. By default, the program will select the minimal number from each peaklist involved in one merged peak. So the number will be no less than the number in the peaklist. If user set connectedPeaks to merge, the number will be exactly the same as the number in the peaklist. Here is a simple example to understand the difference between "min" and "merge": ```{r} p1 <- GRanges("1", IRanges(c(1, 4, 7), width=2)) p2 <- GRanges("1", IRanges(c(2, 5), width=3)) ol_min <- findOverlapsOfPeaks(p1, p2, connectedPeaks="min") ## the counts will be the minimal peaks involved in that group of ## connected peaks, so you get 2. ol_merge <- findOverlapsOfPeaks(p1, p2, connectedPeaks="merge") ## the counts will be 1 for each group of connected peaks ol_min$venn_cnt ol_merge$venn_cnt ``` **Q**: Is there a way to show the number of peaks in original peak list? **A**: Try to set connectedPeaks="keepAll" in `findOverlapsOfPeaks` and `makeVennDiagram`. **Q**: How to extract the original peak IDs of the overlapping peaks? **A**: In the output of `findOverlapsOfPeaks`, there is a column in the metadata of each element in peaklist, called peakNames, which is a CharacterList. The CharacterList is a list of the contributing peak ids with prefix, eg. peaks1__peakname1, peaksi__peaknamej. Users can access the original peak name by split those characters. Here is the sample code: ```{r} library(ChIPpeakAnno) bed <- system.file("extdata", "MACS_output.bed", package="ChIPpeakAnno") gr1 <- toGRanges(bed, format="BED", header=FALSE) gff <- system.file("extdata", "GFF_peaks.gff", package="ChIPpeakAnno") gr2 <- toGRanges(gff, format="GFF", header=FALSE, skip=3) ol <- findOverlapsOfPeaks(gr1, gr2) peakNames <- ol$peaklist[['gr1///gr2']]$peakNames library(reshape2) peakNames1 <- melt(peakNames, value.name="merged.peak.id") peakNames1 <- cbind(peakNames1[, 1], do.call(rbind, strsplit(as.character(peakNames1[, 3]), "__"))) colnames(peakNames1) <- c("merged.peak.id", "group", "peakName") head(peakNames1) gr1.subset <- gr1[peakNames1[peakNames1[, "group"] %in% "gr1", "peakName"]] gr2.subset <- gr2[peakNames1[peakNames1[, "group"] %in% "gr2", "peakName"]] ``` Here is an alternative way to access the original peak IDs. ```{r} all.peaks <- ol$all.peaks gr1.renamed <- all.peaks$gr1 gr2.renamed <- all.peaks$gr2 peakNames <- melt(ol$peaklist[['gr1///gr2']]$peakNames, value.name="merged.peak.id") gr1.sub <- gr1.renamed[peakNames[grepl("^gr1", peakNames[, 3]), 3]] gr2.sub <- gr2.renamed[peakNames[grepl("^gr2", peakNames[, 3]), 3]] ``` **Q**: How to select the proper number for totalTest in the function `makeVennDiagram`? **A**: When we test the association between two sets of data based on hypergeometric distribution, the number of all potential binding sites is required. The parameter _totalTest_ in the function `makeVennDiagram` indicates how many potential peaks in total will be used in the hypergeometric test. It should be larger than the largest number of peaks in the peak list. The smaller it is set, the more stringent the test is. The time used to calculate p-value does not depend on the value of the _totalTest_. For practical guidance on how to choose _totalTest_, please refer to the [post](https://stat.ethz.ch/pipermail/bioconductor/2010-November/036540.html).