---
title: "ChIPpeakAnno FAQs"
author: "Jianhong Ou, Jun Yu, Lihua Julie Zhu"
output: BiocStyle::html_document
date: "`r BiocStyle::doc_date()`"
package: "`r BiocStyle::pkg_ver('ChIPpeakAnno')`"
vignette: >
  %\VignetteIndexEntry{ChIPpeakAnno FAQs}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, eval = FALSE, 
                      warning = FALSE, error = FALSE)
```

## Peak Annotation

**Q**: How to import peaks?

**A**: Function `toGRanges` is designed to import peak files. Users can also
import the peaks by `rtracklayer::import` method.

**Q**: How to properly annotate the peaks from ChIP-seq data?

**A**: Multiple choices are provided in function
`annotatePeakInBatch` with the _output_ argument.

- If you are interested in nearest features around the peaks, set it to 
"nearestLocation" or "shortestDistance".

- If you are interested in peaks 
binding to the promoter regions, set it to "nearestBiDirectionalPromoters" to report
bidirectional promoters if there are promoters in both directions in the given 
region (defined by _bindingRegion_). Otherwise, it will report the closest 
promoter in one direction. 

- If you are interested in peaks binding to the gene body, set it to "overlapping" will 
output overlapping features with maximum gap specified as maxgap between peak 
range and feature range.

**Q**: How to prepare the _AnnotationData_ for `annotatePeakInBatch`?

**A**: To prepare the lastest released annotations of the assembly will help 
user getting accurate annotations. 
ChIPpeakAnno can use multiple sources as _AnnotationData_. `toGRanges` can 
convert `TxDb` and `EnsDb` to _AnnotationData_. `getAnnotation` will retrieve
annotation data from bioMart. _AnnotationData_ could also be a user-defined
list in `GenomicRanges::GRanges` format, e.g., another list of peaks.


**Q**: If we use `annoPeaks` to annotate the start site of features, why the number of annotated peaks sometimes are smaller than we do `findOverlaps` for the peaks and features?

**A**: `annoPeaks` will make sure the downstream annotation range for startSite and upstream annotation range for endSite are within the features. 

## Find Overlaps of Peaks
**Q**: Why is the sum of peak numbers in the Venn-diagram not equal to the 
sum of the numbers in the original peaks list?

**A**: This question is a very typical one for calculating intersection of peaks. 
As you know that a peak is a range of continuous points instead of a single point. 
If we consider intersection of set A (1-2, 4-5, 7-9) with set B (2-8), 
how many peaks should we output as the overlapping set, ie., 1 or 3? 
It would be 1 if we uses B as the reference set, and 3 if we use A as the reference set.  
In ChIPpeakAnno (release version), by default, we are using the minimal number 
for intersect peaks, ie., 1 in this case. 

**Q**: Why is the peak number in the Venn-digram not equal to the length of 
peaklist in the output of `findOverlapsOfPeaks`?

**A**: There is an argument called connectedPeaks. In the documentation, 
we described the argument connectedPeaks as _If multiple peaks involved 
in overlapping in several groups, set it to "merge" will count it as 1, 
while set it to "min" will count it as the minimal involved peaks in any 
group of connected/overlapped peaks_. 
The default is “min”. By default, the program will select the minimal number 
from each peaklist involved in one merged peak. 
So the number will be no less than the number in the peaklist. 
If user set connectedPeaks to merge, the number will be exactly the same as the 
number in the peaklist. 
Here is a simple example to understand the difference between "min" and 
"merge":

```{r}
p1 <- GRanges("1", IRanges(c(1, 4, 7), width=2))
p2 <- GRanges("1", IRanges(c(2, 5), width=3))
ol_min <- findOverlapsOfPeaks(p1, p2, connectedPeaks="min") 
## the counts will be the minimal peaks involved in that group of 
## connected peaks, so you get 2.
ol_merge <- findOverlapsOfPeaks(p1, p2, connectedPeaks="merge") 
## the counts will be 1 for each group of connected peaks
ol_min$venn_cnt
ol_merge$venn_cnt
```


**Q**: Is there a way to show the number of peaks in original peak list?

**A**: Try to set connectedPeaks="keepAll" in `findOverlapsOfPeaks` and 
`makeVennDiagram`.

**Q**: How to extract the original peak IDs of the overlapping peaks?

**A**: In the output of `findOverlapsOfPeaks`, there is a column in the metadata 
of each element in peaklist, called peakNames, which is a CharacterList. 
The CharacterList is a list of the contributing peak ids with prefix, 
eg. peaks1__peakname1, peaksi__peaknamej. Users can access the original peak 
name by split those characters. Here is the sample code:
```{r}
library(ChIPpeakAnno)
bed <- system.file("extdata", "MACS_output.bed", package="ChIPpeakAnno")
gr1 <- toGRanges(bed, format="BED", header=FALSE)
gff <- system.file("extdata", "GFF_peaks.gff", package="ChIPpeakAnno")
gr2 <- toGRanges(gff, format="GFF", header=FALSE, skip=3)
ol <- findOverlapsOfPeaks(gr1, gr2)
peakNames <- ol$peaklist[['gr1///gr2']]$peakNames
library(reshape2)
peakNames1 <- melt(peakNames, value.name="merged.peak.id")
peakNames1 <- cbind(peakNames1[, 1], do.call(rbind, strsplit(as.character(peakNames1[, 3]), "__")))
colnames(peakNames1) <- c("merged.peak.id", "group", "peakName")
head(peakNames1)
gr1.subset <- gr1[peakNames1[peakNames1[, "group"] %in% "gr1", "peakName"]]
gr2.subset <- gr2[peakNames1[peakNames1[, "group"] %in% "gr2", "peakName"]]
```

Here is an alternative way to access the original peak IDs.

```{r}
all.peaks <- ol$all.peaks
gr1.renamed <- all.peaks$gr1
gr2.renamed <- all.peaks$gr2
peakNames <- melt(ol$peaklist[['gr1///gr2']]$peakNames, value.name="merged.peak.id")
gr1.sub <- gr1.renamed[peakNames[grepl("^gr1", peakNames[, 3]), 3]]
gr2.sub <- gr2.renamed[peakNames[grepl("^gr2", peakNames[, 3]), 3]]
```

**Q**: How to select the proper number for totalTest in the function 
`makeVennDiagram`?

**A**: When we test the association between two sets of data based on 
hypergeometric distribution, the number of all potential binding sites is 
required. The parameter _totalTest_ in the function `makeVennDiagram` indicates 
how many potential peaks in total will be used in the hypergeometric test. 
It should be larger than the largest number of peaks in the peak list. 
The smaller it is set, the more stringent the test is. 
The time used to calculate p-value does not depend on the value of the 
_totalTest_. 
For practical guidance on how to choose _totalTest_, please refer to the 
[post](https://stat.ethz.ch/pipermail/bioconductor/2010-November/036540.html).