---
title: "TxRegInfra: support for TxRegQuery"
author: "Vincent J. Carey, stvjc at channing.harvard.edu"
date: "`r format(Sys.time(), '%B %d, %Y')`"
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{TxRegInfra -- classes and methods for TxRegQuery}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document:
    highlight: pygments
    number_sections: yes 
    theme: united
    toc: yes 
---

```{r setup,echo=FALSE,results="hide", eval=TRUE}
suppressPackageStartupMessages({
library(TxRegInfra)
library(GenomicFiles)
library(TFutils)
})
```

# Introduction

TxRegQuery addresses exploration of transcriptional regulatory networks
by integrating data on eQTL, digital genomic footprinting (DGF), DnaseI
hypersensitivity binding data (DHS), and transcription
factor binding site (TFBS) data.  Owing to the volume of emerging tissue-specific
data, special data modalities are used.

# Managing bed file content with mongodb

## Querying the `txregnet` database

We have a long-running server that will respond to queries.
We focus on `r CRANpkg("mongolite")` as the interface.

### The connection 

```{r lkmong, eval=TRUE}
suppressPackageStartupMessages({
library(TxRegInfra)
library(mongolite)
library(Gviz)
library(EnsDb.Hsapiens.v75)
library(BiocParallel)
register(SerialParam())
})
con1 = mongo(url=URL_txregInAWS(), db="txregnet")
con1
```
We will write methods that work with the 'fields' of this object.

There is not much explicit reflectance in the mongolite API.
The following is improvised and may be fragile:
```{r lkpar, eval=TRUE}
parent.env(con1)$orig
```

### Queries and aggregation

If the `mongo`
utility is available as a system
command, we can get a list of collections in the database
as follows.
```{r getl, eval=TRUE}
if (verifyHasMongoCmd()) {
  head(c1 <- listAllCollections(url=URL_txregInAWS(), db="txregnet"))
  }
```
Otherwise, as long as `r CRANpkg("mongolite")` is installed,
as long as we know the collection names of interest, we
can use them as noted throughout this vignette.

We can get a record from a given collection:
```{r getl2, eval=TRUE}
mongo(url=URL_txregInAWS(), db="txregnet", 
   collection="Adipose_Subcutaneous_allpairs_v7_eQTL")$find(limit=1)
```
Queries can be composed using JSON.  We have a tool
to generate queries that employ the mongodb aggregation
method.  Here we demonstrate this by computing, for each
chromosome, the count and
minimum values of the footprint statistic on CD14 cells.

```{r doagg, eval=TRUE}
m1 = mongo(url = URL_txregInAWS(), db = "txregnet",  collection="CD14_DS17215_hg19_FP")
newagg = makeAggregator( by="chr", vbl="stat", op="$min", opname="min")
```
The JSON layout of this aggregating query is
```
[
  {
    "$group": {
      "_id": ["$chr"],
      "count": {
        "$sum": [1]
      },
      "min": {
        "$min": ["$stat"]
      }
    }
  }
] 
```
Invocation returns a data frame:
```{r lkagggg, eval=TRUE}
head(m1$aggregate(newagg))
```


# An integrative container

We need to bind the metadata and information about the mongodb.

## Sample metadata

The following turns a very ad hoc filtering of the collection names
into a DataFrame.

```{r getcold, eval=TRUE}
# cd = makeColData() # works when mongo does
cd = TxRegInfra::basicColData
head(cd,2)
```

## Extended RaggedExperiment

```{r domor1, eval=TRUE}
rme0 = RaggedMongoExpt(con1, colData=cd)
rme1 = rme0[, which(cd$type=="FP")]
```

A key method in development is subsetting the archive by genomic coordinates.

```{r lksb, cache=TRUE, eval=TRUE}
si = GenomeInfoDb::Seqinfo(genome="hg19")["chr17"] # to fix query genome
myg = GRanges("chr17", IRanges(38.07e6,38.09e6), seqinfo=si)
s1 = sbov(rme1, myg, simplify=FALSE)
s1
dim(sa <- sparseAssay(s1, 3))  # compact gives segfault
sa[953:956,c("fLung_DS14724_hg19_FP", "fMuscle_arm_DS17765_hg19_FP")]
```

# Visualizing coincidence

```{r mym, eval=TRUE}
ormm = txmodels("ORMDL3", plot=FALSE, name="ORMDL3")
sar = strsplit(rownames(sa), ":|-")
an = as.numeric
gr = GRanges(seqnames(ormm)[1], IRanges(an(sapply(sar,"[", 2)), an(sapply(sar,"[", 3))))
gr1 = gr
gr1$score = 1-sa[,1]
gr2 = gr
gr2$score = 1-sa[,2]
sc1 = DataTrack(gr1, name="Lung FP")
sc2 = DataTrack(gr2, name="Musc/Arm FP")
plotTracks(list(GenomeAxisTrack(), sc1, sc2, ormm), showId=TRUE)
```

# Higher-level work with `sbov`

## Building annotated GRanges for a selected target interval
We begin with three 'single-concept' assays with relevance
to lung genomics.  The v7 GTEx lung eQTL data, an encode DnaseI
narrowPeak report on lung fibroblasts, and a digital genomic
footprint report for fetal lung.
```{r lksbovs}
lname_eqtl = "Lung_allpairs_v7_eQTL"
lname_dhs = "ENCFF001SSA_hg19_HS" # see dnmeta, fibroblast of lung
lname_fp = "fLung_DS14724_hg19_FP"
si17 = GenomeInfoDb::Seqinfo(genome="hg19")["chr17"]
si17n = si17
GenomeInfoDb::seqlevelsStyle(si17n) = "NCBI"
s1 = sbov(rme0[,lname_eqtl], GRanges("17", IRanges(38.06e6, 38.15e6),
    seqinfo=si17n))
s2 = sbov(rme0[,lname_dhs], GRanges("chr17", IRanges(38.06e6, 38.15e6),
   seqinfo=si17))
s3 = sbov(rme0[,lname_fp], GRanges("chr17", IRanges(38.06e6, 38.15e6),
   seqinfo=si17))
```
Now we have annotated GRanges for each assay.  The eQTL data
in part are:
```{r lkeeee}
names(mcols(s1))
head(s1[, c("gene_id", "variant_id", "maf", "pval_nominal")])
```
The names of genes and variants used here are cumbersome -- symbols
and rsids are preferable.
```{r doadd}
addsyms = function(x, EnsDb=EnsDb.Hsapiens.v75::EnsDb.Hsapiens.v75) {
  ensids = gsub("\\..*", "", x$gene_id) # remove post period
  gns = genes(EnsDb)
  x$symbol = gns[ensids]$symbol
  x
}
s1 = addsyms(s1)
```
Note that it is possible to retrieve rsids for the SNPs
by address.  But this is a slow operation involving a huge
SNPlocs package that we do not want to work with directly
for this vignette.
```
> snpsByOverlaps(SNPlocs.Hsapiens.dbSNP144.GRCh37, s1b)
UnstitchedGPos object with 265 positions and 2 metadata columns:
        seqnames       pos strand |   RefSNP_id alleles_as_ambig
           <Rle> <integer>  <Rle> | <character>      <character>
    [1]       17  38061054      * |  rs36049276                R
    [2]       17  38061439      * |   rs4795399                Y
    [3]       17  38062196      * |   rs2305480                R
    [4]       17  38062217      * |   rs2305479                Y
    [5]       17  38062503      * |  rs35104165                Y
    ...      ...       ...    ... .         ...              ...
  [261]       17  38149258      * |  rs58212353                K
  [262]       17  38149350      * |   rs8073254                V
  [263]       17  38149411      * |  rs34648856                R
  [264]       17  38149724      * |   rs3785549                Y
  [265]       17  38149727      * |   rs3785550                H
  -------
  seqinfo: 25 sequences (1 circular) from GRCh37.p13 genome
```

## A bipartite graph for eQTL-gene relationships

The object `s1` computed above is available as
`demo_eQTL_granges`.  We convert it to a graph via
```{r lkgr}
library(graph)
g1 = sbov_to_graphNEL(demo_eQTL_granges)
g1
```
Nodes are SNPs and genes, edges are present when
the resource (in this case the GTEx lung study)
declares an association (in this case, an FDR for
SNP-gene association not exceeding 0.10.)  The
`r Biocpkg("graph")` library includes functions
for creation of incidence matrices from graphs, and
vice versa.

## Connecting eQTL-SNPs via DHS and DGF

Given the GRanges representations for `sbov` results,
we can use overlap computations to conveniently
identify relationships between eQTL SNPs, genes,
and hypersensitivity or footprint regions.

We use `sbov_output_HS` as a persistent instance of
`s2` computed above.

```{r doov}
seqlevelsStyle(demo_eQTL_granges) = "UCSC"
fo1 = findOverlaps(demo_eQTL_granges, sbov_output_HS)
fo1 
eq_by_hs = split(demo_eQTL_granges[queryHits(fo1)],
   subjectHits(fo1))
eq_by_hs
```

This shows that there are two DHS sites that overlap
with SNPs showing eQTL associations with various genes.

For the footprint data, we have:
```{r doov2}
fo2 = findOverlaps(demo_eQTL_granges, sbov_output_FP)
fo2 
eq_by_fp = split(demo_eQTL_granges[queryHits(fo2)],
   subjectHits(fo2))
eq_by_fp
```

## Relationships to FIMO-based TFBS

We have a small number of cloud-resident FIMO search
results through the `r Biocpkg("TFutils")` package.

```{r dotfs}
library(TFutils)
data(demo_fimo_granges)
seqlevelsStyle(demo_eQTL_granges) = "UCSC"
lapply(demo_fimo_granges, lapply, function(x) 
   subsetByOverlaps(demo_eQTL_granges, x))
```