%\VignetteIndexEntry{ind1KG -- 1000 genomes data demo}
%\VignetteDepends{Rsamtools, chopsticks}
%\VignetteKeywords{Personal genomics}
%\VignettePackage{ind1KG}

%
% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%
\documentclass[12pt]{article}

\usepackage{amsmath,pstricks}
% With MikTeX 2.9, using pstricks also requires using auto-pst-pdf or running
% pdflatex will fail. Note that using auto-pst-pdf requires to set environment
% variable MIKTEX_ENABLEWRITE18=t on Windows, and to set shell_escape = t in
% texmf.cnf for Tex Live on Unix/Linux/Mac.
\usepackage{auto-pst-pdf}
\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}


\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}


\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

\textwidth=6.2in

\bibliographystyle{plainnat} 
 
\begin{document}
%\setkeys{Gin}{width=0.55\textwidth}

\title{Exploring data from the 1000 Genomes project in Bioconductor's \textit{ind1KG} package}
\author{VJ Carey}
\maketitle

\tableofcontents

\section{Overview}

In this document we will look at high-coverage NGS data obtained on NA19240,
because we have the HapMap phase II genotypes (4 mm SNP) for this individual
in GGtools/hmyriB36, and we have an affy 6.0 SNP CEL file for this individual
(and her cohort) as well.

There are three main objectives discussed in this document.
\begin{itemize}
\item We describe how data published in the 1000 genomes (1KG) project
can be imported for investigations using R. This involves the use of the
\Rpackage{Rsamtools} package. We provide serialized instances of various
relevant data elements so that large objects distributed from the project
need not be redistributed for these illustrations.
\item We describe how information on variants can be related to existing
annotation using \Rpackage{rtracklayer} to check for events in regulatory
regions, for example.
\item We discuss how information in the samtools 'pileup' format can be
checked from a statistical perspective to explore how `known' variants in
the sample compare to the putatively newly discovered variants.
\end{itemize}
The reads examined in the document are all from the Illumina sequencing
platform; additional work is sketched facilitating comparison with (released)
read libraries based on 454 or ABI platforms.

\section{External data acquisition}

\subsection{Manual extraction of a multi-Mb chunk}

We will focus on this individual's chromosome 6.
We acquired 
\begin{verbatim}
NA19240.chrom6.SLX.maq.SRP000032.2009_07.bam
\end{verbatim}
and the associated bai and bas files from 
\begin{verbatim}
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/data/
     NA19240/alignment/NA19240.chrom6.SLX.maq.SRP000032.2009_07.bam.bas
\end{verbatim}
Note that it is possible to work with these files remotely in R, without
moving them to the local machine, thanks to the remote access facilities
built in to samtools and exposed in R.

We use
\begin{verbatim}
samtools view \
    NA19240.chrom6.SLX.maq.SRP000032.2009_07.bam |head -3000000 > na19240_3M.sam
\end{verbatim}
to obtain a parsable text file, presumably of 3 million reads that aligned,
using MAQ, nearest the 5' end of the p arm of chr6. This is because we expect
the bam file to be sorted. We picked the number 3 million out of thin air.

This sam format file can be converted to bam format using the samtools import
facility. We took chromosome 6 reference sequence from

\begin{verbatim}
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/
                         reference/human_b36_female.fa.gz
\end{verbatim}

and indexed it and used
\begin{verbatim}
samtools import femchr6.fa.fai na19240_3M.sam na19240_3M.bam
\end{verbatim}
to generate the bam file.

We imported this into R using Bioconductor's \Rpackage{Rsamtools} with a
straight application of \Rfunction{scanBam}. The result is saved in the
package as \Robject{n240}.
<<getsam>>=
library(ind1KG)
if (!exists("n240"))
  data(n240)
@
This is a list of lists, and we check on the contents of the elements as
follows:
<<lksc>>=
names(n240[[1]])
@
We check the classes:
<<lkccc>>=
sapply(n240[[1]], class)
@
We get a small number of exemplars from each element:
<<lkddd>>=
lapply(n240[[1]], head, 5)
@
We can use R at this point to do matching to reference and filtering and so
forth, but we will only do this in a \textit{post mortem} fashion, as it seems
to make more sense to use samtools directly to do, for example, SNP calling.

\subsection{Programmatic extraction of annotated regions}

(This code segment suggested by Martin Morgan.)  

We can use the \Rpackage{GenomicFeatures} package to obtain intervals defining
various genomic elements.

%dir <- getwd()
%fl <- file.path(dir, "NA19240.chrom6.SLX.maq.SRP000032.2009_07.bam")

<<lkgf>>=
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg18.knownGene)
txdb = TxDb.Hsapiens.UCSC.hg18.knownGene
txdb
@

The \Rfunction{transcripts} method will obtain ranges of transcripts with
constraints.

<<lktr>>=
tx6 <- transcripts(txdb, vals = list(tx_chrom = "chr6"))
chr6a <- head(unique(sort(ranges(tx6))), 50)
chr6a
@

%#fl = "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/data/NA19240/alignment/NA19240.chrom6.SLX.maq.SRP000032.2009_07.bam"

With a local BAM file, the following counting procedure is quick.
Note that \Robject{fl} could be a URL beginning
\begin{verbatim}
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/data/NA19240/alignment...
\end{verbatim}
and the computations would work, but completion speed would depend upon server
load and network throughput.
<<getrs,eval=FALSE>>=
library(Rsamtools)
p1 <- ScanBamParam(which=RangesList(`6`=chr6a))
fl <- "/mnt/data/stvjc/1000GENOMES/NA19240.chrom6.SLX.maq.SRP000032.2009_07.bam"
unix.time(cnt <- countBam(fl, param=p1))
sum(cnt$records) # 2103439
@

The following scan will yield a list with read and quality information on the 50
transcript regions requested in \Robject{chr6a} allocated to 50 list elements. 
<<dosc,eval=FALSE>>=
res <- scanBam(fl, param=p1)
length(res)
names(res[[1]])
@

\section{Exploring a samtools pileup}

The pileup output derived from the 3 million reads is a 17GB (sic) text file
derived as follows:
\begin{verbatim}
samtools pileup -cf femchr6.fa \
      NA19240.chrom6.SLX.maq.SRP000032.2009_07.bam > na19240F.pup
\end{verbatim}
First 10 lines:
\begin{verbatim}
6       5001    G       G       4       0       0       1       ^!.     C
6       5002    A       A       7       0       0       2       .^!.    <B
6       5003    T       T       7       0       0       2       ..      @?
6       5004    C       C       14      0       0       4       ..^!.^!,        ??B0
6       5005    T       T       4       0       0       5       ...,^!G A?@6+
6       5006    T       T       15      0       0       5       ...,.   BA@<?
6       5007    A       A       15      0       0       5       ...,.   ?@=>>
6       5008    T       T       15      0       0       5       ...,.   AAA9@
6       5009    A       A       15      0       0       5       ...,.   @@@>?
6       5010    T       T       17      0       0       6       ...,.^!.        BAA8AC
\end{verbatim}
The total number of lines is not quite 200 million, so it might be handled
directly in R on a reasonably sized machine. We have isolated the first 10
million records and restricted attention to those locations where the
individual NA19240 differs from the reference sequence.
<<lknd>>=
data(pup240_disc)
head(pup240_disc, 5)
@

Some of these variants are denoted with asterisk, suggesting evidence of
deletion. We will omit these for now. There are also some non-nucleotide-valued
markers, omitted.
<<om>>=
pup240_disc <- pup240_disc[ pup240_disc$ref != "*", ]
pup240_disc <- pup240_disc[ pup240_disc$ref %in% c("A", "C", "T", "G"), ]
table(pup240_disc$indiv)
@

How many of the calls that disagree with reference are present at locations
not already identified as polymorphic by dbSNP?
<<lknd>>=
data(c6snp)
sum( !( pup240_disc$loc  %in% c6snp$chrPosFrom ) )
@

How many of these possibly novel variants are sites of heterozygosity?
<<lkh>>=
nov <- pup240_disc[ !( pup240_disc$loc  %in% c6snp$chrPosFrom ), ]
table(nov$indiv)
@

\section{Checking samtools-based calls against other calls}

\subsection{HapMap Phase II calls}

We include information from the phase II HapMap calls for NA19240.
We have a \Rclass{snp.matrix} instance with the full genotyping for
chromosome 6 and location information as supplied by HapMap.
<<lkd>>=
library(chopsticks)
data(yri240_6)
names(yri240_6)
@

The following code gets all relevant HapMap calls in a generic format
and isolates the SNP at which NA19240 is heterozygous.
<<lkhhh>>=
snps <- as(yri240_6[[1]], "character")
hets <- which(snps == "A/B") 
rshet <- colnames(snps)[hets]
smet <- yri240_6[[2]]
smethet <- smet[hets,]
head(smethet, 5)
@

We also have the full pileup information for the first 500K loci computed
by samtools pileup.
<<lkpi>>=
data(pup240_500k)
head(pup240_500k, 2)
@
This include some duplicated locations, which we remove.
<<lkdu>>=
pup240_500ku <- pup240_500k[ !duplicated(pup240_500k[,1]),]
@
The pileups at which HapMap says our subject is heterozygous are
<<hp>>=
hpup <- pup240_500ku[ pup240_500ku[,1] %in% smethet[,"Position"], ]
@
Are there any loci (in this very small region of chromosome 6) that HapMap
says are heterozygous, but that are found to be homozygous by sequencing?
<<lkhom>>=
hom <- hpup[ hpup[,2] == hpup[,3], ]
hom
smethet[ smethet[,"Position"] %in% hom[,1], ]
@

\subsection{Affy SNP 6.0 chip calls}

We ran \Rfunction{crlmm} to genotype all 90 YRI samples from 6.0 chips
distributed by Affymetrix. The data for NA19240 chromosome 6 are available
in the \Rpackage{ind1KG} package:
<<getd>>=
data(gw6c6.snp240)
head(gw6c6.snp240, 4)
@
The heterozygous loci are
<<lkloc>>=
hloc6 <- gw6c6.snp240[ gw6c6.snp240$call240 == 2, "physical_pos" ]
@
Let's see if the sequencing leads to the same decisions (at least with
regard to heterozygous vs. homozygous):
<<lkll>>=
inds <- which(pup240_500k[,1] %in% hloc6)
table(pup240_500k[inds, 3])
@
For the loci called homozygous by crlmm, we have:
<<lko>>=
oloc6 <- gw6c6.snp240[ gw6c6.snp240$call240 %in% c(1,3), "physical_pos" ]
oinds <- which(pup240_500k[,1] %in% oloc6)
table(pup240_500k[oinds, 3])
@


\section{Relating possibly novel variants to existing annotation}

\subsection{Browser-based visualization}

There are many ways to use Bioconductor annotation resources to learn about
contexts of variants. However, the UCSC genome browser is probably the most
efficient place to start. We can convert our vector of locations of
apparently new variants to a browser track as follows; this code is not
executed in vignette construction, but you may run it manually if suitably
networked.

<<dobr,eval=FALSE>>=
library(IRanges)
nloc <- nov$loc
nrd <- RangedData( IRanges(nloc, nloc) )
names(nrd) <- "chr6"
library(rtracklayer)
br <- browserSession("UCSC")
br[["novo"]] <- nrd
v1 <- browserView(br, GenomicRanges(1, 1e7, "chr6"))
@

This arranges the browser so that the custom track at the top of the display,
'novo', gives the locations of the possible novel variants.

Overall, we see that these novel variants occur regularly across the 10MB.

\setkeys{Gin}{width=0.99\textwidth}
\includegraphics{fullnovo}

\clearpage

We can zoom in to the region around a given gene, here SERPINEB6.

\vspace*{5mm}

\setkeys{Gin}{width=0.99\textwidth}
\includegraphics{serpineB6novo}

\subsection{Browser-based annotation extraction and comparison}

Because the \Rpackage{rtracklayer} package gives a bidirectional interface,
it is possible to programmatically check for coincidence of variant locations,
gene regions, or regulatory elements, for example.

We can learn the names of all available tracks for the current session via
code like the following.
<<lrnam, eval=FALSE, keep.source=TRUE>>=
tn <- trackNames(br)
grep("Genes", names(tn), value=TRUE) # many different gene sets
tn["UCSC Genes"]   # resolve indirection
@

For example, to get the symbols for genes in the 10 million bp excerpt that
we are working with, we can use
<<doal,eval=FALSE>>=
rsg <- track(br, "refGene")
rsgdf <- as.data.frame(rsg)
@
This data frame has been serialized with the \Rpackage{ind1KG} package.
<<lkrs>>=
data(rsgdf)
names(rsgdf)
rsgdf[1:3,1:7]
@

We see that the 'names' here are RefSeq identifiers. We may be able to
resolve them to Entrez Gene Ids, and thence to symbols, as follows:
<<lkres>>=
library(org.Hs.eg.db)
rsgn <- as.character(rsgdf$name)
eid <- mget(rsgn, revmap(org.Hs.egREFSEQ), ifnotfound=NA)
eid <- na.omit(unlist(eid))
sym <- mget(eid, org.Hs.egSYMBOL, ifnotfound=NA)
head(unlist(sym), 10)
@
These names are consistent with what we see on the browser displays
shown above.

We can use the \Rpackage{IRanges} infrastructure to check for intersection
between novel variant locations and gene occupancy regions.
<<zzz,keep.source=TRUE>>=
nloc <- nov$loc  # this one is evaluated
nranges <- IRanges(nloc, nloc)
granges <- IRanges(rsgdf$start, rsgdf$end) # no guarantee of annotation
length(nranges)
length(granges)
sum(nranges %in% granges)
head(match(nranges,granges), 200)
@
We can see that there is a batch of variants present in the first gene, and
this is confirmed by checking the 1KG browser.

\vspace*{5mm}

\includegraphics{dusp22clust}

\vspace*{5mm}
Looking in more detail, we have

\vspace*{5mm}
\includegraphics{detailDUSP22}

and this can be exploded into the Ensembl variant browser view

\vspace*{5mm}
\includegraphics{ensSNPdusp22}

with textual metadata view

\vspace*{5mm}
\includegraphics{ensTextDUSP22}

\clearpage

So it seems DUSP22 resides over plenty of known SNP; our computations are
supposed to reveal hitherto unknown variants in this region for this
individual.

\subsection{Exercises}
\begin{enumerate}
\item The \Robject{oregdf} \Rclass{data.frame} is supplied in
\Rpackage{ind1KG}, containing information on regulatory elements annotated
in oreganno. How many novel variants for NA19240 lie in oreganno regulatory
regions? What types of regions are occupied?
\item Derive a \Rclass{data.frame} for regions of nucleosome occupancy in our
10 Mb segment and check how many of the novel variants lie in such regions.
\end{enumerate}


\end{document}