%\VignetteIndexEntry{ind1KG -- 1000 genomes data demo} %\VignetteDepends{Rsamtools, chopsticks} %\VignetteKeywords{Personal genomics} %\VignettePackage{ind1KG} % % NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % \documentclass[12pt]{article} \usepackage{amsmath,pstricks} % With MikTeX 2.9, using pstricks also requires using auto-pst-pdf or running % pdflatex will fail. Note that using auto-pst-pdf requires to set environment % variable MIKTEX_ENABLEWRITE18=t on Windows, and to set shell_escape = t in % texmf.cnf for Tex Live on Unix/Linux/Mac. \usepackage{auto-pst-pdf} \usepackage[authoryear,round]{natbib} \usepackage{hyperref} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \textwidth=6.2in \bibliographystyle{plainnat} \begin{document} %\setkeys{Gin}{width=0.55\textwidth} \title{Exploring data from the 1000 Genomes project in Bioconductor's \textit{ind1KG} package} \author{VJ Carey} \maketitle \tableofcontents \section{Overview} In this document we will look at high-coverage NGS data obtained on NA19240, because we have the HapMap phase II genotypes (4 mm SNP) for this individual in GGtools/hmyriB36, and we have an affy 6.0 SNP CEL file for this individual (and her cohort) as well. There are three main objectives discussed in this document. \begin{itemize} \item We describe how data published in the 1000 genomes (1KG) project can be imported for investigations using R. This involves the use of the \Rpackage{Rsamtools} package. We provide serialized instances of various relevant data elements so that large objects distributed from the project need not be redistributed for these illustrations. \item We describe how information on variants can be related to existing annotation using \Rpackage{rtracklayer} to check for events in regulatory regions, for example. \item We discuss how information in the samtools 'pileup' format can be checked from a statistical perspective to explore how `known' variants in the sample compare to the putatively newly discovered variants. \end{itemize} The reads examined in the document are all from the Illumina sequencing platform; additional work is sketched facilitating comparison with (released) read libraries based on 454 or ABI platforms. \section{External data acquisition} \subsection{Manual extraction of a multi-Mb chunk} We will focus on this individual's chromosome 6. We acquired \begin{verbatim} NA19240.chrom6.SLX.maq.SRP000032.2009_07.bam \end{verbatim} and the associated bai and bas files from \begin{verbatim} ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/data/ NA19240/alignment/NA19240.chrom6.SLX.maq.SRP000032.2009_07.bam.bas \end{verbatim} Note that it is possible to work with these files remotely in R, without moving them to the local machine, thanks to the remote access facilities built in to samtools and exposed in R. We use \begin{verbatim} samtools view \ NA19240.chrom6.SLX.maq.SRP000032.2009_07.bam |head -3000000 > na19240_3M.sam \end{verbatim} to obtain a parsable text file, presumably of 3 million reads that aligned, using MAQ, nearest the 5' end of the p arm of chr6. This is because we expect the bam file to be sorted. We picked the number 3 million out of thin air. This sam format file can be converted to bam format using the samtools import facility. We took chromosome 6 reference sequence from \begin{verbatim} ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/ reference/human_b36_female.fa.gz \end{verbatim} and indexed it and used \begin{verbatim} samtools import femchr6.fa.fai na19240_3M.sam na19240_3M.bam \end{verbatim} to generate the bam file. We imported this into R using Bioconductor's \Rpackage{Rsamtools} with a straight application of \Rfunction{scanBam}. The result is saved in the package as \Robject{n240}. <>= library(ind1KG) if (!exists("n240")) data(n240) @ This is a list of lists, and we check on the contents of the elements as follows: <>= names(n240[[1]]) @ We check the classes: <>= sapply(n240[[1]], class) @ We get a small number of exemplars from each element: <>= lapply(n240[[1]], head, 5) @ We can use R at this point to do matching to reference and filtering and so forth, but we will only do this in a \textit{post mortem} fashion, as it seems to make more sense to use samtools directly to do, for example, SNP calling. \subsection{Programmatic extraction of annotated regions} (This code segment suggested by Martin Morgan.) We can use the \Rpackage{GenomicFeatures} package to obtain intervals defining various genomic elements. %dir <- getwd() %fl <- file.path(dir, "NA19240.chrom6.SLX.maq.SRP000032.2009_07.bam") <>= library(GenomicFeatures) library(TxDb.Hsapiens.UCSC.hg18.knownGene) txdb = TxDb.Hsapiens.UCSC.hg18.knownGene txdb @ The \Rfunction{transcripts} method will obtain ranges of transcripts with constraints. <>= tx6 <- transcripts(txdb, vals = list(tx_chrom = "chr6")) chr6a <- head(unique(sort(ranges(tx6))), 50) chr6a @ %#fl = "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/data/NA19240/alignment/NA19240.chrom6.SLX.maq.SRP000032.2009_07.bam" With a local BAM file, the following counting procedure is quick. Note that \Robject{fl} could be a URL beginning \begin{verbatim} ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/data/NA19240/alignment... \end{verbatim} and the computations would work, but completion speed would depend upon server load and network throughput. <>= library(Rsamtools) p1 <- ScanBamParam(which=RangesList(`6`=chr6a)) fl <- "/mnt/data/stvjc/1000GENOMES/NA19240.chrom6.SLX.maq.SRP000032.2009_07.bam" unix.time(cnt <- countBam(fl, param=p1)) sum(cnt$records) # 2103439 @ The following scan will yield a list with read and quality information on the 50 transcript regions requested in \Robject{chr6a} allocated to 50 list elements. <>= res <- scanBam(fl, param=p1) length(res) names(res[[1]]) @ \section{Exploring a samtools pileup} The pileup output derived from the 3 million reads is a 17GB (sic) text file derived as follows: \begin{verbatim} samtools pileup -cf femchr6.fa \ NA19240.chrom6.SLX.maq.SRP000032.2009_07.bam > na19240F.pup \end{verbatim} First 10 lines: \begin{verbatim} 6 5001 G G 4 0 0 1 ^!. C 6 5002 A A 7 0 0 2 .^!. > 6 5008 T T 15 0 0 5 ...,. AAA9@ 6 5009 A A 15 0 0 5 ...,. @@@>? 6 5010 T T 17 0 0 6 ...,.^!. BAA8AC \end{verbatim} The total number of lines is not quite 200 million, so it might be handled directly in R on a reasonably sized machine. We have isolated the first 10 million records and restricted attention to those locations where the individual NA19240 differs from the reference sequence. <>= data(pup240_disc) head(pup240_disc, 5) @ Some of these variants are denoted with asterisk, suggesting evidence of deletion. We will omit these for now. There are also some non-nucleotide-valued markers, omitted. <>= pup240_disc <- pup240_disc[ pup240_disc$ref != "*", ] pup240_disc <- pup240_disc[ pup240_disc$ref %in% c("A", "C", "T", "G"), ] table(pup240_disc$indiv) @ How many of the calls that disagree with reference are present at locations not already identified as polymorphic by dbSNP? <>= data(c6snp) sum( !( pup240_disc$loc %in% c6snp$chrPosFrom ) ) @ How many of these possibly novel variants are sites of heterozygosity? <>= nov <- pup240_disc[ !( pup240_disc$loc %in% c6snp$chrPosFrom ), ] table(nov$indiv) @ \section{Checking samtools-based calls against other calls} \subsection{HapMap Phase II calls} We include information from the phase II HapMap calls for NA19240. We have a \Rclass{snp.matrix} instance with the full genotyping for chromosome 6 and location information as supplied by HapMap. <>= library(chopsticks) data(yri240_6) names(yri240_6) @ The following code gets all relevant HapMap calls in a generic format and isolates the SNP at which NA19240 is heterozygous. <>= snps <- as(yri240_6[[1]], "character") hets <- which(snps == "A/B") rshet <- colnames(snps)[hets] smet <- yri240_6[[2]] smethet <- smet[hets,] head(smethet, 5) @ We also have the full pileup information for the first 500K loci computed by samtools pileup. <>= data(pup240_500k) head(pup240_500k, 2) @ This include some duplicated locations, which we remove. <>= pup240_500ku <- pup240_500k[ !duplicated(pup240_500k[,1]),] @ The pileups at which HapMap says our subject is heterozygous are <>= hpup <- pup240_500ku[ pup240_500ku[,1] %in% smethet[,"Position"], ] @ Are there any loci (in this very small region of chromosome 6) that HapMap says are heterozygous, but that are found to be homozygous by sequencing? <>= hom <- hpup[ hpup[,2] == hpup[,3], ] hom smethet[ smethet[,"Position"] %in% hom[,1], ] @ \subsection{Affy SNP 6.0 chip calls} We ran \Rfunction{crlmm} to genotype all 90 YRI samples from 6.0 chips distributed by Affymetrix. The data for NA19240 chromosome 6 are available in the \Rpackage{ind1KG} package: <>= data(gw6c6.snp240) head(gw6c6.snp240, 4) @ The heterozygous loci are <>= hloc6 <- gw6c6.snp240[ gw6c6.snp240$call240 == 2, "physical_pos" ] @ Let's see if the sequencing leads to the same decisions (at least with regard to heterozygous vs. homozygous): <>= inds <- which(pup240_500k[,1] %in% hloc6) table(pup240_500k[inds, 3]) @ For the loci called homozygous by crlmm, we have: <>= oloc6 <- gw6c6.snp240[ gw6c6.snp240$call240 %in% c(1,3), "physical_pos" ] oinds <- which(pup240_500k[,1] %in% oloc6) table(pup240_500k[oinds, 3]) @ \section{Relating possibly novel variants to existing annotation} \subsection{Browser-based visualization} There are many ways to use Bioconductor annotation resources to learn about contexts of variants. However, the UCSC genome browser is probably the most efficient place to start. We can convert our vector of locations of apparently new variants to a browser track as follows; this code is not executed in vignette construction, but you may run it manually if suitably networked. <>= library(IRanges) nloc <- nov$loc nrd <- RangedData( IRanges(nloc, nloc) ) names(nrd) <- "chr6" library(rtracklayer) br <- browserSession("UCSC") br[["novo"]] <- nrd v1 <- browserView(br, GenomicRanges(1, 1e7, "chr6")) @ This arranges the browser so that the custom track at the top of the display, 'novo', gives the locations of the possible novel variants. Overall, we see that these novel variants occur regularly across the 10MB. \setkeys{Gin}{width=0.99\textwidth} \includegraphics{fullnovo} \clearpage We can zoom in to the region around a given gene, here SERPINEB6. \vspace*{5mm} \setkeys{Gin}{width=0.99\textwidth} \includegraphics{serpineB6novo} \subsection{Browser-based annotation extraction and comparison} Because the \Rpackage{rtracklayer} package gives a bidirectional interface, it is possible to programmatically check for coincidence of variant locations, gene regions, or regulatory elements, for example. We can learn the names of all available tracks for the current session via code like the following. <>= tn <- trackNames(br) grep("Genes", names(tn), value=TRUE) # many different gene sets tn["UCSC Genes"] # resolve indirection @ For example, to get the symbols for genes in the 10 million bp excerpt that we are working with, we can use <>= rsg <- track(br, "refGene") rsgdf <- as.data.frame(rsg) @ This data frame has been serialized with the \Rpackage{ind1KG} package. <>= data(rsgdf) names(rsgdf) rsgdf[1:3,1:7] @ We see that the 'names' here are RefSeq identifiers. We may be able to resolve them to Entrez Gene Ids, and thence to symbols, as follows: <>= library(org.Hs.eg.db) rsgn <- as.character(rsgdf$name) eid <- mget(rsgn, revmap(org.Hs.egREFSEQ), ifnotfound=NA) eid <- na.omit(unlist(eid)) sym <- mget(eid, org.Hs.egSYMBOL, ifnotfound=NA) head(unlist(sym), 10) @ These names are consistent with what we see on the browser displays shown above. We can use the \Rpackage{IRanges} infrastructure to check for intersection between novel variant locations and gene occupancy regions. <>= nloc <- nov$loc # this one is evaluated nranges <- IRanges(nloc, nloc) granges <- IRanges(rsgdf$start, rsgdf$end) # no guarantee of annotation length(nranges) length(granges) sum(nranges %in% granges) head(match(nranges,granges), 200) @ We can see that there is a batch of variants present in the first gene, and this is confirmed by checking the 1KG browser. \vspace*{5mm} \includegraphics{dusp22clust} \vspace*{5mm} Looking in more detail, we have \vspace*{5mm} \includegraphics{detailDUSP22} and this can be exploded into the Ensembl variant browser view \vspace*{5mm} \includegraphics{ensSNPdusp22} with textual metadata view \vspace*{5mm} \includegraphics{ensTextDUSP22} \clearpage So it seems DUSP22 resides over plenty of known SNP; our computations are supposed to reveal hitherto unknown variants in this region for this individual. \subsection{Exercises} \begin{enumerate} \item The \Robject{oregdf} \Rclass{data.frame} is supplied in \Rpackage{ind1KG}, containing information on regulatory elements annotated in oreganno. How many novel variants for NA19240 lie in oreganno regulatory regions? What types of regions are occupied? \item Derive a \Rclass{data.frame} for regions of nucleosome occupancy in our 10 Mb segment and check how many of the novel variants lie in such regions. \end{enumerate} \end{document}