--- title: "TCGAutils: Helper functions for working with TCGA datasets" author: "Marcel Ramos & Levi Waldron" date: "`r format(Sys.time(), '%B %d, %Y')`" output: BiocStyle::html_document: number_sections: yes toc: true vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{TCGAutils Essentials} %\VignetteEncoding{UTF-8} --- # Overview The `TCGAutils` package completes a suite of Bioconductor packages for convenient access, integration, and analysis of *The Cancer Genome Atlas*. It includes: 0. helpers for working with TCGA through the Bioconductor packages `r Biocpkg("MultiAssayExperiment")` (for coordinated representation and manipulation of multi-omits experiments) and `r Biocpkg("curatedTCGAData")`, which provides unrestricted TCGA data as `MultiAssayExperiment` objects, 0. helpers for importing TCGA data as from flat data structures such as `data.frame` or `DataFrame` read from delimited data structures provided by the Broad Institute’s Firehose, GenomicDataCommons, and 0. functions for interpreting TCGA barcodes and for mapping between barcodes and Universally Unique Identifiers (UUIDs). # Installation ```{r, eval=FALSE} if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("TCGAutils") ``` Required packages for this vignette: ```{r,include=TRUE,results="hide",message=FALSE,warning=FALSE} library(TCGAutils) library(curatedTCGAData) library(MultiAssayExperiment) library(RTCGAToolbox) library(BiocFileCache) library(rtracklayer) library(R.utils) ``` # `curatedTCGAData` utility functions Functions such as `getSubtypeMap` and `getClinicalNames` provide information on data inside a `r Biocpkg("MultiAssayExperiment")` object downloaded from `r Biocpkg("curatedTCGAData")`. `sampleTables` and `splitAssays` support useful operations on these `MultiAssayExperiment` objects. ## obtaining TCGA as `MultiAssayExperiment` objects from `curatedTCGAData` For demonstration we download part of the Colon Adenocarcinoma (COAD) dataset using`curatedTCGAData` via `ExperimentHub`. This command will download any data type that starts with `CN*` such as `CNASeq`: ```{r,include=TRUE,results="hide",message=FALSE,warning=FALSE} coad <- curatedTCGAData::curatedTCGAData(diseaseCode = "COAD", assays = c("CNASeq", "Mutation", "miRNA*", "RNASeq2*", "mRNAArray", "Methyl*"), dry.run = FALSE) ``` For a list of all available data types, use `dry.run = FALSE` and an asterisk `*` as the assay input value: ```{r} curatedTCGAData("COAD", "*") ``` ## `sampleTables`: what sample types are present in the data? The `sampleTables` function gives a tally of available samples in the dataset based on the TCGA barcode information. ```{r} sampleTables(coad) ``` For reference in interpreting the sample type codes, see the `sampleTypes` table: ```{r} data("sampleTypes") head(sampleTypes) ``` ## `splitAssays`: separate the data from different tissue types TCGA datasets include multiple -omics for solid tumors, adjacent normal tissues, blood-derived cancers and normals, and other tissue types, which may be mixed together in a single dataset. The `MultiAssayExperiment` object generated here has one patient per row of its `colData`, but each patient may have two or more -omics profiles by any assay, whether due to assaying of different types of tissues or to technical replication. `splitAssays` separates profiles from different tissue types (such as tumor and adjacent normal) into different assays of the `MultiAssayExperiment` by taking a vector of sample codes, and partitioning the current assays into assays with an appended sample code: ```{r} (tnmae <- splitAssays(coad, c("01", "11"))) ``` The `r Biocpkg("MultiAssayExperiment")` package then provides functionality to merge replicate profiles for a single patient (`mergeReplicates()`), which would now be appropriate but would **not** have been appropriate before splitting different tissue types into different assays, because that would average measurements from tumors and normal tissues. `MultiAssayExperiment` also defines the `MatchedAssayExperiment` class, which eliminates any profiles not present across all assays and ensures identical ordering of profiles (columns) in each assay. In this example, it will match tumors to adjacent normals in subsequent assays: ```{r} (matchmae <- as(tnmae[, , c(4, 6, 7)], "MatchedAssayExperiment")) ``` Only about 12 participants have both a matched tumor and solid normal sample. ## `getSubtypeMap`: manually curated molecular subtypes Per-tumor subtypes are saved in the `metadata` of the `colData` slot of `MultiAssayExperiment` objects downloaded from `curatedTCGAData`. These subtypes were manually curated from the supplemental tables of all primary TCGA publications: ```{r} getSubtypeMap(coad) ``` ## `getClinicalNames`: key "level 4" clinical & pathological data The `curatedTCGAData` `colData` contain hundreds of columns, obtained from merging all unrestricted levels of clinical, pathological, and biospecimen data. This function provides the names of "level 4" clinical/pathological variables, which are the only ones provided by most other TCGA analysis tools. Users may then use these variable names for subsetting or analysis, and may even want to subset the `colData` to only these commonly used variables. ```{r} getClinicalNames("COAD") ``` *Warning*: some names may not exactly match the `colData` names in the object due to differences in variable types. These variables are kept separate and differentiated with `x` and `y`. For example, `vital_status` in this case corresponds to two different variables obtained from the pipeline. One variable is interger type and the other character: ```{r} class(colData(coad)[["vital_status.x"]]) class(colData(coad)[["vital_status.y"]]) table(colData(coad)[["vital_status.x"]]) table(colData(coad)[["vital_status.y"]]) ``` Such conflicts should be inspected in this manner, and conflicts resolved by choosing the more complete variable, or by treating any conflicting values as unknown ("NA"). # Converting Assays to SummarizedExperiment This section gives an overview of the operations that can be performed on a given set of metadata obtained particularly from data-rich objects such as those obtained from `curatedTCGAData`. There are several operations that work with microRNA, methylation, mutation, and assays that have gene symbol annotations. ## `CpGtoRanges` Using the methylation annotations in `IlluminaHumanMethylation450kanno.ilmn12.hg19` and the `minfi` package, we look up CpG probes and convert to genomic coordinates with `CpGtoRanges`. The function provides two assays, one with mapped probes and the other with unmapped probes. Excluding unmapped probes can be done by setting the `unmapped` argument to `FALSE`. This will run for both types of methylation data (27k and 450k). ```{r} methcoad <- CpGtoRanges(coad) ``` ## `mirToRanges` microRNA assays obtained from `curatedTCGAData` have annotated sequences that can be converted to genomic ranges using the `mirbase.db` package. The function looks up all sequences and converts them to ('hg19') ranges. For those rows that cannot be found, an 'unranged' assay is introduced in the resulting MultiAssayExperiment object. ```{r} mircoad <- mirToRanges(coad) ``` ## `qreduceTCGA` The `qreduceTCGA` function converts `RaggedExperiment` mutation data objects to `RangedSummarizedExperiment` using `org.Hs.eg.db` and the `qreduceTCGA` utility function from `RaggedExperiment` to summarize 'silent' and 'non-silent' mutations based on a 'Variant_Classification' metadata column in the original object. It uses 'hg19' transcript database ('TxDb') package internally to summarize regions using `qreduceAssay`. The current genome build ('hg18') in the data must be translated to 'hg19'. In this example, we first translate the genome strings in the `RaggedExperiment` `COAD_Mutation-20160128` dataset using `translateBuild`. We do this for the entire vector by using `vapply`. ```{r} rag <- "COAD_Mutation-20160128" genome(coad[[rag]]) <- translateBuild(genome(coad[[rag]])) seqlevelsStyle(coad[[rag]]) <- "UCSC" genome(coad[[rag]]) ``` Now we use `liftOver` from `rtracklayer` to translate 'hg18' builds to 'hg19' using the downloaded chain file. ```{r} lifturl <- "http://hgdownload.cse.ucsc.edu/goldenpath/hg18/liftOver/hg18ToHg19.over.chain.gz" bfc <- BiocFileCache() qfile <- bfcquery(bfc, "18to19chain", exact = TRUE)[["rpath"]] cfile <- if (length(qfile) && file.exists(qfile)) { bfcquery(bfc, "18to19chain", exact = TRUE)[["rpath"]] } else { bfcadd(bfc, "18to19chain", lifturl) } chainfile <- file.path(tempdir(), gsub("\\.gz", "", basename(cfile))) R.utils::gunzip(cfile, destname = chainfile, remove = FALSE) chain <- suppressMessages( rtracklayer::import.chain(chainfile) ) ranges19 <- rtracklayer::liftOver(rowRanges(coad[[rag]]), chain) ``` The same can be done to convert `hg19` to `hg38` (the same build that the Genomic Data Commons uses) with the corresponding chain file: ```{r} liftchain <- "http://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/hg19ToHg38.over.chain.gz" bfc <- BiocFileCache() q38file <- bfcquery(bfc, "19to38chain", exact = TRUE)[["rpath"]] c38file <- if (length(q38file) && file.exists(q38file)) { bfcquery(bfc, "19to38chain", exact = TRUE)[["rpath"]] } else { bfcadd(bfc, "19to38chain", liftchain) } cloc38 <- file.path(tempdir(), gsub("\\.gz", "", basename(c38file))) R.utils::gunzip(c38file, destname = cloc38, remove = FALSE) chain38 <- suppressMessages( rtracklayer::import.chain(cloc38) ) ## then use the liftOver function using the 'chain38' object ## as above ranges38 <- rtracklayer::liftOver(unlist(ranges19), chain38) ``` This will give us a list of ranges, each element corresponding to a single row in the `RaggedExperiment`. We remove rows that had no matches in the `liftOver` process and replace the ranges in the original `RaggedExperiment` with the replacement method. Finally, we put the `RaggedExperiment` object back into the `MultiAssayExperiment`. ```{r} re19 <- coad[[rag]][as.logical(lengths(ranges19))] ranges19 <- unlist(ranges19) genome(ranges19) <- "hg19" rowRanges(re19) <- ranges19 # replacement coad[["COAD_Mutation-20160128"]] <- re19 rowRanges(re19) ``` Now that we have matching builds, we can finally run the `qreduceTCGA` function. ```{r} coad <- qreduceTCGA(coad, keep.assay = TRUE) ``` ## `symbolsToRanges` In the cases where row annotations indicate gene symbols, the `symbolsToRanges` utility function converts genes to genomic ranges and replaces existing assays with `RangedSummarizedExperiment` objects. Gene annotations are given as 'hg19' genomic regions. ```{r} symbolsToRanges(coad) ``` # Importing TCGA text data files to Bioconductor classes A few functions in the package accept either files or classes such as `data.frame` and `FirehoseGISTIC` as input and return standard Bioconductor classes. ## `makeGRangesListFromExonFiles` The `GenomicDataCommons` package can be used to obtain 'legacy' exon quantification files via: _Note_. File downloads disabled on Windows due to long file names. ```{r} library(GenomicDataCommons) queso <- files(legacy = TRUE) %>% filter( ~ cases.project.project_id == "TCGA-COAD" & data_category == "Gene expression" & data_type == "Exon quantification") gdc_set_cache(directory = tempdir()) ``` We then use `makeGRangesListFromExonFiles` to create a `GRangesList` from vectors of file paths. There are options to provide file names when file names are too long to download (Windows OS). The `nrows` argument only keeps the first 5 rows in each of the files read in due to invalid character exon ranges. ```{r, eval=FALSE} ## FALSE until gdcdata works qu <- manifest(queso) qq <- gdcdata(qu$id[1:4]) makeGRangesListFromExonFiles(qq, nrows = 4) ``` Note `GRangesList` objects must be converted to `r Biocpkg("RaggedExperiment")` class to incorporate them into a `MultiAssayExperiment`. ### Work around for long file names on Windows Due to file name length, Windows may not be able to read / display all files. The workaround uses the `fileNames` argument from a character vector of file names and will convert them to TCGA barcodes. ```{r} ## Load example file found in package pkgDir <- system.file("extdata", package = "TCGAutils", mustWork = TRUE) exonFile <- list.files(pkgDir, pattern = "cation\\.txt$", full.names = TRUE) exonFile ## We add the original file prefix to query for the UUID and get the ## TCGAbarcode filePrefix <- "unc.edu.32741f9a-9fec-441f-96b4-e504e62c5362.1755371." ## Add actual file name manually makeGRangesListFromExonFiles(exonFile, fileNames = paste0(filePrefix, basename(exonFile))) ``` ## `makeGRangesListFromCopyNumber` Other processed, genomic range-based data from TCGA data can be imported using `makeGRangesListFromCopyNumber`. This tab-delimited data file of copy number alterations from *bladder urothelial carcinoma* (BLCA) was obtained from the Genomic Data Commons and is included in `TCGAUtils` as an example: ```{r} grlFile <- system.file("extdata", "blca_cnaseq.txt", package = "TCGAutils") grl <- read.table(grlFile) head(grl) makeGRangesListFromCopyNumber(grl, split.field = "Sample") makeGRangesListFromCopyNumber(grl, split.field = "Sample", keep.extra.columns = TRUE) ``` ## `makeSummarizedExperimentFromGISTIC` This function is only used for converting the `FirehoseGISTIC` class of the `r Biocpkg("RTCGAToolbox")` package. It allows the user to obtain thresholded by gene data, probabilities and peak regions. ```{r} tempDIR <- tempdir() co <- getFirehoseData("COAD", clinical = FALSE, GISTIC = TRUE, destdir = tempDIR) selectType(co, "GISTIC") class(selectType(co, "GISTIC")) makeSummarizedExperimentFromGISTIC(co, "Peaks") ``` ## `mergeColData`: expanding the `colData` of a `MultiAssayExperiment` This function merges a `data.frame` or `DataFrame` into the `colData` of an existing `MultiAssayExperiment` object. It will match column names and row names to do a full merge of both data sets. This convenience function can be used, for example, to add subtype information available for a subset of patients to the `colData`. Here is a simplified example of adding a column to the `colData` `DataFrame`: ```{r} race_df <- DataFrame(race_f = factor(colData(coad)[["race"]]), row.names = rownames(colData(coad))) mergeColData(coad, race_df) ``` # Translating and interpreting TCGA identifiers ## Translation The TCGA project has generated massive amounts of data. Some data can be obtained with **U**niversally **U**nique **ID**entifiers (**UUID**) and other data with TCGA barcodes. The Genomic Data Commons provides a JSON API for mapping between UUID and barcode, but it is difficult for many people to understand. `TCGAutils` makes simple functions available for two-way translation between vectors of these identifiers. ### TCGA barcode to UUID Here we translate the first two TCGA barcodes of the previous copy-number alterations dataset to UUID: ```{r} (xbarcode <- head(colnames(coad)[["COAD_CNASeq-20160128_simplified"]], 4L)) barcodeToUUID(xbarcode) ``` ### UUID to TCGA barcode Here we have a known case UUID that we want to translate into a TCGA barcode. ```{r} UUIDtoBarcode("ae55b2d3-62a1-419e-9f9a-5ddfac356db4", from_type = "case_id") ``` In cases where we want to translate a known file UUID to the associated TCGA patient barcode, we can use `UUIDtoBarcode`. ```{r} UUIDtoBarcode("0001801b-54b0-4551-8d7a-d66fb59429bf", from_type = "file_id") ``` Translating aliquot UUIDs is also possible by providing a known aliquot UUID to the function and giving a `from_type`, "aliquot_ids": ```{r} UUIDtoBarcode("d85d8a17-8aea-49d3-8a03-8f13141c163b", from_type = "aliquot_ids") ``` Additional UUIDs may be supported in future versions. ### UUID to UUID We can also translate from file UUIDs to case UUIDs and vice versa as long as we know the input type. We can use the case UUID from the previous example to get the associated file UUIDs using `UUIDtoUUID`. Note that this translation is a one to many relationship, thus yielding a `data.frame` of file UUIDs for a single case UUID. ```{r} head(UUIDtoUUID("ae55b2d3-62a1-419e-9f9a-5ddfac356db4", to_type = "file_id")) ``` One possible way to verify that file IDs are matching case UUIDS is to browse to the Genomic Data Commons webpage with the specific file UUID. Here we look at the first file UUID entry in the output `data.frame`: https://portal.gdc.cancer.gov/files/0ff55a5e-6058-4e0b-9641-e3cb375ff214 In the page we check that the case UUID matches the input. ## Parsing TCGA barcodes Several functions exist for working with TCGA barcodes, the main function being `TCGAbarcode`. It takes a TCGA barcode and returns information about participant, sample, and/or portion. ```{r} ## Return participant barcodes TCGAbarcode(xbarcode, participant = TRUE) ## Just return samples TCGAbarcode(xbarcode, participant = FALSE, sample = TRUE) ## Include sample data as well TCGAbarcode(xbarcode, participant = TRUE, sample = TRUE) ## Include portion and analyte data TCGAbarcode(xbarcode, participant = TRUE, sample = TRUE, portion = TRUE) ``` ## Sample select Based on lookup table values, the user can select certain sample types from a vector of sample barcodes. Below we select "Primary Solid Tumors" from a vector of barcodes, returning a logical vector identifying the matching samples. ```{r} ## Select primary solid tumors TCGAsampleSelect(xbarcode, "01") ## Select blood derived normals TCGAsampleSelect(xbarcode, "10") ``` ## `data.frame` representation of barcode The straightforward `TCGAbiospec` function will take the information contained in the TCGA barcode and display it in `data.frame` format with appropriate column names. ```{r} TCGAbiospec(xbarcode) ``` # OncoPrint - oncoPrintTCGA We provide a convenience function that investigates metadata within `curatedTCGAData` objects to present a plot of molecular alterations within a paricular cancer. `MultiAssayExperiment` objects are required to have an identifiable '*Mutation*' assay (using text search). The `variantCol` argument identifies the mutation type column within the data. **Note**. Functionality streamlined from the `ComplexHeatmap` package. ```{r} oncoPrintTCGA(coad, matchassay = rag) ``` # Reference data The `TCGAutils` package provides several helper datasets for working with TCGA barcodes. ## `sampleTypes` As shown previously, the reference dataset `sampleTypes` defines sample codes and their sample types (see `?sampleTypes` for source url). ```{r} ## Obtained previously sampleCodes <- TCGAbarcode(xbarcode, participant = FALSE, sample = TRUE) ## Lookup table head(sampleTypes) ## Match codes found in the barcode to the lookup table sampleTypes[match(unique(substr(sampleCodes, 1L, 2L)), sampleTypes[["Code"]]), ] ``` Source: https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/sample-type-codes ## `clinicalNames` - Firehose pipeline clinical variables `clinicalNames` is a list of the level 4 variable names (the most commonly used clinical and pathological variables, with follow-ups merged) from each `colData` datasets in `curatedTCGAData`. Shipped `curatedTCGAData` `MultiAssayExperiment` objects merge additional levels 1-3 clinical, pathological, and biospecimen data and contain many more variables than the ones listed here. ```{r} data("clinicalNames") clinicalNames lengths(clinicalNames) ``` # `sessionInfo` ```{r} sessionInfo() ```