VariantExperiment 1.2.0
This vignette is about the conversion methods and statistical
functions that are enabled on VariantExperiment objects, for the
analysis of genotyping or DNA-seq data sets. If you want to learn more
about the implementation of the VariantExperiment class, and basic
methods, please refer to the other vignette.
The package of VariantExperiment is implemented to represent VCF/GDS
files using standard SummarizedExperiment metaphor. It is a container
for high-through genetic/genomic data with GDS back-end, and is
interoperable with the statistical functions/methods that are
implemented in SeqArray and SeqVarTools that are designed for GDS
data. The SummarizedExperiment metaphor also gets the benefit of
common manipulations within Bioconductor ecosystem that are more
user-friendly.
First, we load the package into R session.
library(VariantExperiment)To take advantage of the functions and methods that are defined on
SummarizedExperiment, from which the VariantExperiment extends, we
have defined coercion methods from VCF and GDS to
VariantExperiment.
VCF to VariantExperimentThe coercion function of makeVariantExperimentFromVCF could
convert the VCF file directly into VariantExperiment object. To
achieve the best storage efficiency, the assay data are saved in
GDSArray format, and the annotation data are saved in
DelayedDataFrame format (with no option of ordinary DataFrame),
which could be retrieved by rowData() for feature related
annotations and colData() for sample related annotations (Only when
sample.info argument is specified).
vcf <- SeqArray::seqExampleFileName("vcf")
ve <- makeVariantExperimentFromVCF(vcf)
veInternally, the VCF file was converted into a on-disk GDS file,
which could be retrieved by:
gdsfile(ve)assay data is in GDSArray format:
assay(ve, 1)feature-related annotation is in DelayedDataFrame format:
rowData(ve)User could also have the opportunity to save the sample related
annotation info directly into the VariantExperiment object, by
providing the file path to the sample.info argument, and then
retrieve by colData().
sampleInfo <- system.file("extdata", "Example_sampleInfo.txt",
                          package="VariantExperiment")
ve <- makeVariantExperimentFromVCF(vcf, sample.info = sampleInfo)
colData(ve)Arguments could be specified to take only certain info columns or format columns from the vcf file.
ve1 <- makeVariantExperimentFromVCF(vcf, info.import=c("OR", "GP"))
rowData(ve1)In the above example, only 2 info entries (“OR” and “GP”) are read
into the VariantExperiment object.
The start and count arguments could be used to specify the start
position and number of variants to read into Variantexperiment
object.
ve2 <- makeVariantExperimentFromVCF(vcf, start=101, count=1000)
ve2For the above example, only 1000 variants are read into the
VariantExperiment object, starting from the position of 101.
GDS to VariantExperimentThe coercion function of makeVariantExperimentFromGDS coerces
GDS files into VariantExperiment objects directly, with the assay
data saved as GDSArray, and the rowData()/colData() in
DelayedDataFrame by default (with the option of ordinary DataFrame
object).
gds <- SeqArray::seqExampleFileName("gds")
ve <- makeVariantExperimentFromGDS(gds)
verowData(ve)
colData(ve)Arguments could be specified to take only certain annotation columns
for features and samples. All available data entries for
makeVariantExperimentFromGDS arguments could be retrieved by the
showAvailable() function with the gds file name as input.
showAvailable(gds)Note that the infoColumns from gds file will be saved as columns
inside the rowData(), with the prefix of
“info_”. rowDataOnDisk/colDataOnDisk could be set as FALSE to
save all annotation data in ordinary DataFrame format.
ve3 <- makeVariantExperimentFromGDS(gds,
                                    rowDataColumns = c("ID", "ALT", "REF"),
                                    infoColumns = c("AC", "AN", "DP"),
                                    rowDataOnDisk = TRUE,
                                    colDataOnDisk = FALSE)
rowData(ve3)  ## DelayedDataFrame object 
colData(ve3)  ## DataFrame objectVariantExperiment supports basic subsetting operations using [,
[[, $, and ranged-based subsetting operations using
subsetByOverlap.
NOTE that after a set of lazy operations, you need to call
saveVariantExperiment function to synchronize the on-disk file
associated with the in-memory representation by providing a file
path. Statistical functions could only work on synchronized
VariantExperiment object, or error will return.
Refer to the “VariantExperiment-class” vignette for more details.
Many statistical functions and methods are defined on
VariantExperiment objects, most of which has their generic defined
in Bioconductor package of SeqArray and SeqVarTools. These
functions could be called directly on VariantExperiment object as
input, with additional arguments to specify based on user’s need. More
details please refer to the vignettes of SeqArray and
SeqVarTools.
Here is a list of the statistical functions with brief description:
| statistical functions | Description | 
|---|---|
| seqAlleleFreq | Calculates the allele frequencies | 
| seqAlleleCount | Calculates the allele counts | 
| seqMissing | Calculates the missing rate for variant/sample | 
| seqNumAllele | Calculates the number of alleles (for ref/alt allele) | 
| hwe | Exact test for Hardy-Weinberg equilibrium on Single-Nucleotide Variants | 
| inbreedCoeff | Calculates the inbreeding coefficient by variant/sample | 
| pca | Calculates the eigenvalues and eignevectors with Principal Component Analysis | 
| titv | Calculate transition/transversion ratio overall or by sample | 
| refDosage | Calculate the dosage of reference allele (matrix with integers of 0/1/2) | 
| altDosage | Calculate the dosage of alternative allele (matrix with integers of 0/1/2) | 
| countSingletons | Count singleton variants for each sample | 
| heterozygosity | Calculate heterozygosity rate by sample or by variants | 
| homozygosity | Calculate homozygosity rate by sample or by variants | 
| meanBySample | Calculate the mean value of a variable by sample over all variants | 
| isSNV | Flag a single nucleotide variant | 
| isVariant | Locate which samples are variant for each site | 
Here are some examples in calculating the sample missing rate, hwe, titv ratio and the count of singletons for each sample.
## sample missing rate
mr.samp <- seqMissing(ve, per.variant = FALSE)
head(mr.samp)
## hwe
hwe <- hwe(ve)
head(hwe)
## titv ratio by sample / overall
titv <- titv(ve, by.sample=TRUE)
head(titv)
titv(ve, by.sample=FALSE)
## countSingletons
countSingletons(ve)As we have noted in the other vignette, after the subsetting by
[, $ or Ranged-based operations, we need to save the new
VariantExperiment object to synchronize the gds file (on-disk)
associated with the subset of data (in-memory representation) before
any statistical analysis. Otherwise, an error will be returned.
As a feature addition, we want to add the option of VCFArray in
saving the assay data in the step of
makeVariantExperimentFromVCF. We also seek to implement the
SQLDataFrame in representation of the annotation data. We also
plan to connect Bioconductor package VariantAnnotation to
implement the variant filtering and annotation functions based on
VariantExperiment format, and with that, to develop a pipeline for
using VariantExperiment object as the basic data structure for
DNA-sequencing data analysis.
sessionInfo()## R version 4.0.0 (2020-04-24)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.11-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.11-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] VariantExperiment_1.2.0     DelayedDataFrame_1.4.0     
##  [3] GDSArray_1.8.0              gdsfmt_1.24.0              
##  [5] SummarizedExperiment_1.18.0 DelayedArray_0.14.0        
##  [7] matrixStats_0.56.0          Biobase_2.48.0             
##  [9] GenomicRanges_1.40.0        GenomeInfoDb_1.24.0        
## [11] IRanges_2.22.0              S4Vectors_0.26.0           
## [13] BiocGenerics_0.34.0         BiocStyle_2.16.0           
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.0.0       xfun_0.13              purrr_0.3.4           
##  [4] splines_4.0.0          lattice_0.20-41        generics_0.0.2        
##  [7] vctrs_0.2.4            GWASExactHW_1.01       htmltools_0.4.0       
## [10] yaml_2.2.1             mgcv_1.8-31            rlang_0.4.5           
## [13] pillar_1.4.3           glue_1.4.0             GenomeInfoDbData_1.2.3
## [16] lifecycle_0.2.0        stringr_1.4.0          zlibbioc_1.34.0       
## [19] Biostrings_2.56.0      evaluate_0.14          knitr_1.28            
## [22] SeqArray_1.28.0        broom_0.5.6            Rcpp_1.0.4.6          
## [25] backports_1.1.6        BiocManager_1.30.10    XVector_0.28.0        
## [28] digest_0.6.25          stringi_1.4.6          bookdown_0.18         
## [31] dplyr_0.8.5            grid_4.0.0             tools_4.0.0           
## [34] bitops_1.0-6           magrittr_1.5           RCurl_1.98-1.2        
## [37] tibble_3.0.1           mice_3.8.0             tidyr_1.0.2           
## [40] crayon_1.3.4           SNPRelate_1.22.0       pkgconfig_2.0.3       
## [43] ellipsis_0.3.0         Matrix_1.2-18          assertthat_0.2.1      
## [46] rmarkdown_2.1          logistf_1.23           R6_2.4.1              
## [49] SeqVarTools_1.26.0     nlme_3.1-147           compiler_4.0.0