Modelling methylation profiles

Andreas C. Kapourani1* and Guido Sanguinetti1**

1School of Informatics, University of Edinburgh, UK

*c.a.kapourani@ed.ac.uk or kapouranis.andreas@gmail.com
**G.Sanguinetti@ed.ac.uk

2 May 2019

Abstract

Learn methylation profiles using basis function regression for different observation models. The inferred profiles can be used for downstream analysis, such as prediciting gene expression levels or clustering regions with similar methylation profiles. This vignette focuses on the functionality and the usage of the BPRMeth package. Mathematical details and modelling assumptions can be found in [2] and on the following link http://rpubs.com/cakapourani

Package

BPRMeth 1.10.0

1 Installation

## try http:// if https:// URLs are not supported
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("BPRMeth")

## Or download from Github repository
# install.packages("devtools")
devtools::install_github("andreaskapou/BPRMeth", build_vignettes = TRUE)

2 Introduction

DNA methylation is probably the best studied epigenomic mark, due to its well established heritability and widespread association with diseases. Yet its role in gene regulation, and the molecular mechanisms underpinning its association with diseases, are still imperfectly understood. While the methylation status of individual cytosines can sometimes be informative, several recent papers have shown that the functional role of DNA methylation is better captured by a quantitative analysis of the spatial variation of methylation across a genomic region.

The BPRMeth package is a probabilistic method to quantify explicit features of methylation profiles, in a way that would make it easier to formally use such profiles in downstream modelling efforts, such as predicting gene expression levels or clustering genomic regions according to their methylation profiles. The original implementation [1] has now been enhanced in two important ways: we introduced a fast, variational inference approach which enables the quantification of Bayesian posterior confidence measures on the model, and we adapted the method to use several observation models, making it suitable for a diverse range of platforms including single-cell analyses and methylation arrays. Technical details of the updated version are explained in [2].

In addition to being a flexible tool for methylation data, the proposed framework is in principle deployable to other measurements with a similar structure, and indeed the method was already used for single-cell chromatin accessibility data in [3].

3 Background

Mathematically, BPRMeth is based on a basis function generalised linear model. The basic idea is as follows: the methylation profile associated to a genomic region \(D\) is defined as a (latent) function \(f\colon D\rightarrow (0,1)\) which takes as input the genomic coordinate along the region and returns the propensity for that locus to be methylated. In order to enforce spatial smoothness, and to obtain a compact representation for this function in terms of interpretable features, we represent the profile function as a linear combination of basis functions

\[ f(x)=\Phi\left(\mathbf{w}^Th(x)\right) \]

where \(h(x)\) are the basis functions (Gaussian bells by default), and \(\Phi\) is a probit transformation (Gaussian cumulative distribution function) needed in order to map the function output to the \((0,1)\) interval. The latent function is observed at specific loci through a noise model which encapsulates the experimental technology.

Figure 1: Illustration of the process for infering methylation profiles using the BPRMeth package

The optimal weight parameters \(\mathbf{w}\) can be recovered either by Bayesian inference or maximum likelihood estimation, providing a set of quantitative features which can be used in downstream models: in [1] these features were used in a machine learning predictor of gene expression, and to cluster genomic regions according to their methylation profiles. Modelling details and mathematical derivations for the different models can be found online: http://rpubs.com/cakapourani.

4 Analysis Pipeline

The workflow diagram and functionalities of the BPRMeth package for analysis of methylation profiles are shown in Figure 2.

Figure 2: Workflow diagram and functionalities of the BPRMeth package

4.1 Sample data

To illustrate the functionality of the BPRMeth package we will use real datasets that are publicly available from the ENCODE project consortium [4]. More specifically we will focus on the K562 immortalized cell line, with GEO: GSE27584 for the bulk RRBS data and GEO: GSE33480 for the RNA-Seq data. We will use directly the preprocessed files, where we have kept only the protein coding genes, and for the purpose of this vignette we focus only on chr12 and chr13. Full details of where to download the data and how to preprocess them are included in the Supplementary Material [1].

4.2 Load preprocessed data

Due to its general approach, BPRMeth can be used for a wide range of technologies such as array based, bulk sequencing (both RRBS and WGBS) and single cell sequencing methylation datasets. It is only required that the data are in the right format and we call the appropriate observation model depending on the type of technology. Since the encode data are generated from bulk RRBS experiments, the observation model for the BPRMeth package will be Binomial.

4.2.1 Methylation data

The BPRMeth package provides methods for reading files for methylation, annotation and expression data, however they do not cover all possible data formats available for describing biological datasets. The user can implement his own methods for reading files with different formats, provided that he can create an object similar to what is described below. First, we load the preprocessed methylation data as follows:

library(BPRMeth)  # Load package
met_region <- encode_met

The met_region object contains the following important slots:

met: A list containing the methylation region, where each element of the list is a different genomic region. More specifically, each methylation promoter region is an \(L_{i} \times 3\) dimensional matrix, where \(L_{i}\) denotes the number of CpGs found in region \(i\). The columns contain the following information:
- 1st column: Contains the locations of CpGs relative to central location. Note that the actual locations are scaled to the (-1, 1) region.
- 2nd column: Contains the total reads of each CpG.
- 3rd column: Contains the methylated reads each CpG.
anno: Corresponding annotation data for each genomic region as GRanges object. The anno slot is only required for downstream analysis, e.g. to predict gene expression levels from methylation profiles, since we need to map the genomic regions with the gene IDs.

Now let’s observe the format of the met_region object. For example, we can access the \(20^{th}\) promoter methylation region as follows

head(met_region$met[[20]])

##         [,1] [,2] [,3]
## [1,] -0.8787   42    1
## [2,] -0.8779   42    0
## [3,] -0.8770   42    2
## [4,] -0.8759   42    0
## [5,] -0.8754   42    0
## [6,] -0.8667    9    1

Below are the annotation data

head(met_region$anno)

## GRanges object with 6 ranges and 2 metadata columns:
##       seqnames          ranges strand |              id    centre
##          <Rle>       <IRanges>  <Rle> |     <character> <numeric>
##   [1]    chr12   562529-576529      + | ENSG00000139044    569529
##   [2]    chr12   745147-759147      + | ENSG00000177406    752147
##   [3]    chr12 1051888-1065888      - | ENSG00000002016   1058888
##   [4]    chr12 1696331-1710331      - | ENSG00000171823   1703331
##   [5]    chr12 2020870-2034870      - | ENSG00000151062   2027870
##   [6]    chr12 2897118-2911118      + | ENSG00000004478   2904118
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

and the total number of genomic regions are

NROW(met_region$anno)

## [1] 339

4.2.2 Read methylation and annotation data (Do not run)

In the previous section for simplicity we provided a pre-processed object which we then can directly use for inferring methylation profiles. The steps required to create this object are really simple and follow the diagram in Figure 2. First we need to read the methylation data using the read_met function. Also we need read annotation data using the read_anno file. Note that the annotation file can contain any genomic context: from promoters and gene bodies to enhancers, Nanog regulatory regions and CTCF regions; hence BPRMeth can be used for a plethora of analyses that want to take spatial genomic correlations into account. Finally, the create_region_object will create the methylation regions object which is the main object for storing methylation data. The following lines of code provide an example of creating this object ( Do not run )

# Read bulk methylation data
met_dt <- read_met(file = "met_file", type = "bulk_seq")
# Read annotation file, which will create annotation regions as well
anno_dt <- read_anno(file = "anno_file")
# Create methylation regions that have CpG coverage
met_region <- create_region_object(met_dt = met_dt, anno_dt = anno_dt)

4.2.3 Expression data

If the goal is to relate methylation profiles to gene expression levels, we need also to store expression data. Again, here we have pre-processed expression data which can be loaded as follows

expr <- encode_expr

The structure of the expression data is rather simple, it is a data.table with two columns as shown below

head(expr)

##                 id      expr
## 1: ENSG00000230032 -3.321928
## 2: ENSG00000226210  3.833477
## 3: ENSG00000225143 -1.967380
## 4: ENSG00000234465 -3.321928
## 5: ENSG00000206114 -3.321928
## 6: ENSG00000235591 -3.321928

To create the expression data (which are \(log_{2}\) transformed) we could call ( Do not run )

# Read expression data and log2 transform them
expr <- read_expr(file = "expr_file", log2_transf = TRUE)

4.3 Create basis objects

For each genomic region, we will learn a methylation profile using the Binomial observation model with a specified number of Radial Basis Functions (RBFs) \(M\). For a single input variable \(x\), the RBF takes the form \(h_{j}(x) = exp(-\gamma || x - \mu_{j} ||^2)\), where \(\mu_{j}\) denotes the location of the \(j^{th}\) basis function in the input space and \(\gamma\) controls the spatial scale. The case when \(M = 0\) is equivalent to learning the mena methylation rate for the given region (i.e. learn a constant function).

For our running example, we will create two RBF objects, one with 5 basis functions and the other with 0 basis functions denoting the mean methylation rate

# Create RBF basis object with 5 RBFs
basis_profile <- create_rbf_object(M = 5)
# Create RBF basis object with 0 RBFs, i.e. constant function
basis_mean <- create_rbf_object(M = 0)

The rbf object contains information such as the centre locations \(\mu_{j}\) and the value of the spatial scale parameter \(\gamma\)

# Show the slots of the 'rbf' object
basis_profile

## $M
## [1] 5
## 
## $mus
## [1] -0.6666667 -0.3333333  0.0000000  0.3333333  0.6666667
## 
## $gamma
## [1] 6.25
## 
## $eq_spaced_mus
## [1] TRUE
## 
## $whole_region
## [1] TRUE
## 
## attr(,"class")
## [1] "rbf"

The \(\gamma\) is computed by the number of bases \(M\), however the user can tune it according to his liking. Except from RBF basis, the BPRMeth pacakge provides polynomial and Fourier basis functions which can be created with the create_polynomial_object and create_fourier_object functions, respectively.

4.4 Inferring methylation profiles

Learning the methylation profiles is equivalent to inferring the model parameters \(\mathbf{W}\) of the generalised linear regression model. These parameters can be considered as the extracted features which quantitate precisely notions of shape of a methylation profile.

For performing parameter inference, the BPRMeth package is enhanced to provide both maximum likelihood estimation and Bayesian inference for the parameters. Specifically, for Bayesian estimation it provides an efficient mean-field variational inference (Variational Bayes) and a Gibbs sampling algorithm. For computational efficiency one should choose the VB implementation (Note that for Beta observation models, one should use MLE to infer the profiles, since currently the package does not provide Bayesian treatment for Beta likelihood).

To infer the methylation profiles using VB we need to call the infer_profiles_vb function

fit_profiles <- infer_profiles_vb(X = met_region$met, model = "binomial",
                                basis = basis_profile, is_parallel = FALSE)

fit_mean <- infer_profiles_vb(X = met_region$met, model = "binomial", 
                              basis = basis_mean, is_parallel = FALSE)

This results in a infer_profiles object whose most important slot is the inferred weight matrix W, below we show the structure of the fit_profiles object

# Show structure of fit_profiles object
str(fit_profiles, max.level = 1)

## List of 9
##  $ W            : num [1:339, 1:6] 0.546 6.128 -0.235 -0.389 -0.172 ...
##   ..- attr(*, "dimnames")=List of 2
##  $ W_Sigma      :List of 339
##  $ alpha        : num [1:339, 1] 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 ...
##  $ beta         : num [1:339, 1] 35 35 3.887 1.03 0.393 ...
##  $ basis        :List of 5
##   ..- attr(*, "class")= chr "rbf"
##  $ nll_feat     : num [1:339, 1] 61.6 101 88.9 90 30.2 ...
##  $ rmse_feat    : num [1:339, 1] 0.1055 0.0922 0.2108 0.1311 0.2783 ...
##  $ coverage_feat: int [1:339, 1] 29 26 17 41 13 47 16 16 38 21 ...
##  $ lb_feat      : num [1:339, 1] -506 -185.3 -149.6 -229 -98.7 ...
##  - attr(*, "class")= chr [1:3] "infer_profiles_vb_binomial" "infer_profiles_vb" "infer_profiles"

Below we show an example promoter region together with the fitted methylation profiles, where the points represent the DNA methylation level of each CpG site. The shape of the methylation profiles is captured by the blue curve, whereas the red line denotes the mean methylation rate

# Choose promoter region 21 -> i.e. LEPREL2 gene
gene_id <- met_region$anno$id[21]
p <- plot_infer_profiles(region = 21, obj_prof = fit_profiles,
                         obj_mean = fit_mean, obs = met_region$met, 
                         title = paste0("Gene ID ", gene_id))
print(p)

Figure 3: Inferring methylation profiles
Methylation pattern for specific genomic region over +/-7kb promoter region.

4.5 Applications

4.5.1 Predicting gene expression

To quantitatively predict expression at each region, we construct a regression model by taking as input the higher-order methylation features learned from the infer_profiles_vb. In addition to these features, we consider two supplementary sources of information: (1) the goodness of fit in RMSE and (2) the CpG density. For our analysis an SVM regression model is considered. We will use \(70\%\) of the data for training, and we will test the model’s performance on the remaining \(30\%\).

All the aforementioned steps are assembled in the predict_expr wrapper function:

# Set seed for reproducible results
set.seed(22)
# Perform predictions using methylation profiles
pred_profile <- predict_expr(prof_obj = fit_profiles, expr = expr,
                             anno = met_region$anno, model_name = "svm",
                             fit_feature = "RMSE", cov_feature = TRUE, 
                             is_summary = FALSE)
# Perform predictions using mean methylation rate
pred_mean <- predict_expr(prof_obj = fit_mean, expr = expr, 
                          anno = met_region$anno, model_name = "svm",
                          is_summary = FALSE)

We can now compare the Pearson’s correlation coefficient \(r\) for both models and observe that the higher-order methylation features achieve test correlations twice as large compared to mean methylation rate. Note that someone should use cross-validaion to get a better estimate of the correlation performance.

print(paste("Profile r: ", round(pred_profile$test_errors$pcc, 2)))

## [1] "Profile r:  0.66"

print(paste("Mean r:", round(pred_mean$test_errors$pcc, 2)))

## [1] "Mean r: 0.44"

Below in Figure 4 we show a scatter plot of the predicted and measured expression values for the and of the K562 cell line.

g1 <- plot_predicted_expr(pred_obj = pred_profile, 
                          title = "Methylation profile")
g2 <- plot_predicted_expr(pred_obj = pred_mean, 
                          title = "Methylation rate")
g <- cowplot::plot_grid(g1, g2, ncol = 2, nrow = 1)
print(g)

Relationship between DNA methylation patterns and gene expression. Scatter plots of predicted versus measured (log2-transformed) gene expression values using profiles (left) and rates (right); each shaded blue dot represents a different gene, The red line indicates the linear fit between the predicted and measured expression values.

Figure 4: Relationship between DNA methylation patterns and gene expression
Scatter plots of predicted versus measured (log2-transformed) gene expression values using profiles (left) and rates (right); each shaded blue dot represents a different gene, The red line indicates the linear fit between the predicted and measured expression values.

4.5.2 Clustering methylation profiles

Another application of the BPRMeth model is to use the higher-order methylation features to cluster DNA methylation patterns across genomic regions and examine whether distinct methylation profiles are associated to different gene expression levels. To cluster methylation profiles, a mixture modelling approach is considered and we perform inference either via EM algorithm (MLE estimates) or using the mean-field variational inference (Variational Bayes VB) model. In this example we will use the VB model to perform variational inference.

The BPRMeth package provides the cluster_profiles_vb function for performing Bayesian clustering, where the user needs to provide the number of clusters \(K\), the methylation regions \(X\), the observation model and a basis object. For efficiency we run on 20 VB iterations.

# Set seed for reproducible results
set.seed(12) 
# Create basis object
basis_obj <- create_rbf_object(M = 3)
# Set number of clusters K
K <- 4
# Perform clustering
cl_obj <- cluster_profiles_vb(X = met_region$met, K = K, model = "binomial", 
                              alpha_0 = .5, beta_0 = .1,
                              basis = basis_obj, vb_max_iter = 20)

Figure 5 shows the fitted methylation profiles for each cluster.

cluster_plot <- plot_cluster_profiles(cluster_obj = cl_obj)
print(cluster_plot)

Figure 5: Clustering methylation profiles across promoter-proximal regions

4.6 Beta methylation profiles

The structure of the package makes it straighforward to work with methylation data generated from different technologies. For example, with methylation array data, the most common approach is to use the Beta distribution to capture the methylation level from the array experiment. Below we provide a minimal analysis step for inferring methylation profiles on synthetic data; note that we just need to define that model = beta. (In some cases, researchers work with M-values, where they transform the Beta values to \((-\inf, \inf)\), i.e. they make them Gaussian random variables, if that is the case the BPRMeth package can handle this by setting model = gaussian.) Note that for the Beta model, we currently provide only maximum likelihood estimation (MLE).

basis_obj <- create_rbf_object(M = 5)
# Infer beta profiles
beta_prof <- infer_profiles_mle(X = beta_data, model = "beta",
                                basis = basis_obj, is_parallel = FALSE)
p <- plot_infer_profiles(region = 15, obj_prof = beta_prof,
                         obs = beta_data, title = "Beta profile")
print(p)

4.7 Bernoulli methylation profiles

Similarly, someone might analyse single cell methylation data, where now would need to set model = bernoulli as shown in the code snippet below using again synthetic data.

basis_prof <- create_rbf_object(M = 5)
basis_mean <- create_rbf_object(M = 0)
bern_prof <- infer_profiles_vb(X = bernoulli_data, model = "bernoulli",
                               basis = basis_prof, is_parallel = FALSE)
bern_mean <- infer_profiles_vb(X = bernoulli_data, model = "bernoulli",
                               basis = basis_mean, is_parallel = FALSE)
p <- plot_infer_profiles(region = 50, obj_prof = bern_prof, 
                         obj_mean = bern_mean, obs = bernoulli_data, 
                         title = "Bernoulli profile")
print(p)

We can finally cluster the single-cell methylation profiles as follows

# Perform clustering
cl_obj <- cluster_profiles_vb(X = bernoulli_data, K = 3, model = "bernoulli", 
                              basis = basis_obj, vb_max_iter = 40)
                              
cluster_plot <- plot_cluster_profiles(cluster_obj = cl_obj)
print(cluster_plot)

(#fig:cluster_bernoulli)Clustering Bernoulli methylation profiles.

5 Session Info

This vignette was compiled using:

sessionInfo()

## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.9-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.9-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
## [1] BPRMeth_1.10.0       GenomicRanges_1.36.0 GenomeInfoDb_1.20.0 
## [4] IRanges_2.18.0       S4Vectors_0.22.0     BiocGenerics_0.30.0 
## [7] BiocStyle_2.12.0    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.1             RColorBrewer_1.1-2     pillar_1.3.1          
##  [4] compiler_3.6.0         BiocManager_1.30.4     highr_0.8             
##  [7] plyr_1.8.4             XVector_0.24.0         class_7.3-15          
## [10] iterators_1.0.10       bitops_1.0-6           tools_3.6.0           
## [13] zlibbioc_1.30.0        digest_0.6.18          tibble_2.1.1          
## [16] evaluate_0.13          gtable_0.3.0           pkgconfig_2.0.2       
## [19] png_0.1-7              rlang_0.3.4            foreach_1.4.4         
## [22] yaml_2.2.0             xfun_0.6               e1071_1.7-1           
## [25] GenomeInfoDbData_1.2.1 dplyr_0.8.0.1          stringr_1.4.0         
## [28] knitr_1.22             cowplot_0.9.4          tidyselect_0.2.5      
## [31] grid_3.6.0             data.table_1.12.2      glue_1.3.1            
## [34] R6_2.4.0               rmarkdown_1.12         bookdown_0.9          
## [37] purrr_0.3.2            ggplot2_3.1.1          magrittr_1.5          
## [40] codetools_0.2-16       matrixcalc_1.0-3       scales_1.0.0          
## [43] htmltools_0.3.6        assertthat_0.2.1       colorspace_1.4-1      
## [46] labeling_0.3           stringi_1.4.3          RCurl_1.95-4.12       
## [49] lazyeval_0.2.2         munsell_0.5.0          crayon_1.3.4

6 Bibliography

[1] Kapourani, C. A., & Sanguinetti, G. (2016). Higher order methylation features for clustering and prediction in epigenomic studies. Bioinformatics, 32(17), i405-i412.

[2] Kapourani, C. A. and Sanguinetti, G. (2018). BPRMeth: a flexible Bioconductor package for modelling methylation profiles. Bioinformatics, DOI: https://doi.org/10.1093/bioinformatics/bty129

[3] Clark, S.J., Argelaguet, R., Kapourani, C.A., Stubbs, T.M., Lee, H.J., Alda-Catalinas, C., Krueger, F., Sanguinetti, G., Kelsey, G., Marioni, J.C., Stegle, O. and Reik, W., (2018). scNMT-seq enables joint profiling of chromatin accessibility DNA methylation and transcription in single cells. Nature communications, 9(1), p.781.

[4] ENCODE Project Consortium. (2012). An integrated encyclopedia of DNA elements in the human genome. Nature, 489(7414), 57-74.

7 Acknowledgements

This package was developed at the University of Edinburgh in the School of Informatics, with support from Guido Sanguinetti.

This study was supported in part by the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1) and the University of Edinburgh, and by the European Research Council through grant MLCS306999.