--- title: "STADyUM: Simulating and Analyzing Transcription Dynamics" author: "Rebecca Hassett" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{STADyUM: Simulating and Analyzing Transcription Dynamics} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: references.bib --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r load_package, include = FALSE} # Load the package library(STADyUM) ``` # Introduction STADyUM is an R package for simulating and analyzing transcription dynamics. It provides tools for: - Simulating polymerase movement along genes - Estimating transcription initiation, pause-release, and elongation rates - Analyzing pause site distributions and RNAP density profiles - Visualizing transcription dynamics across genomic regions - Comparing transcription-associated rates under different conditions STADyUM seamlessly integrates with the Bioconductor ecosystem by leveraging standard Bioconductor data structures such as GRanges for genomic interval handling and S4Vectors for custom Bioconductor-style data structures. This compatibility allows users to incorporate STADyUM's results directly into downstream analyses and visualization pipelines with other Bioconductor tools. While existing Bioconductor packages like INSPEcT focuses primarily on modeling RNA synthesis, processing, and degradation using both nascent and mature RNA-seq data through systems of differential equations, STADyUM takes a different modeling approach. STADyUM provides a probabilistic framework tailored specifically for nascent RNA data alone (such as PRO-seq, GRO-seq, 4sU-DRB-seq) including functionality for modeling steric hindrance between RNA polymerases and modeling pause-site kinetics explicitly. # Installation ```{r install, eval = FALSE} if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install("STADyUM") ``` # Basic Usage ## Estimating Transcription Rates from Experimental Data First estimate transcription rates from an experiment's BigWig files containing PRO-seq read counts without steric hindrance. Uses Expectation Maximization algorithm to estimate transcription rates such as gene body RNAP density (chi) and the ratio of gene body RNAP density to pause region RNAP density under models of fixed pause sites (betaOrg) and variable pause sites (betaAdp). The example data used throughout this package is derived from @Aoi2020. The data is from the raw PRO-Seq read counts for human cells treated with auxin-induced acute depletion of NELF(negative elongation factor). Raw PRO-seq data is located at GEO: GSE144786. The raw fastq files were processed with https://github.com/Danko-Lab/proseq2.0. Estimates of transcription rates, such as chi and beta, are calculated given Expectation Maximization with likelihood formulas derived in *Stationary Distribution and Inference at Steady-State* section of @Siepel2022 and stated in *Inference at Steady State* section of @Zhao2023. Note that chi is the estimator for the average read depth, excluding the pause peak and beta is the estimator for the ratio of chi to the read depth in the pause peak region. The estimator for beta is also given for a model adapted for varying pause sites across cells and is denoted by variable betaAdp whereas the fixed pause site estimate is denoted by betaOrg, refer to *Allowing for variation in pause site* section of @Zhao2023. Pause site statistics such as the mean position of pause sites and the variance of pause sites (fkMean and fkVar) are also reported. ```{r estimate_experiment_rates} load(system.file("extdata", "granges_for_read_counting_DLD1_chr21.RData", package = "STADyUM")) controlRates <- estimateTranscriptionRates(system.file("extdata", "PROseq-DLD1-aoi-NELFC_Auxin_Ctrl-SE_plus_chr21.bw", package = "STADyUM"), system.file("extdata", "PROseq-DLD1-aoi-NELFC_Auxin_Ctrl-SE_minus_chr21.bw", package = "STADyUM"), bw_pause_filtered, bw_gb_filtered, "Control") show(controlRates) treatedRates <- estimateTranscriptionRates(system.file("extdata", "PROseq-DLD1-aoi-NELFC_Auxin-SE_plus_chr21.bw", package = "STADyUM"), system.file("extdata", "PROseq-DLD1-aoi-NELFC_Auxin-SE_minus_chr21.bw", package = "STADyUM"), bw_pause_filtered, bw_gb_filtered, "Auxin Treated") show(treatedRates) controlRatesTbl <- rates(controlRates) head(controlRatesTbl) ``` Plotted below is a density plot of gene-body RNAP density values of all the genes and a parity plot for verifying the model fit. The gene-body RNAP density (chi) is an estimate of the total gene-body reads / gene-body length in RNAPs per bp. This plot indicates that most genes have low RNAP density and only a minority of the genes have higher transcriptional activity. The parity plot of expected vs. actual pause site counts per gene can be used to verify that the model fit the actual data well. In this case the Pearson correlation coefficient is 0.99 indicating the model fit is strong. ## Plotting functionality for TranscriptionRates objects ```{r plot_exp_rates, fig.width=7, fig.height=5, out.width="75%"} plotChiDistrib(controlRates, file="ChiDistributionPlot.png") plotExpectedVsActualPauseSiteCounts(treatedRates, file="treatedExpActPauseSites.png") ``` ## Estimate Transcription Rates for Experimental Data under Steric Hindrance Under the model of steric hindrance, additional rates are estimated such as the landing pad occupancy (phi), pause-escape rate (betaZeta), the potential initiation rate (alphaZeta), and omegaZeta (effective initiation rate). The likelihood functions differ under the model of steric hindrance and are derived in *Fitting the Steric Hindrance Model to the Data* of @Zhao2023. Note the omega parameter is a scaling factor. ```{r estimate_experiment_rates_steric_hindrance, eval=FALSE} controlRatesSH <- estimateTranscriptionRates(system.file("extdata", "PROseq-DLD1-aoi-NELFC_Auxin_Ctrl-SE_plus_chr21.bw", package = "STADyUM"), system.file("extdata", "PROseq-DLD1-aoi-NELFC_Auxin_Ctrl-SE_minus_chr21.bw", package = "STADyUM"), bw_pause_filtered, bw_gb_filtered, "Control", stericHindrance=TRUE, omegaScale=12.3768278981277) show(controlRatesSH) treatedRatesSH <- estimateTranscriptionRates( system.file("extdata", "PROseq-DLD1-aoi-NELFC_Auxin-SE_plus_chr21.bw", package = "STADyUM"), system.file("extdata", "PROseq-DLD1-aoi-NELFC_Auxin-SE_minus_chr21.bw", package = "STADyUM"), bw_pause_filtered, bw_gb_filtered, "Auxin Treated", stericHindrance=TRUE, omegaScale=11.0318379571379) show(treatedRatesSH) ``` ## Likelihood Ratio Tests Compare samples treated with auxin to control samples using the likelihood ratio test functionality derived in *Likelihood Ratio Tests for Differences Between Transcription Units, Conditions, or Species* section of @Siepel2022 ```{r likelihood_ratio_tests} lrt <- likelihoodRatioTest(controlRates, treatedRates) show(lrt) ``` ## Plot Results from Likelihood Ratio Tests As expected given current literature, RNAPs tend to pause more downstream when NELF is absent due to Auxin treatment. As shown in the plot, the majority of treated genes have average pause locations of more than 100bp downstream of the TSS. Strikingly, few RNAPs could enter productive elongation after NELF depletion and treated samples had a signficantly lower pause-escape rate than the control samples exemplified by the beta violin plots. ```{r plot_lrt_results,fig.width=7,fig.height=5,out.width="75%"} plotMeanPauseDistrib(controlRates, file="meanPauseSitesControl.png") plotMeanPauseDistrib(treatedRates, file="meanPauseSitesTreated.png") BetaViolinPlot(lrt, file="lrtBetaViolinPlot.png") ``` ## Simulate Polymerase ![(A) Conceptual illustration of model focusing on the kinetic model for RNAP movement on the DNA template. (B) Graphical model representation with unobserved continuous-time Markov chain (Z) and observed read counts (X). Read counts at each site are conditionally independent and Poisson distributed. (C) Design of SimPol simulator tracking the movement of in silico RNAPs across N-bp DNA templates in C cells, then samples the synthetic read counts based on RNAP positions. (D) Example of synthetic nascent RNA sequencing data from SimPol alongside matched real PRO-seq data from the DNAJA1 gene on chromosome 9 of the human genome](simpol_figures.png){width=75%} Simulate read counts by running simulations of polymerase movement along the gene at user-specified rates and gene lengths. Our simulator models the movement of RNAPs along a set of transcription units given a continuous-time Markov model. Nascent RNA sequencing read counts are then generated under steady-state conditions by sampling based on a Poisson distribution with mean equal to the simulated RNAP frequency multiplied by a user-specified sequencing depth. A graphical model representation (Figure B) with unobserved continuous-time Markov chain (Z_i) in which user-specified paramters for transcription initiation (alpha), pause-escape, and elongation (zeta) are included as well as the observed read counts (Xi). The initial model is extended to allow for variability across cells in promoter-proximal pause site locations and steric hindrance of transcription initiation from paused RNAPs (Figure C). More information on the simulator can be found in the sections *A simple probabilistic model for transcription initiation, promoter-proximal pausing, and elongation*, *SimPol Simulator*, *Generation of synthetic NRS data* of @Zhao2023 ```{r create_simpol, fig.width=7,fig.height=5,out.width="70%",out.height="50%"} simpol <- simulatePolymerase(k=50, ksd=25, kMin=17, kMax=200, geneLen=1950, alpha=2, beta=0.5, zeta=2000, zetaSd=1000, zetaMin=1500, zetaMax=2500, cellNum=100, polSize=33, addSpace=17, time=39, timesToRecord=NULL) show(simpol) readCounts <- readCounts(simpol) plotCombinedCells(simpol, file="combinedCellsLollipopPlot.png") ``` This uses the same core Expectation Maximization algorithm as the experimental data, but it is able to handle the simulated data as well. ## Estimate Transcription Rates from Simulated Data ```{r estimate_simulated_rates, fig.width=7, fig.height=5, out.width="75%"} simRates <- estimateTranscriptionRates(simpol, name="simRatesb0.5k50", stericHindrance=TRUE) show(simRates) ``` ## Likelihood Ratio Tests for Simulated Data Likelihood ratio tests can also be applied to simulated data. In this example, the simulated rates for the first simpol object is compared to the simulated rates from the second simpol object. As expected given the simpol parameters, the plots show that the second simulated rates have higher beta and higher mean pause sites ![](simPauseSitesContourMap.png){width=75%} ![](simBetaViolinPlot.png){width=75%} ```{r simulated_lrts, fig.width=7, fig.height=5, out.width="75%", eval=FALSE} simpol2 <- simulatePolymerase(k=100, ksd=25, kMin=75, kMax=200, geneLen=1950, alpha=2, beta=10, zeta=2000, zetaSd=1000, zetaMin=1500, zetaMax=2500, cellNum=100, polSize=33, addSpace=17, time=30, timesToRecord=NULL) simRates2 <- estimateTranscriptionRates(simpol2, name="simRatesb10k100", stericHindrance=TRUE) simLRT <- likelihoodRatioTest(simRates, simRates2) plotPauseSiteContourMapTwoConditions(simLRT, file="simPauseSitesContourMap.png") BetaViolinPlot(simLRT, file="simBetaViolinPlot.png") ``` ## Session Info ```{r session_info} sessionInfo() ```