Basic Usage

Estimating Transcription Rates from Experimental Data

First estimate transcription rates from an experiment’s BigWig files containing PRO-seq read counts without steric hindrance. Uses Expectation Maximization algorithm to estimate transcription rates such as gene body RNAP density (chi) and the ratio of gene body RNAP density to pause region RNAP density under models of fixed pause sites (betaOrg) and variable pause sites (betaAdp).

The example data used throughout this package is derived from Aoi et al. (2020). The data is from the raw PRO-Seq read counts for human cells treated with auxin-induced acute depletion of NELF(negative elongation factor). Raw PRO-seq data is located at GEO: GSE144786. The raw fastq files were processed with https://github.com/Danko-Lab/proseq2.0.

Estimates of transcription rates, such as chi and beta, are calculated given Expectation Maximization with likelihood formulas derived in Stationary Distribution and Inference at Steady-State section of Siepel (2022) and stated in Inference at Steady State section of Zhao et al. (2023). Note that chi is the estimator for the average read depth, excluding the pause peak and beta is the estimator for the ratio of chi to the read depth in the pause peak region. The estimator for beta is also given for a model adapted for varying pause sites across cells and is denoted by variable betaAdp whereas the fixed pause site estimate is denoted by betaOrg, refer to Allowing for variation in pause site section of Zhao et al. (2023). Pause site statistics such as the mean position of pause sites and the variance of pause sites (fkMean and fkVar) are also reported.

load(system.file("extdata", "granges_for_read_counting_DLD1_chr21.RData", 
package = "STADyUM"))

controlRates <- estimateTranscriptionRates(system.file("extdata",
"PROseq-DLD1-aoi-NELFC_Auxin_Ctrl-SE_plus_chr21.bw", package = "STADyUM"),
system.file("extdata", "PROseq-DLD1-aoi-NELFC_Auxin_Ctrl-SE_minus_chr21.bw",
package = "STADyUM"),
    bw_pause_filtered, bw_gb_filtered, "Control")
#> 
#> Importing bigwig files...
#> 
#> Processing plus and minus strands bigwig...
#> 
#> Summarizing pause and gene body regions...
#> 
#> Generating read counts table...
#> estimating rates...

show(controlRates)
#> An ExperimentTranscriptionRates object with:
#>   - 85 genes
#>   - 85 rate estimates
#>   - Steric hindrance: FALSE 
#> 
#> Summary statistics for rate estimates across all genes/features:
#>   - chi (gene body RNAP density): 0.04 RNAPs/bp
#>   - betaOrg (ratio of gene body RNAP density to pause region
#>     RNAP density, fixed sites): 0.0459
#>   - betaAdp (ratio of gene body RNAP density to pause region
#>     RNAP density, adapted model): 0.0002
#>   - fkMean (mean position of pause sites): ~ 85 bp
#>   - fkVar (variance of pause site positions): 2648.42 bp^2
#> 
#> To access the full rates data, use: rates(object)

treatedRates <- estimateTranscriptionRates(system.file("extdata",
"PROseq-DLD1-aoi-NELFC_Auxin-SE_plus_chr21.bw", package = "STADyUM"),
system.file("extdata", "PROseq-DLD1-aoi-NELFC_Auxin-SE_minus_chr21.bw", 
package = "STADyUM"),
    bw_pause_filtered, bw_gb_filtered, "Auxin Treated")
#> 
#> Importing bigwig files...
#> 
#> Processing plus and minus strands bigwig...
#> 
#> Summarizing pause and gene body regions...
#> 
#> Generating read counts table...
#> estimating rates...

show(treatedRates)
#> An ExperimentTranscriptionRates object with:
#>   - 80 genes
#>   - 80 rate estimates
#>   - Steric hindrance: FALSE 
#> 
#> Summary statistics for rate estimates across all genes/features:
#>   - chi (gene body RNAP density): 0.02 RNAPs/bp
#>   - betaOrg (ratio of gene body RNAP density to pause region
#>     RNAP density, fixed sites): 0.0312
#>   - betaAdp (ratio of gene body RNAP density to pause region
#>     RNAP density, adapted model): 0.0001
#>   - fkMean (mean position of pause sites): ~ 129 bp
#>   - fkVar (variance of pause site positions): 3237.82 bp^2
#> 
#> To access the full rates data, use: rates(object)

controlRatesTbl <- rates(controlRates)

head(controlRatesTbl)
#> # A tibble: 6 × 11
#>   geneId              chi betaOrg   betaAdp fkMean fkVar totalGbRc gbLength
#>   <chr>             <dbl>   <dbl>     <dbl>  <dbl> <dbl>     <dbl>    <int>
#> 1 ENSG00000141956 0.0239  0.0839  0.000268   133.  3905.      1879    78612
#> 2 ENSG00000141959 0.0626  0.0491  0.000265    50.3  239.      1551    24788
#> 3 ENSG00000142166 0.0240  0.0351  0.000159    92.7 3342.       778    32387
#> 4 ENSG00000142168 0.0897  0.0262  0.000134    81.8 1289.       605     6745
#> 5 ENSG00000142185 0.0392  0.0315  0.000184    36.3  215.       584    14882
#> 6 ENSG00000142188 0.00892 0.00522 0.0000243  111.  2909.       401    44949
#> # ℹ 3 more variables: expectedPauseSiteCounts <named list>,
#> #   actualPauseSiteCounts <named list>, likelihood <dbl>

Plotted below is a density plot of gene-body RNAP density values of all the genes and a parity plot for verifying the model fit. The gene-body RNAP density (chi) is an estimate of the total gene-body reads / gene-body length in RNAPs per bp. This plot indicates that most genes have low RNAP density and only a minority of the genes have higher transcriptional activity. The parity plot of expected vs. actual pause site counts per gene can be used to verify that the model fit the actual data well. In this case the Pearson correlation coefficient is 0.99 indicating the model fit is strong.

Plotting functionality for TranscriptionRates objects

plotChiDistrib(controlRates, file="ChiDistributionPlot.png")


plotExpectedVsActualPauseSiteCounts(treatedRates, 
                                    file="treatedExpActPauseSites.png")

Estimate Transcription Rates for Experimental Data under Steric Hindrance

Under the model of steric hindrance, additional rates are estimated such as the landing pad occupancy (phi), pause-escape rate (betaZeta), the potential initiation rate (alphaZeta), and omegaZeta (effective initiation rate). The likelihood functions differ under the model of steric hindrance and are derived in Fitting the Steric Hindrance Model to the Data of Zhao et al. (2023). Note the omega parameter is a scaling factor.

controlRatesSH <- estimateTranscriptionRates(system.file("extdata",
"PROseq-DLD1-aoi-NELFC_Auxin_Ctrl-SE_plus_chr21.bw", package = "STADyUM"),
    system.file("extdata", "PROseq-DLD1-aoi-NELFC_Auxin_Ctrl-SE_minus_chr21.bw",
    package = "STADyUM"),
    bw_pause_filtered, bw_gb_filtered, "Control", stericHindrance=TRUE, 
    omegaScale=12.3768278981277)

show(controlRatesSH)

treatedRatesSH <- estimateTranscriptionRates(
    system.file("extdata", "PROseq-DLD1-aoi-NELFC_Auxin-SE_plus_chr21.bw", 
    package = "STADyUM"),
    system.file("extdata", "PROseq-DLD1-aoi-NELFC_Auxin-SE_minus_chr21.bw",
    package = "STADyUM"),
    bw_pause_filtered, bw_gb_filtered, "Auxin Treated", stericHindrance=TRUE, 
    omegaScale=11.0318379571379)

show(treatedRatesSH)

Likelihood Ratio Tests

Compare samples treated with auxin to control samples using the likelihood ratio test functionality derived in Likelihood Ratio Tests for Differences Between Transcription Units, Conditions, or Species section of Siepel (2022)

lrt <- likelihoodRatioTest(controlRates, treatedRates)

show(lrt)
#> An object of class "TranscriptionRatesLRT"
#> Slot "transcriptionRates1":
#> An ExperimentTranscriptionRates object with:
#>   - 85 genes
#>   - 85 rate estimates
#>   - Steric hindrance: FALSE 
#> 
#> Summary statistics for rate estimates across all genes/features:
#>   - chi (gene body RNAP density): 0.04 RNAPs/bp
#>   - betaOrg (ratio of gene body RNAP density to pause region
#>     RNAP density, fixed sites): 0.0459
#>   - betaAdp (ratio of gene body RNAP density to pause region
#>     RNAP density, adapted model): 0.0002
#>   - fkMean (mean position of pause sites): ~ 85 bp
#>   - fkVar (variance of pause site positions): 2648.42 bp^2
#> 
#> To access the full rates data, use: rates(object)
#> 
#> Slot "transcriptionRates2":
#> An ExperimentTranscriptionRates object with:
#>   - 80 genes
#>   - 80 rate estimates
#>   - Steric hindrance: FALSE 
#> 
#> Summary statistics for rate estimates across all genes/features:
#>   - chi (gene body RNAP density): 0.02 RNAPs/bp
#>   - betaOrg (ratio of gene body RNAP density to pause region
#>     RNAP density, fixed sites): 0.0312
#>   - betaAdp (ratio of gene body RNAP density to pause region
#>     RNAP density, adapted model): 0.0001
#>   - fkMean (mean position of pause sites): ~ 129 bp
#>   - fkVar (variance of pause site positions): 3237.82 bp^2
#> 
#> To access the full rates data, use: rates(object)
#> 
#> Slot "spikeInFile":
#> NULL
#> 
#> Slot "chiTbl":
#> # A tibble: 80 × 7
#>    geneId             chi1    chi2    lfc tStats         p      padj
#>    <chr>             <dbl>   <dbl>  <dbl>  <dbl>     <dbl>     <dbl>
#>  1 ENSG00000141956 0.0239  0.0107  -1.16   203.  2.29e- 90 2.04e- 89
#>  2 ENSG00000141959 0.0626  0.0250  -1.33   207.  6.02e- 92 6.02e- 91
#>  3 ENSG00000142166 0.0240  0.0122  -0.982   64.1 1.03e- 29 2.17e- 29
#>  4 ENSG00000142168 0.0897  0.0455  -0.979   49.6 2.30e- 23 4.18e- 23
#>  5 ENSG00000142185 0.0392  0.0177  -1.15    62.4 5.78e- 29 1.19e- 28
#>  6 ENSG00000142188 0.00892 0.00587 -0.603   14.2 9.73e-  8 1.20e-  7
#>  7 ENSG00000142192 0.0495  0.0426  -0.218   23.7 5.97e- 12 8.53e- 12
#>  8 ENSG00000142197 0.0105  0.00469 -1.16   103.  1.47e- 46 4.51e- 46
#>  9 ENSG00000142207 0.0568  0.0295  -0.947  351.  1.64e-154 6.57e-153
#> 10 ENSG00000154639 0.0233  0.0134  -0.803  107.  1.97e- 48 6.56e- 48
#> # ℹ 70 more rows
#> 
#> Slot "betaTbl":
#> # A tibble: 80 × 11
#>    geneId   beta1   beta2     lfc fkMean1 fkMean2 fkVar1 fkVar2  tStats        p
#>    <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>  <dbl>   <dbl>    <dbl>
#>  1 ENSG0… 2.68e-4 1.79e-4 -0.580    133.    142.   3905.  3779. 2.81e+0 1.77e- 2
#>  2 ENSG0… 2.65e-4 9.00e-5 -1.56      50.3   104.    239.  4039. 8.53e+1 5.58e-39
#>  3 ENSG0… 1.59e-4 6.10e-5 -1.38      92.7   142.   3342.  3116. 4.04e+1 2.38e-19
#>  4 ENSG0… 1.34e-4 5.08e-5 -1.40      81.8    99.0  1289.  1113. 1.80e+2 1.84e-80
#>  5 ENSG0… 1.84e-4 2.57e-4  0.481     36.3   120.    215.  5191. 9.87e-1 1.60e- 1
#>  6 ENSG0… 2.43e-5 1.84e-5 -0.397    111.    146.   2909.  3275. 6.41e+0 3.45e- 4
#>  7 ENSG0… 1.62e-4 1.43e-4 -0.186     76.1   151.   2719.  4207. 3.21e+0 1.13e- 2
#>  8 ENSG0… 4.12e-5 4.02e-5 -0.0365    76.9   142.   2620.  4717. 3.73e-2 7.85e- 1
#>  9 ENSG0… 1.47e-4 6.10e-5 -1.27      94.6   119.   2790.  2115. 8.38e+1 2.51e-38
#> 10 ENSG0… 1.59e-4 5.98e-5 -1.41     103.    142.   3851.  3592. 5.24e+1 1.40e-24
#> # ℹ 70 more rows
#> # ℹ 1 more variable: padj <dbl>

Plot Results from Likelihood Ratio Tests

As expected given current literature, RNAPs tend to pause more downstream when NELF is absent due to Auxin treatment. As shown in the plot, the majority of treated genes have average pause locations of more than 100bp downstream of the TSS. Strikingly, few RNAPs could enter productive elongation after NELF depletion and treated samples had a signficantly lower pause-escape rate than the control samples exemplified by the beta violin plots.


plotMeanPauseDistrib(controlRates, file="meanPauseSitesControl.png")

plotMeanPauseDistrib(treatedRates, file="meanPauseSitesTreated.png")


BetaViolinPlot(lrt, file="lrtBetaViolinPlot.png")

Simulate Polymerase

(A) Conceptual illustration of model focusing on the kinetic model for RNAP movement on the DNA template. (B) Graphical model representation with unobserved continuous-time Markov chain (Z) and observed read counts (X). Read counts at each site are conditionally independent and Poisson distributed. (C) Design of SimPol simulator tracking the movement of in silico RNAPs across N-bp DNA templates in C cells, then samples the synthetic read counts based on RNAP positions. (D) Example of synthetic nascent RNA sequencing data from SimPol alongside matched real PRO-seq data from the DNAJA1 gene on chromosome 9 of the human genome

Simulate read counts by running simulations of polymerase movement along the gene at user-specified rates and gene lengths. Our simulator models the movement of RNAPs along a set of transcription units given a continuous-time Markov model. Nascent RNA sequencing read counts are then generated under steady-state conditions by sampling based on a Poisson distribution with mean equal to the simulated RNAP frequency multiplied by a user-specified sequencing depth. A graphical model representation (Figure B) with unobserved continuous-time Markov chain (Z_i) in which user-specified paramters for transcription initiation (alpha), pause-escape, and elongation (zeta) are included as well as the observed read counts (Xi). The initial model is extended to allow for variability across cells in promoter-proximal pause site locations and steric hindrance of transcription initiation from paused RNAPs (Figure C). More information on the simulator can be found in the sections A simple probabilistic model for transcription initiation, promoter-proximal pausing, and elongation, SimPol Simulator, Generation of synthetic NRS data of Zhao et al. (2023)

simpol <- simulatePolymerase(k=50, ksd=25, kMin=17, kMax=200, geneLen=1950,
alpha=2, beta=0.5, zeta=2000, zetaSd=1000, zetaMin=1500, zetaMax=2500,
cellNum=100, polSize=33, addSpace=17, time=39, timesToRecord=NULL)
#> Total time used for the simulation is 0.0166667 mins.

show(simpol)
#> A SimulatePolymerase object with:
#>   - gene length = 1950 
#>   - number of cells = 100 
#>   - pause sites mean = 50.33 
#>   - pause sites std dev = 20.44 
#>   - top 10 most occupied sites across all cells:
#>     position    0: 100 polymerases
#>     position   47: 6 polymerases
#>     position   21: 5 polymerases
#>     position   25: 5 polymerases
#>     position   39: 5 polymerases
#>     position   75: 4 polymerases
#>     position    2: 3 polymerases
#>     position   18: 3 polymerases
#>     position   23: 3 polymerases
#>     position   48: 3 polymerases
#>   - gene body average read counts = 0.05 
#>   - pause region average read counts = 2.5 
#>   - percent of nucleotides with zero reads = 94.8 %
#> 
#> To access the full simulation parameters, use: parameters(object)
#> 
#> To access the all sampled read counts, use: readCounts(object)

readCounts <- readCounts(simpol)

plotCombinedCells(simpol, file="combinedCellsLollipopPlot.png")

This uses the same core Expectation Maximization algorithm as the experimental data, but it is able to handle the simulated data as well.

Estimate Transcription Rates from Simulated Data

simRates <- estimateTranscriptionRates(simpol, name="simRatesb0.5k50",
stericHindrance=TRUE)
#> Starting EM algorithm...

show(simRates)
#> A SimulationTranscriptionRates object with:
#>   - Steric hindrance: TRUE 
#>   - Number of Trials with 5000 cells: 50 
#> 
#> Summary statistics for rate estimates:
#>   - chi (gene body RNAP density): 0.25 RNAPs/bp
#>   - betaOrg (ratio of gene body RNAP density to pause region
#>         RNAP density, fixed sites): 0.4834
#>   - betaAdp (ratio of gene body RNAP density to pause region
#>         RNAP density, adapted model): 0.0033
#>   - TSS Read Counts Averaged Across Trials: ~ 104 read
#>         counts
#>   - Gene Body Read Counts Averaged Across Trials: ~ 436 read
#>         counts
#>   - Landing Pad Read Counts Averaged Across Trials: ~ 4151
#>         read counts
#>   - phi (landing pad occupancy): 0.62
#> 
#> EM Algorithm Convergence:
#>   - Trials converged normally: 50/50 (100.0%)
#>   - Trials converged to single site: 0/50 (0.0%)
#>   - Trials reached max iterations without convergence: 0/50
#>     (0.0%)

Likelihood Ratio Tests for Simulated Data

Likelihood ratio tests can also be applied to simulated data. In this example, the simulated rates for the first simpol object is compared to the simulated rates from the second simpol object. As expected given the simpol parameters, the plots show that the second simulated rates have higher beta and higher mean pause sites


simpol2 <- simulatePolymerase(k=100, ksd=25, kMin=75, kMax=200, geneLen=1950,
alpha=2, beta=10, zeta=2000, zetaSd=1000, zetaMin=1500, zetaMax=2500,
cellNum=100, polSize=33, addSpace=17, time=30, timesToRecord=NULL)

simRates2 <- estimateTranscriptionRates(simpol2, name="simRatesb10k100",
stericHindrance=TRUE)

simLRT <- likelihoodRatioTest(simRates, simRates2)

plotPauseSiteContourMapTwoConditions(simLRT, file="simPauseSitesContourMap.png")
BetaViolinPlot(simLRT, file="simBetaViolinPlot.png")

Session Info

sessionInfo()
#> R Under development (unstable) (2025-10-20 r88955)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] STADyUM_1.1.3
#> 
#> loaded via a namespace (and not attached):
#>  [1] SummarizedExperiment_1.41.0 gtable_0.3.6               
#>  [3] rjson_0.2.23                xfun_0.54                  
#>  [5] bslib_0.9.0                 ggplot2_4.0.1              
#>  [7] Biobase_2.71.0              lattice_0.22-7             
#>  [9] vctrs_0.6.5                 tools_4.6.0                
#> [11] bitops_1.0-9                generics_0.1.4             
#> [13] stats4_4.6.0                curl_7.0.0                 
#> [15] parallel_4.6.0              tibble_3.3.0               
#> [17] pkgconfig_2.0.3             Matrix_1.7-4               
#> [19] data.table_1.17.8           RColorBrewer_1.1-3         
#> [21] cigarillo_1.1.0             S7_0.2.1                   
#> [23] S4Vectors_0.49.0            lifecycle_1.0.4            
#> [25] compiler_4.6.0              farver_2.1.2               
#> [27] textshaping_1.0.4           progress_1.2.3             
#> [29] Rsamtools_2.27.0            Biostrings_2.79.2          
#> [31] Seqinfo_1.1.0               codetools_0.2-20           
#> [33] GenomeInfoDb_1.47.2         htmltools_0.5.9            
#> [35] sass_0.4.10                 RCurl_1.98-1.17            
#> [37] yaml_2.3.11                 tidyr_1.3.1                
#> [39] pillar_1.11.1               crayon_1.5.3               
#> [41] jquerylib_0.1.4             MASS_7.3-65                
#> [43] BiocParallel_1.45.0         cachem_1.1.0               
#> [45] DelayedArray_0.37.0         abind_1.4-8                
#> [47] tidyselect_1.2.1            digest_0.6.39              
#> [49] dplyr_1.1.4                 purrr_1.2.0                
#> [51] restfulr_0.0.16             labeling_0.4.3             
#> [53] fastmap_1.2.0               grid_4.6.0                 
#> [55] SparseArray_1.11.8          cli_3.6.5                  
#> [57] magrittr_2.0.4              S4Arrays_1.11.1            
#> [59] utf8_1.2.6                  dichromat_2.0-0.1          
#> [61] XML_3.99-0.20               withr_3.0.2                
#> [63] prettyunits_1.2.0           scales_1.4.0               
#> [65] UCSC.utils_1.7.0            rmarkdown_2.30             
#> [67] XVector_0.51.0              httr_1.4.7                 
#> [69] matrixStats_1.5.0           ragg_1.5.0                 
#> [71] hms_1.1.4                   evaluate_1.0.5             
#> [73] knitr_1.50                  GenomicRanges_1.63.0       
#> [75] IRanges_2.45.0              BiocIO_1.21.0              
#> [77] rtracklayer_1.71.0          rlang_1.1.6                
#> [79] Rcpp_1.1.0                  glue_1.8.0                 
#> [81] BiocGenerics_0.57.0         jsonlite_2.0.0             
#> [83] R6_2.6.1                    systemfonts_1.3.1          
#> [85] MatrixGenerics_1.23.0       GenomicAlignments_1.47.0

Aoi, Yuki, Edwin R. Smith, Avani P. Shah, Emily J. Rendleman, Stacy A. Marshall, Ashley R. Woodfin, Fei X. Chen, Ramin Shiekhattar, and Ali Shilatifard. 2020. “NELF Regulates a Promoter-Proximal Step Distinct from Rna Pol Ii Pause-Release.” Molecular Cell 78 (2): 261–274.e5. https://doi.org/10.1016/j.molcel.2020.02.014.

Siepel, Adam. 2022. “A Unified Probabilistic Modeling Framework for Eukaryotic Transcription Based on Nascent Rna Sequencing Data.” bioRxiv. https://doi.org/10.1101/2021.01.12.426408.

Zhao, Yixin, Lingjie Liu, Rebecca Hassett, and Adam Siepel. 2023. “Model-Based Characterization of the Equilibrium Dynamics of Transcription Initiation and Promoter-Proximal Pausing in Human Cells.” Nucleic Acids Research 51 (21): e106–e106. https://doi.org/10.1093/nar/gkad843.

STADyUM: Simulating and Analyzing Transcription Dynamics

Rebecca Hassett

2025-12-04

Introduction

Installation