STADyUM is an R package for simulating and analyzing transcription dynamics. It provides tools for:
STADyUM seamlessly integrates with the Bioconductor ecosystem by leveraging standard Bioconductor data structures such as GRanges for genomic interval handling and S4Vectors for custom Bioconductor-style data structures. This compatibility allows users to incorporate STADyUM’s results directly into downstream analyses and visualization pipelines with other Bioconductor tools.
While existing Bioconductor packages like INSPEcT focuses primarily on modeling RNA synthesis, processing, and degradation using both nascent and mature RNA-seq data through systems of differential equations, STADyUM takes a different modeling approach. STADyUM provides a probabilistic framework tailored specifically for nascent RNA data alone (such as PRO-seq, GRO-seq, 4sU-DRB-seq) including functionality for modeling steric hindrance between RNA polymerases and modeling pause-site kinetics explicitly.
First estimate transcription rates from an experiment’s BigWig files containing PRO-seq read counts without steric hindrance. Uses Expectation Maximization algorithm to estimate transcription rates such as gene body RNAP density (chi) and the ratio of gene body RNAP density to pause region RNAP density under models of fixed pause sites (betaOrg) and variable pause sites (betaAdp).
The example data used throughout this package is derived from Aoi et al. (2020). The data is from the raw PRO-Seq read counts for human cells treated with auxin-induced acute depletion of NELF(negative elongation factor). Raw PRO-seq data is located at GEO: GSE144786. The raw fastq files were processed with https://github.com/Danko-Lab/proseq2.0.
Estimates of transcription rates, such as chi and beta, are calculated given Expectation Maximization with likelihood formulas derived in Stationary Distribution and Inference at Steady-State section of Siepel (2022) and stated in Inference at Steady State section of Zhao et al. (2023). Note that chi is the estimator for the average read depth, excluding the pause peak and beta is the estimator for the ratio of chi to the read depth in the pause peak region. The estimator for beta is also given for a model adapted for varying pause sites across cells and is denoted by variable betaAdp whereas the fixed pause site estimate is denoted by betaOrg, refer to Allowing for variation in pause site section of Zhao et al. (2023). Pause site statistics such as the mean position of pause sites and the variance of pause sites (fkMean and fkVar) are also reported.
load(system.file("extdata", "granges_for_read_counting_DLD1_chr21.RData",
package = "STADyUM"))
controlRates <- estimateTranscriptionRates(system.file("extdata",
"PROseq-DLD1-aoi-NELFC_Auxin_Ctrl-SE_plus_chr21.bw", package = "STADyUM"),
system.file("extdata", "PROseq-DLD1-aoi-NELFC_Auxin_Ctrl-SE_minus_chr21.bw",
package = "STADyUM"),
bw_pause_filtered, bw_gb_filtered, "Control")
#>
#> Importing bigwig files...
#>
#> Processing plus and minus strands bigwig...
#>
#> Summarizing pause and gene body regions...
#>
#> Generating read counts table...
#> estimating rates...
show(controlRates)
#> An ExperimentTranscriptionRates object with:
#> - 85 genes
#> - 85 rate estimates
#> - Steric hindrance: FALSE
#>
#> Summary statistics for rate estimates across all genes/features:
#> - chi (gene body RNAP density): 0.04 RNAPs/bp
#> - betaOrg (ratio of gene body RNAP density to pause region
#> RNAP density, fixed sites): 0.0459
#> - betaAdp (ratio of gene body RNAP density to pause region
#> RNAP density, adapted model): 0.0002
#> - fkMean (mean position of pause sites): ~ 85 bp
#> - fkVar (variance of pause site positions): 2648.42 bp^2
#>
#> To access the full rates data, use: rates(object)
treatedRates <- estimateTranscriptionRates(system.file("extdata",
"PROseq-DLD1-aoi-NELFC_Auxin-SE_plus_chr21.bw", package = "STADyUM"),
system.file("extdata", "PROseq-DLD1-aoi-NELFC_Auxin-SE_minus_chr21.bw",
package = "STADyUM"),
bw_pause_filtered, bw_gb_filtered, "Auxin Treated")
#>
#> Importing bigwig files...
#>
#> Processing plus and minus strands bigwig...
#>
#> Summarizing pause and gene body regions...
#>
#> Generating read counts table...
#> estimating rates...
show(treatedRates)
#> An ExperimentTranscriptionRates object with:
#> - 80 genes
#> - 80 rate estimates
#> - Steric hindrance: FALSE
#>
#> Summary statistics for rate estimates across all genes/features:
#> - chi (gene body RNAP density): 0.02 RNAPs/bp
#> - betaOrg (ratio of gene body RNAP density to pause region
#> RNAP density, fixed sites): 0.0312
#> - betaAdp (ratio of gene body RNAP density to pause region
#> RNAP density, adapted model): 0.0001
#> - fkMean (mean position of pause sites): ~ 129 bp
#> - fkVar (variance of pause site positions): 3237.82 bp^2
#>
#> To access the full rates data, use: rates(object)
controlRatesTbl <- rates(controlRates)
head(controlRatesTbl)
#> # A tibble: 6 × 11
#> geneId chi betaOrg betaAdp fkMean fkVar totalGbRc gbLength
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 ENSG00000141956 0.0239 0.0839 0.000268 133. 3905. 1879 78612
#> 2 ENSG00000141959 0.0626 0.0491 0.000265 50.3 239. 1551 24788
#> 3 ENSG00000142166 0.0240 0.0351 0.000159 92.7 3342. 778 32387
#> 4 ENSG00000142168 0.0897 0.0262 0.000134 81.8 1289. 605 6745
#> 5 ENSG00000142185 0.0392 0.0315 0.000184 36.3 215. 584 14882
#> 6 ENSG00000142188 0.00892 0.00522 0.0000243 111. 2909. 401 44949
#> # ℹ 3 more variables: expectedPauseSiteCounts <named list>,
#> # actualPauseSiteCounts <named list>, likelihood <dbl>
Plotted below is a density plot of gene-body RNAP density values of all the genes and a parity plot for verifying the model fit. The gene-body RNAP density (chi) is an estimate of the total gene-body reads / gene-body length in RNAPs per bp. This plot indicates that most genes have low RNAP density and only a minority of the genes have higher transcriptional activity. The parity plot of expected vs. actual pause site counts per gene can be used to verify that the model fit the actual data well. In this case the Pearson correlation coefficient is 0.99 indicating the model fit is strong.
Under the model of steric hindrance, additional rates are estimated such as the landing pad occupancy (phi), pause-escape rate (betaZeta), the potential initiation rate (alphaZeta), and omegaZeta (effective initiation rate). The likelihood functions differ under the model of steric hindrance and are derived in Fitting the Steric Hindrance Model to the Data of Zhao et al. (2023). Note the omega parameter is a scaling factor.
controlRatesSH <- estimateTranscriptionRates(system.file("extdata",
"PROseq-DLD1-aoi-NELFC_Auxin_Ctrl-SE_plus_chr21.bw", package = "STADyUM"),
system.file("extdata", "PROseq-DLD1-aoi-NELFC_Auxin_Ctrl-SE_minus_chr21.bw",
package = "STADyUM"),
bw_pause_filtered, bw_gb_filtered, "Control", stericHindrance=TRUE,
omegaScale=12.3768278981277)
show(controlRatesSH)
treatedRatesSH <- estimateTranscriptionRates(
system.file("extdata", "PROseq-DLD1-aoi-NELFC_Auxin-SE_plus_chr21.bw",
package = "STADyUM"),
system.file("extdata", "PROseq-DLD1-aoi-NELFC_Auxin-SE_minus_chr21.bw",
package = "STADyUM"),
bw_pause_filtered, bw_gb_filtered, "Auxin Treated", stericHindrance=TRUE,
omegaScale=11.0318379571379)
show(treatedRatesSH)
Compare samples treated with auxin to control samples using the likelihood ratio test functionality derived in Likelihood Ratio Tests for Differences Between Transcription Units, Conditions, or Species section of Siepel (2022)
lrt <- likelihoodRatioTest(controlRates, treatedRates)
show(lrt)
#> An object of class "TranscriptionRatesLRT"
#> Slot "transcriptionRates1":
#> An ExperimentTranscriptionRates object with:
#> - 85 genes
#> - 85 rate estimates
#> - Steric hindrance: FALSE
#>
#> Summary statistics for rate estimates across all genes/features:
#> - chi (gene body RNAP density): 0.04 RNAPs/bp
#> - betaOrg (ratio of gene body RNAP density to pause region
#> RNAP density, fixed sites): 0.0459
#> - betaAdp (ratio of gene body RNAP density to pause region
#> RNAP density, adapted model): 0.0002
#> - fkMean (mean position of pause sites): ~ 85 bp
#> - fkVar (variance of pause site positions): 2648.42 bp^2
#>
#> To access the full rates data, use: rates(object)
#>
#> Slot "transcriptionRates2":
#> An ExperimentTranscriptionRates object with:
#> - 80 genes
#> - 80 rate estimates
#> - Steric hindrance: FALSE
#>
#> Summary statistics for rate estimates across all genes/features:
#> - chi (gene body RNAP density): 0.02 RNAPs/bp
#> - betaOrg (ratio of gene body RNAP density to pause region
#> RNAP density, fixed sites): 0.0312
#> - betaAdp (ratio of gene body RNAP density to pause region
#> RNAP density, adapted model): 0.0001
#> - fkMean (mean position of pause sites): ~ 129 bp
#> - fkVar (variance of pause site positions): 3237.82 bp^2
#>
#> To access the full rates data, use: rates(object)
#>
#> Slot "spikeInFile":
#> NULL
#>
#> Slot "chiTbl":
#> # A tibble: 80 × 7
#> geneId chi1 chi2 lfc tStats p padj
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 ENSG00000141956 0.0239 0.0107 -1.16 203. 2.29e- 90 2.04e- 89
#> 2 ENSG00000141959 0.0626 0.0250 -1.33 207. 6.02e- 92 6.02e- 91
#> 3 ENSG00000142166 0.0240 0.0122 -0.982 64.1 1.03e- 29 2.17e- 29
#> 4 ENSG00000142168 0.0897 0.0455 -0.979 49.6 2.30e- 23 4.18e- 23
#> 5 ENSG00000142185 0.0392 0.0177 -1.15 62.4 5.78e- 29 1.19e- 28
#> 6 ENSG00000142188 0.00892 0.00587 -0.603 14.2 9.73e- 8 1.20e- 7
#> 7 ENSG00000142192 0.0495 0.0426 -0.218 23.7 5.97e- 12 8.53e- 12
#> 8 ENSG00000142197 0.0105 0.00469 -1.16 103. 1.47e- 46 4.51e- 46
#> 9 ENSG00000142207 0.0568 0.0295 -0.947 351. 1.64e-154 6.57e-153
#> 10 ENSG00000154639 0.0233 0.0134 -0.803 107. 1.97e- 48 6.56e- 48
#> # ℹ 70 more rows
#>
#> Slot "betaTbl":
#> # A tibble: 80 × 11
#> geneId beta1 beta2 lfc fkMean1 fkMean2 fkVar1 fkVar2 tStats p
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 ENSG0… 2.68e-4 1.79e-4 -0.580 133. 142. 3905. 3779. 2.81e+0 1.77e- 2
#> 2 ENSG0… 2.65e-4 9.00e-5 -1.56 50.3 104. 239. 4039. 8.53e+1 5.58e-39
#> 3 ENSG0… 1.59e-4 6.10e-5 -1.38 92.7 142. 3342. 3116. 4.04e+1 2.38e-19
#> 4 ENSG0… 1.34e-4 5.08e-5 -1.40 81.8 99.0 1289. 1113. 1.80e+2 1.84e-80
#> 5 ENSG0… 1.84e-4 2.57e-4 0.481 36.3 120. 215. 5191. 9.87e-1 1.60e- 1
#> 6 ENSG0… 2.43e-5 1.84e-5 -0.397 111. 146. 2909. 3275. 6.41e+0 3.45e- 4
#> 7 ENSG0… 1.62e-4 1.43e-4 -0.186 76.1 151. 2719. 4207. 3.21e+0 1.13e- 2
#> 8 ENSG0… 4.12e-5 4.02e-5 -0.0365 76.9 142. 2620. 4717. 3.73e-2 7.85e- 1
#> 9 ENSG0… 1.47e-4 6.10e-5 -1.27 94.6 119. 2790. 2115. 8.38e+1 2.51e-38
#> 10 ENSG0… 1.59e-4 5.98e-5 -1.41 103. 142. 3851. 3592. 5.24e+1 1.40e-24
#> # ℹ 70 more rows
#> # ℹ 1 more variable: padj <dbl>
As expected given current literature, RNAPs tend to pause more downstream when NELF is absent due to Auxin treatment. As shown in the plot, the majority of treated genes have average pause locations of more than 100bp downstream of the TSS. Strikingly, few RNAPs could enter productive elongation after NELF depletion and treated samples had a signficantly lower pause-escape rate than the control samples exemplified by the beta violin plots.
(A) Conceptual illustration of model focusing on the kinetic model for RNAP movement on the DNA template. (B) Graphical model representation with unobserved continuous-time Markov chain (Z) and observed read counts (X). Read counts at each site are conditionally independent and Poisson distributed. (C) Design of SimPol simulator tracking the movement of in silico RNAPs across N-bp DNA templates in C cells, then samples the synthetic read counts based on RNAP positions. (D) Example of synthetic nascent RNA sequencing data from SimPol alongside matched real PRO-seq data from the DNAJA1 gene on chromosome 9 of the human genome
Simulate read counts by running simulations of polymerase movement along the gene at user-specified rates and gene lengths. Our simulator models the movement of RNAPs along a set of transcription units given a continuous-time Markov model. Nascent RNA sequencing read counts are then generated under steady-state conditions by sampling based on a Poisson distribution with mean equal to the simulated RNAP frequency multiplied by a user-specified sequencing depth. A graphical model representation (Figure B) with unobserved continuous-time Markov chain (Z_i) in which user-specified paramters for transcription initiation (alpha), pause-escape, and elongation (zeta) are included as well as the observed read counts (Xi). The initial model is extended to allow for variability across cells in promoter-proximal pause site locations and steric hindrance of transcription initiation from paused RNAPs (Figure C). More information on the simulator can be found in the sections A simple probabilistic model for transcription initiation, promoter-proximal pausing, and elongation, SimPol Simulator, Generation of synthetic NRS data of Zhao et al. (2023)
simpol <- simulatePolymerase(k=50, ksd=25, kMin=17, kMax=200, geneLen=1950,
alpha=2, beta=0.5, zeta=2000, zetaSd=1000, zetaMin=1500, zetaMax=2500,
cellNum=100, polSize=33, addSpace=17, time=39, timesToRecord=NULL)
#> Total time used for the simulation is 0.0166667 mins.
show(simpol)
#> A SimulatePolymerase object with:
#> - gene length = 1950
#> - number of cells = 100
#> - pause sites mean = 55.72
#> - pause sites std dev = 21.09
#> - top 10 most occupied sites across all cells:
#> position 0: 100 polymerases
#> position 4: 6 polymerases
#> position 54: 6 polymerases
#> position 39: 5 polymerases
#> position 1: 4 polymerases
#> position 3: 4 polymerases
#> position 32: 4 polymerases
#> position 51: 4 polymerases
#> position 53: 4 polymerases
#> position 6: 3 polymerases
#> - gene body average read counts = 0.04
#> - pause region average read counts = 2.6
#> - percent of nucleotides with zero reads = 94.6 %
#>
#> To access the full simulation parameters, use: parameters(object)
#>
#> To access the all sampled read counts, use: readCounts(object)
readCounts <- readCounts(simpol)
plotCombinedCells(simpol, file="combinedCellsLollipopPlot.png")
This uses the same core Expectation Maximization algorithm as the experimental data, but it is able to handle the simulated data as well.
simRates <- estimateTranscriptionRates(simpol, name="simRatesb0.5k50",
stericHindrance=TRUE)
#> Starting EM algorithm...
show(simRates)
#> A SimulationTranscriptionRates object with:
#> - Steric hindrance: TRUE
#> - Number of Trials with 5000 cells: 50
#>
#> Summary statistics for rate estimates:
#> - chi (gene body RNAP density): 0.28 RNAPs/bp
#> - betaOrg (ratio of gene body RNAP density to pause region
#> RNAP density, fixed sites): 0.5230
#> - betaAdp (ratio of gene body RNAP density to pause region
#> RNAP density, adapted model): 0.0036
#> - TSS Read Counts Averaged Across Trials: ~ 110 read
#> counts
#> - Gene Body Read Counts Averaged Across Trials: ~ 497 read
#> counts
#> - Landing Pad Read Counts Averaged Across Trials: ~ 4048
#> read counts
#> - phi (landing pad occupancy): 0.63
#>
#> EM Algorithm Convergence:
#> - Trials converged normally: 50/50 (100.0%)
#> - Trials converged to single site: 0/50 (0.0%)
#> - Trials reached max iterations without convergence: 0/50
#> (0.0%)
Likelihood ratio tests can also be applied to simulated data. In this example, the simulated rates for the first simpol object is compared to the simulated rates from the second simpol object. As expected given the simpol parameters, the plots show that the second simulated rates have higher beta and higher mean pause sites
simpol2 <- simulatePolymerase(k=100, ksd=25, kMin=75, kMax=200, geneLen=1950,
alpha=2, beta=10, zeta=2000, zetaSd=1000, zetaMin=1500, zetaMax=2500,
cellNum=100, polSize=33, addSpace=17, time=30, timesToRecord=NULL)
simRates2 <- estimateTranscriptionRates(simpol2, name="simRatesb10k100",
stericHindrance=TRUE)
simLRT <- likelihoodRatioTest(simRates, simRates2)
plotPauseSiteContourMapTwoConditions(simLRT, file="simPauseSitesContourMap.png")
BetaViolinPlot(simLRT, file="simBetaViolinPlot.png")
sessionInfo()
#> R version 4.5.1 Patched (2025-08-23 r88802)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] STADyUM_0.99.4
#>
#> loaded via a namespace (and not attached):
#> [1] SummarizedExperiment_1.39.2 gtable_0.3.6
#> [3] rjson_0.2.23 xfun_0.53
#> [5] bslib_0.9.0 ggplot2_4.0.0
#> [7] plyranges_1.29.1 Biobase_2.69.1
#> [9] lattice_0.22-7 vctrs_0.6.5
#> [11] tools_4.5.1 bitops_1.0-9
#> [13] generics_0.1.4 stats4_4.5.1
#> [15] curl_7.0.0 parallel_4.5.1
#> [17] tibble_3.3.0 pkgconfig_2.0.3
#> [19] Matrix_1.7-4 data.table_1.17.8
#> [21] RColorBrewer_1.1-3 S7_0.2.0
#> [23] S4Vectors_0.47.4 lifecycle_1.0.4
#> [25] compiler_4.5.1 farver_2.1.2
#> [27] textshaping_1.0.4 progress_1.2.3
#> [29] Rsamtools_2.25.3 Biostrings_2.77.2
#> [31] Seqinfo_0.99.2 codetools_0.2-20
#> [33] GenomeInfoDb_1.45.12 htmltools_0.5.8.1
#> [35] sass_0.4.10 RCurl_1.98-1.17
#> [37] yaml_2.3.10 tidyr_1.3.1
#> [39] pillar_1.11.1 crayon_1.5.3
#> [41] jquerylib_0.1.4 MASS_7.3-65
#> [43] BiocParallel_1.43.4 cachem_1.1.0
#> [45] DelayedArray_0.35.3 abind_1.4-8
#> [47] tidyselect_1.2.1 digest_0.6.37
#> [49] purrr_1.1.0 dplyr_1.1.4
#> [51] restfulr_0.0.16 labeling_0.4.3
#> [53] fastmap_1.2.0 grid_4.5.1
#> [55] cli_3.6.5 SparseArray_1.9.1
#> [57] magrittr_2.0.4 S4Arrays_1.9.1
#> [59] utf8_1.2.6 dichromat_2.0-0.1
#> [61] XML_3.99-0.19 withr_3.0.2
#> [63] prettyunits_1.2.0 scales_1.4.0
#> [65] UCSC.utils_1.5.0 rmarkdown_2.30
#> [67] XVector_0.49.1 httr_1.4.7
#> [69] matrixStats_1.5.0 ragg_1.5.0
#> [71] hms_1.1.3 evaluate_1.0.5
#> [73] knitr_1.50 BiocIO_1.19.0
#> [75] GenomicRanges_1.61.5 IRanges_2.43.5
#> [77] rtracklayer_1.69.1 rlang_1.1.6
#> [79] Rcpp_1.1.0 glue_1.8.0
#> [81] BiocGenerics_0.55.3 jsonlite_2.0.0
#> [83] R6_2.6.1 systemfonts_1.3.1
#> [85] MatrixGenerics_1.21.0 GenomicAlignments_1.45.4
Aoi, Yuki, Edwin R. Smith, Avani P. Shah, Emily J. Rendleman, Stacy A. Marshall, Ashley R. Woodfin, Fei X. Chen, Ramin Shiekhattar, and Ali Shilatifard. 2020. “NELF Regulates a Promoter-Proximal Step Distinct from Rna Pol Ii Pause-Release.” Molecular Cell 78 (2): 261–274.e5. https://doi.org/10.1016/j.molcel.2020.02.014.
Siepel, Adam. 2022. “A Unified Probabilistic Modeling Framework for Eukaryotic Transcription Based on Nascent Rna Sequencing Data.” bioRxiv. https://doi.org/10.1101/2021.01.12.426408.
Zhao, Yixin, Lingjie Liu, Rebecca Hassett, and Adam Siepel. 2023. “Model-Based Characterization of the Equilibrium Dynamics of Transcription Initiation and Promoter-Proximal Pausing in Human Cells.” Nucleic Acids Research 51 (21): e106–e106. https://doi.org/10.1093/nar/gkad843.