1 Overview

This vignette compares the two main models in the CLAMP package:

  1. CLAMPbase: Unsupervised matrix factorization for dimensionality reduction without prior knowledge.
  2. CLAMPfull: Pathway-guided refinement that integrates biological prior knowledge to improve interpretability.

Using a small human whole blood RNA-Seq dataset, we demonstrate that incorporating pathway priors in CLAMPfull improves the biological interpretability of latent variables compared to the baseline CLAMPbase model.

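We illustrate how to load and normalize the data, load prior knowledge matrices, compute the SVD and infer k, fit both models, compare their latent variables, inspect the returned matrices, and verify the file-backed matrix (FBM) implementation.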

1.1 1. Load Data and Normalize

data("dataWholeBlood")
data("majorCellTypes")
data("celltypeTargets")

# Scale each gene to mean 0 and variance 1
dataWholeBlood <- tscale(dataWholeBlood)
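
To make the normalization step concrete, here is a minimal base R sketch of gene-wise standardization on a toy matrix. It assumes genes are in rows and samples in columns, and that `tscale` behaves like a row-wise z-score; the internals of `tscale` are not shown in this vignette, so treat this as an illustration of the idea rather than the package's implementation.

```r
# Toy genes x samples matrix (rows = genes, columns = samples)
set.seed(42)
x <- matrix(
    rnorm(5 * 8, mean = 10, sd = 3),
    nrow = 5,
    dimnames = list(paste0("gene", 1:5), paste0("s", 1:8))
)

# scale() standardizes columns, so transpose, scale, and transpose back
x_scaled <- t(scale(t(x)))

# Each gene (row) should now have mean ~0 and variance ~1
round(rowMeans(x_scaled), 10)
round(apply(x_scaled, 1, var), 10)
```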

1.2 2. Load Prior Knowledge Matrices

# How to download pathway and cell marker libraries from Enrichr.
# Not run during vignette build to avoid network calls; pre-fetched
# .rds files are loaded in the next chunk instead.
enrichr_url <- "https://maayanlab.cloud/Enrichr/geneSetLibrary"
gmtList <- list(
    CellMarkers = getGMT(
        paste0(enrichr_url, "?mode=text&libraryName=CellMarker_2024"),
        "CellMarker_2024"
    ),
    KEGG = getGMT(
        paste0(enrichr_url, "?mode=text&libraryName=KEGG_2021_Human"),
        "KEGG_2021_Human"
    )
)

# Load pre-fetched gene set libraries bundled with the package
gmtList <- list(
    CellMarkers = readRDS(
        system.file("extdata", "CellMarker_2024.rds", package = "CLAMP")
    ),
    KEGG = readRDS(
        system.file("extdata", "KEGG_2021_Human.rds", package = "CLAMP")
    )
)

# Combine into a single sparse matrix
pathMatCell <- gmtListToSparseMat(gmtList)

# Load additional xCell reference matrix
data("xCell")

# Match pathways to the gene space of whole blood
matchedPathsWB <- getMatchedPathwayMatList(
    pathMatCell,
    xCell,
    new.genes = rownames(dataWholeBlood),
    min.genes = 2
)
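
The matching step can be pictured as building a binary gene-by-pathway membership matrix restricted to the measured genes, then dropping pathways with too few members. The sketch below uses base R and toy gene sets; the set names and memberships are made up for illustration, and it is not the code behind `getMatchedPathwayMatList`, whose exact behavior may differ.

```r
# Toy gene sets, standing in for a parsed GMT library (memberships are made up)
gene_sets <- list(
    pathwayA = c("GAS6", "MMP14", "MARCKSL1"),
    pathwayB = c("MMP14", "TP53"),
    pathwayC = c("BRCA1")
)

# Genes measured in the (hypothetical) expression matrix
expr_genes <- c("GAS6", "MMP14", "MARCKSL1", "ACTB")

# Build a binary genes x pathways membership matrix over the measured genes
membership <- vapply(
    gene_sets,
    function(gs) as.numeric(expr_genes %in% gs),
    numeric(length(expr_genes))
)
rownames(membership) <- expr_genes

# Keep only pathways with at least `min_genes` measured member genes
min_genes <- 2
keep <- colSums(membership) >= min_genes
membership <- membership[, keep, drop = FALSE]
membership
```

Using `drop = FALSE` keeps the result a matrix even when a single pathway survives the filter, which matters for any downstream code that expects two dimensions.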

1.3 3. Compute SVD and Infer k

set.seed(1)
wb_svd_k <- select_svd_k(dataWholeBlood)
wb_svd <- compute_svd(dataWholeBlood, k = wb_svd_k)
wb_clamp_k <- select_clamp_k(wb_svd,
    n_samples = ncol(dataWholeBlood),
    svd_k = wb_svd_k
)
wb_clamp_k
## [1] 8
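
The criteria used by `select_svd_k` and `select_clamp_k` are not described in this vignette. As a rough illustration of the general idea of choosing a rank from singular values, the sketch below applies a simple cumulative-variance threshold to a toy low-rank matrix using base R `svd()`. The 95% threshold and all variable names are assumptions for this example only.

```r
set.seed(1)

# Toy genes x samples matrix: rank-3 signal plus small noise
n_genes <- 200
n_samples <- 30
true_rank <- 3
signal <- matrix(rnorm(n_genes * true_rank), n_genes) %*%
    matrix(rnorm(true_rank * n_samples), true_rank)
x <- signal + matrix(rnorm(n_genes * n_samples, sd = 0.1), n_genes)

# Full SVD; s$d holds the singular values in decreasing order
s <- svd(x)

# Fraction of total variance captured by the first k components
var_explained <- cumsum(s$d^2) / sum(s$d^2)

# Smallest k whose components capture at least 95% of the variance
k <- which(var_explained >= 0.95)[1]
k
```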

1.4 4. Fit CLAMPbase and CLAMPfull

wb_clamp_base <- CLAMPbase(
    dataWholeBlood,
    svdres     = wb_svd,
    clamp_k    = wb_clamp_k,
    trace      = FALSE,
    adaptive.p = 0.05
)
wb_clamp_full <- CLAMPfull(
    dataWholeBlood,
    priorMat          = matchedPathsWB,
    svdres            = wb_svd,
    clamp.base.result = wb_clamp_base,
    clamp_k           = wb_clamp_k,
    trace             = TRUE,
    use_cpp           = TRUE
)

1.5 5. Compare CLAMPbase vs CLAMPfull

The plot produced below compares, for each major blood cell type, the maximum Spearman correlation achieved under CLAMPbase (x-axis) and CLAMPfull (y-axis).

Points above the red dashed line indicate improved correspondence when biological priors are included.

Most cell types show higher correlations under CLAMPfull, demonstrating that integrating pathway information helps capture more biologically meaningful latent variables.

output <- compareBs(
    wb_clamp_base,
    wb_clamp_full,
    celltypeTargets,
    method = "s",
    xlab   = "CLAMPbase",
    ylab   = "CLAMPfull"
)
## [1]  8 36
## [1]  8 36
output$plot
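
The quantity plotted here, a maximum Spearman correlation per cell type, can be sketched in base R as follows. This is not the implementation of `compareBs`; the helper `max_cor`, the cell type names, and the layout of the target matrix (samples in rows, cell types in columns) are assumptions made for the example.

```r
set.seed(1)
n_samples <- 36
samples <- paste0("S", seq_len(n_samples))

# Hypothetical cell-type proportions: samples x cell types
targets <- matrix(
    runif(n_samples * 3),
    n_samples,
    dimnames = list(samples, c("Neutrophil", "Monocyte", "Tcell"))
)

# Toy LV-by-sample matrices standing in for the B matrices of two fits
make_B <- function(noise_sd) {
    B <- t(targets) + matrix(rnorm(3 * n_samples, sd = noise_sd), 3)
    dimnames(B) <- list(paste0("LV", 1:3), samples)
    B
}
B_noisy <- make_B(0.5)
B_clean <- make_B(0.1)

# For each cell type, the best Spearman correlation achieved by any LV
max_cor <- function(B, targets) {
    r <- cor(t(B), targets, method = "spearman") # LVs x cell types
    apply(r, 2, max)
}

data.frame(
    noisy = max_cor(B_noisy, targets),
    clean = max_cor(B_clean, targets)
)
```

Plotting one column against the other with an identity line would give a picture of the same kind as the comparison above.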

1.6 6. Inspect Named Matrix Outputs

CLAMPbase and CLAMPfull now return B (sample scores, LVs × samples) and Z (gene loadings, genes × LVs) as proper named matrices.

# B: sample scores (LVs × samples)
dim(wb_clamp_full$B)
## [1]  8 36
wb_clamp_full$B[1:3, 1:4]
##         BD8001    BD8002      BD8003    BD8004
## LV1  1.5208805  1.019819  1.52010647  1.723528
## LV2 -0.3858266  1.290221 -2.01065989 -2.148343
## LV3 -0.7590255 -1.089290 -0.02649019 -3.321543
# Z: gene loadings (genes × LVs)
dim(wb_clamp_full$Z)
## [1] 11530     8
wb_clamp_full$Z[1:3, 1:4]
##          LV1       LV2 LV3 LV4
## GAS6       0 0.0000000   0   0
## MMP14      0 0.0000000   0   0
## MARCKSL1   0 0.3406663   0   0
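
Because Z carries gene names as row names and LV names as column names, listing the genes that drive a latent variable reduces to sorting one named column. The sketch below works on a small toy matrix; `top_genes` is a hypothetical helper written for this example, not a CLAMP function.

```r
# Toy loading matrix: genes x LVs, mostly zero like the Z shown above
Z <- matrix(
    0,
    nrow = 6, ncol = 2,
    dimnames = list(paste0("gene", 1:6), c("LV1", "LV2"))
)
Z[c("gene2", "gene5"), "LV1"] <- c(0.8, 0.3)
Z[c("gene1", "gene3", "gene6"), "LV2"] <- c(0.2, 0.9, 0.5)

# Names of the n genes with the largest non-zero loadings on one LV
top_genes <- function(Z, lv, n = 2) {
    loadings <- Z[, lv]
    loadings <- loadings[loadings != 0]
    ranked <- names(sort(loadings, decreasing = TRUE))
    ranked[seq_len(min(n, length(loadings)))]
}

top_genes(Z, "LV1")
top_genes(Z, "LV2")
```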

1.7 7. Verify FBM Implementation

We verify that CLAMPfull produces identical results when using a file-backed matrix (FBM) input instead of an in-memory matrix. We reuse the same pre-computed SVD and clamp_k so that any differences are attributable solely to the matrix format, not the randomized SVD.

dataWholeBloodFBM <- bigstatsr::as_FBM(dataWholeBlood)

wb_clamp_full_fbm <- CLAMPfull(
    dataWholeBloodFBM,
    priorMat          = matchedPathsWB,
    svdres            = wb_svd,
    clamp.base.result = wb_clamp_base,
    clamp_k           = wb_clamp_k,
    trace             = TRUE,
    use_cpp           = TRUE
)

The FBM implementation produces identical results:

output <- compareBs(
    wb_clamp_full,
    wb_clamp_full_fbm,
    celltypeTargets,
    method = "s",
    xlab   = "CLAMPfull (matrix)",
    ylab   = "CLAMPfull (FBM)"
)
## [1]  8 36
## [1]  8 36
output$plot
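
Beyond the visual comparison, agreement between two fits can also be checked numerically. The sketch below uses toy matrices to show the difference between exact and tolerance-based equality in base R; the matrix names and the tolerance are assumptions for this example.

```r
set.seed(1)

# Toy stand-ins for the same matrix returned by two fits
B_mem <- matrix(
    rnorm(8 * 36),
    8,
    dimnames = list(paste0("LV", 1:8), paste0("S", 1:36))
)
B_fbm <- B_mem + 1e-12 # tiny floating-point discrepancy

# identical() demands bit-for-bit equality; all.equal() allows a tolerance
identical(B_mem, B_fbm)
isTRUE(all.equal(B_mem, B_fbm, tolerance = 1e-8))

# Largest absolute difference, useful for reporting
max(abs(B_mem - B_fbm))
```

With the objects from this vignette, the analogous check would be `all.equal(wb_clamp_full$B, wb_clamp_full_fbm$B)` and the same for `Z`, assuming the FBM fit returns `B` and `Z` in the same named-matrix form.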

2 Session Information

sessionInfo()
## R version 4.6.0 RC (2026-04-17 r89917)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] bigstatsr_1.6.2  CLAMP_0.99.0     BiocStyle_2.40.0
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6          circlize_0.4.18       shape_1.4.6.1        
##  [4] rjson_0.2.23          xfun_0.57             bslib_0.11.0         
##  [7] ggplot2_4.0.3         GlobalOptions_0.1.4   ggrepel_0.9.8        
## [10] lattice_0.22-9        bigassertr_0.1.7      ps_1.9.3             
## [13] vctrs_0.7.3           tools_4.6.0           generics_0.1.4       
## [16] stats4_4.6.0          parallel_4.6.0        tibble_3.3.1         
## [19] cluster_2.1.8.2       pkgconfig_2.0.3       Matrix_1.7-5         
## [22] RColorBrewer_1.1-3    S7_0.2.2              S4Vectors_0.50.1     
## [25] lifecycle_1.0.5       compiler_4.6.0        farver_2.1.2         
## [28] tinytex_0.59          bigparallelr_0.3.2    codetools_0.2-20     
## [31] ComplexHeatmap_2.28.0 clue_0.3-68           htmltools_0.5.9      
## [34] sass_0.4.10           yaml_2.3.12           glmnet_5.0           
## [37] pillar_1.11.1         crayon_1.5.3          jquerylib_0.1.4      
## [40] cachem_1.1.0          magick_2.9.1          iterators_1.0.14     
## [43] foreach_1.5.2         rsvd_1.0.5            tidyselect_1.2.1     
## [46] digest_0.6.39         dplyr_1.2.1           bookdown_0.46        
## [49] labeling_0.4.3        splines_4.6.0         cowplot_1.2.0        
## [52] fastmap_1.2.0         grid_4.6.0            colorspace_2.1-2     
## [55] cli_3.6.6             magrittr_2.0.5        dichromat_2.0-0.1    
## [58] survival_3.8-6        withr_3.0.2           scales_1.4.0         
## [61] rmarkdown_2.31        matrixStats_1.5.0     rmio_0.4.0           
## [64] bit_4.6.0             otel_0.2.0            png_0.1-9            
## [67] GetoptLong_1.1.1      evaluate_1.0.5        ff_4.5.2             
## [70] knitr_1.51            IRanges_2.46.0        doParallel_1.0.17    
## [73] irlba_2.3.7           rlang_1.2.0           Rcpp_1.1.1-1.1       
## [76] glue_1.8.1            BiocManager_1.30.27   BiocGenerics_0.58.1  
## [79] jsonlite_2.0.0        R6_2.6.1              flock_0.7