Comparing samples with SETA

Introduction

With compositional data analysis (CoDA), we can make sample-level comparisons of scRNAseq data based on the relative abundance celltypes that compose them.

In this vignette we show how to:

Calculate compositional distances between samples
Perform statistical tests with a focus on rank-based methods, which are amenable to relative data
Correlate celltype compositions and metadata
Create predictive models using popular statistical modeling methods

This example works with the Tabula Muris Senis lung dataset which can be downloaded directly with bioconductor It contains lung single-cell profiles from 14 donor mice (mouse_id). Each mouse is assigned to one of five age groups, from young adult to old, recorded in the age column. Standard cell-type labels (free_annotation) and additional metadata (e.g. sex) are also included.

This vignette is modular; users can update or replace the dataset-specific settings (e.g., metadata column names, reference cell type) as appropriate for their own data.

Installation

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("SETA")

Load libraries

library(SingleCellExperiment)
library(SETA)
library(ggplot2)
library(dplyr)
library(tidyr)
library(corrplot)
library(caret)
have_ml <- all(vapply(c("caret","glmnet"), requireNamespace, logical(1), quietly = TRUE))
if (!have_ml) {
    warning("Some machine learning packages are not installed. ML chunks will be skipped.")
}
library(TabulaMurisSenisData)

Load and prepare data

Quick data exploration:

table(sce$free_annotation, sce$age)

##                                    
##                                       1m   3m  18m  21m  30m
##   Adventitial Fibroblast              29   17   56   43   23
##   Airway Smooth Muscle                 2    0   10    0    0
##   Alveolar Epithelial Type 2           7    2   35   18   11
##   Alveolar Fibroblast                 87   37  166  101   26
##   Alveolar Macrophage                517  116  193   91  276
##   Artery                              13   16   32   26   11
##   B                                   98  135  459  198  180
##   Basophil                             9    8   19   46   40
##   CD4+ T                             143  122   72  107  106
##   CD8+ T                              38   70  234  335  193
##   Capillary                          194  119  617  522  343
##   Capillary Aerocyte                 125   35  129  148   93
##   Ccr7+ Dendritic                      2    2    2    5    7
##   Ciliated                            10    0    9    4    3
##   Classical Monocyte                 485  111  968  242 3615
##   Club                                 2    0    4    5    3
##   Intermediate Monocyte              111   29   95   19 1495
##   Interstitial Macrophage              7   13   16   21 1098
##   Ly6g5b+ T                            0    0   53   65   42
##   Lympatic                             6    2   11   12   10
##   Myeloid Dendritic Type 1            38   20   31   26   44
##   Myeloid Dendritic Type 2             8   10   17   16   26
##   Myofibroblast                       23   13    9    9    0
##   Natural Killer                     140  201  153  224  120
##   Natural Killer T                    20    7  152   50  191
##   Neuroendocrine                       0    0    0    0    0
##   Neutrophil                          64   22  413   29   22
##   Nonclassical Monocyte              157  160  187  194  287
##   Pericyte                            11    7    8    1    8
##   Plasma                               3    1    5   16   23
##   Plasmacytoid Dendritic              21    8   15   10   14
##   Proliferating Alveolar Macrophage   60    9    7    0   14
##   Proliferating Classical Monocyte     1    0   13    0 2473
##   Proliferating Dendritic              4    3    8    4    8
##   Proliferating NK                     5    2    5    3    9
##   Proliferating T                     15    9   11   18   36
##   Regulatory T                         8   11   28   20   55
##   Vein                                19   28   93  107   73
##   Zbtb32+ B                           26   18   41  216  115

# Set up a color palette for plots
# 3-group distinct categoricals
age_palette <- c(
    "1m" = "#90EE90",
    "3m" = "#4CBB17",
    "18m" = "#228B22",
    "21m" = "#355E3B",
    "30m" = "#023020")

# continuous palette of similar look
c_palette <- colorRampPalette(c("#3B9AB2", "#78B7C5",
                                "#EBCC2A", "#E1AF00",
                                "#F21A00"))(100)

Extracting the Taxonomic Counts Matrix

The setaCounts function extracts a cell-type counts matrix given that the object contains cell-level metadata. For this dataset, we want the cell-type annotations to come from Celltype and the sample identifiers from donor_id. If necessary, we reassign these columns before extraction.

df <- data.frame(colData(sce))

df$mouse.id <- gsub("/", "", df$mouse.id)

taxa_counts <- setaCounts(
    df,
    cell_type_col = "free_annotation",
    sample_col = "mouse.id",
    bc_col = "cell")

## Converting factor class columns to character: age, cell_ontology_class, cell_ontology_id, free_annotation, method, sex, subtissue, tissue, tissue_free_annotation, louvain, leiden

taxa_counts[1:5, 1:5]

##          
##           Adventitial Fibroblast Airway Smooth Muscle
##   1-M-62                      20                    1
##   1-M-63                       9                    1
##   18-F-50                      2                    1
##   18-F-51                     23                    8
##   18-M-52                     16                    1
##          
##           Alveolar Epithelial Type 2 Alveolar Fibroblast Alveolar Macrophage
##   1-M-62                           7                  35                 148
##   1-M-63                           0                  52                 369
##   18-F-50                          0                   4                  15
##   18-F-51                          4                  75                  88
##   18-M-52                         20                  36                  49

Prepare metadata

Extract sample-level metadata from a dataframe. It ensures that each metadata column contains unique values per sample. If a metadata column contains non-unique values within any sample, that column is excluded from the output, and the user is notified via a warning.

meta_df <- setaMetadata(
    df,
    sample_col = "mouse.id",
    meta_cols = c("age", "sex"))

meta_df[1:5, ]

##   sample_id age    sex
## 1   18-F-50 18m female
## 2   18-F-51 18m female
## 3   18-M-52 18m   male
## 4   18-M-53 18m   male
## 5   21-F-54 21m female

Calculate distances between samples

Normalize the data using Centered Log Ratio transformation (CLR)

clr_transformed <- setaTransform(taxa_counts, method = "CLR")

Compute distance in the CLR space

In compositional analysis, the Euclidean distance is the preferred metric once the bounding problem is solved, in other words, once we transform the data. John Aitchison, who wrote “The Statistical Analysis of Compositional Data” preferred the Euclidean distance of CLR transformed data, and this distance is now referred to as “Aitchison Distance”

dist_df <- setaDistances(clr_transformed)

# Merge metadata
merged_dist <- dist_df |>
    left_join(meta_df, by = c("from" = "sample_id")) |>
    left_join(meta_df, by = c("to" = "sample_id"), suffix = c(".from", ".to"))

# Create a age-age category for comparison
merged_dist$age_pair <- paste(merged_dist$age.from,
                            merged_dist$age.to,
                            sep = "-")

Visualize distances

ggplot(merged_dist, aes(x = age_pair, y = distance)) +
    geom_boxplot(fill = "grey90") +
    geom_jitter(width = 0.2, color = "black") +
    labs(title = "Aitchison Distances Between Age Groups",
        x = "Age Pair",
        y = "Aitchison Distance") +
    theme_minimal(base_size = 16) + 
    theme(axis.text.x = element_text(angle = 45, vjust = 0.5))

Perform wilcoxon rank-sum tests on CoDA transformed data

clr_long <- as.data.frame(clr_transformed$counts)
colnames(clr_long) <- c("sample", "Celltype", "CLR")
clr_long <- clr_long |>
    left_join(meta_df, by = c("sample" = "sample_id"))

# Apply pairwise Wilcoxon tests and plot using ggpubr
ggplot(clr_long, aes(x = age, y = CLR,
                    fill = age, color = age)) +
    geom_boxplot(position = position_dodge(0.8), alpha = 0.7) +
    geom_jitter(size = 1.5, shape = 21) +
    # stat_compare_means(method = "wilcox.test",
    #                    label = "p.signif",
    #                    comparisons = list(c("normal", "influenza"),
    #                                       c("normal", "COVID-19"),
    #                                       c("influenza", "COVID-19")),
    #                    position = position_dodge(0.8)) +
    facet_wrap(~ Celltype) +
    theme_minimal(base_size = 12) +
    scale_fill_manual(values = age_palette) +
    scale_color_manual(values = age_palette) +
    theme(axis.text.x = element_text(angle = 45, hjust = 0.5, vjust = 1)) +
    labs(title = "CLR by Celltype and Age",
        x = "Age",
        y = "CLR-transformed Composition")

Correlate celltype compositions with metadata

clr_df <- clr_transformed$counts |>
    data.frame() |> # as.data.frame converts it to long form
    pivot_wider(names_from='Var2',
                values_from = "Freq") |>
    rename(`mouse.id` = Var1)

clr_metadata <- clr_df |>
    left_join(meta_df, by = c("mouse.id" = "sample_id"))

clr_data <- clr_metadata |> select(where(is.numeric))

# One-hot encode 'age' and clean column names
oh <- model.matrix(~age - 1, data = clr_metadata)
colnames(oh) <- sub("^age", "", colnames(oh))
rownames(clr_data) <- clr_metadata$mouse.id

## Warning: Setting row names on a tibble is deprecated.

# Combine CLR data and metadata
combined <- cbind(clr_data, oh)

# Compute full correlation matrix
full_cor_mat <- cor(combined, method = "pearson")
p_mat <- cor.mtest(full_cor_mat)$p

Visualize correlations

corrplot(full_cor_mat,
        method = "circle",
        type = "full",
        addrect = 4,
        col = c_palette,
        p.mat = p_mat,
        sig.level = c(.001, .01, .05),
        insig = "label_sig",
        tl.cex = 0.8,
        pch.cex = 1.5,
        tl.col = "black",
        order = "hclust",
        diag = FALSE)

Use SETA transformed data to create predictive models with caret

set.seed(687)
train_df <- clr_metadata |>
    select(-`mouse.id`) |>
    mutate(age = factor(age))

train_control <- trainControl(method = "cv", number = 5)

model <- train(age ~ ., data = subset(train_df),
                method = "glmnet",
                trControl = train_control)

importance <- varImp(model)

plot(importance,
    main = "Cross-Validated Variable Importance",
    sub = "Caret GLMnet model"
)

Conclusion

Compositional data provides an intuitive basis for making sample-level comparisons in scRNAseq data, as it provides a single sample x feature space within which to make comparisons. Let us know if you have more ideas for using CoDA!

Session Info

sessionInfo()

## R version 4.5.1 Patched (2025-08-23 r88802)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] rhdf5_2.54.0                TabulaMurisSenisData_1.15.1
##  [3] caret_7.0-1                 lattice_0.22-7             
##  [5] corrplot_0.95               tidyr_1.3.1                
##  [7] dplyr_1.1.4                 ggplot2_4.0.0              
##  [9] SETA_1.0.0                  SingleCellExperiment_1.32.0
## [11] SummarizedExperiment_1.40.0 Biobase_2.70.0             
## [13] GenomicRanges_1.62.0        Seqinfo_1.0.0              
## [15] IRanges_2.44.0              S4Vectors_0.48.0           
## [17] BiocGenerics_0.56.0         generics_0.1.4             
## [19] MatrixGenerics_1.22.0       matrixStats_1.5.0          
## 
## loaded via a namespace (and not attached):
##   [1] DBI_1.2.3            pROC_1.19.0.1        httr2_1.2.1         
##   [4] rlang_1.1.6          magrittr_2.0.4       e1071_1.7-16        
##   [7] compiler_4.5.1       RSQLite_2.4.3        gdata_3.0.1         
##  [10] png_0.1-8            vctrs_0.6.5          reshape2_1.4.4      
##  [13] stringr_1.5.2        crayon_1.5.3         pkgconfig_2.0.3     
##  [16] shape_1.4.6.1        fastmap_1.2.0        dbplyr_2.5.1        
##  [19] XVector_0.50.0       labeling_0.4.3       rmarkdown_2.30      
##  [22] prodlim_2025.04.28   bit_4.6.0            purrr_1.1.0         
##  [25] xfun_0.53            glmnet_4.1-10        cachem_1.1.0        
##  [28] jsonlite_2.0.0       blob_1.2.4           recipes_1.3.1       
##  [31] rhdf5filters_1.22.0  DelayedArray_0.36.0  Rhdf5lib_1.32.0     
##  [34] parallel_4.5.1       R6_2.6.1             bslib_0.9.0         
##  [37] stringi_1.8.7        RColorBrewer_1.1-3   parallelly_1.45.1   
##  [40] rpart_4.1.24         lubridate_1.9.4      jquerylib_0.1.4     
##  [43] Rcpp_1.1.0           iterators_1.0.14     knitr_1.50          
##  [46] future.apply_1.20.0  Matrix_1.7-4         splines_4.5.1       
##  [49] nnet_7.3-20          igraph_2.2.1         timechange_0.3.0    
##  [52] tidyselect_1.2.1     dichromat_2.0-0.1    abind_1.4-8         
##  [55] yaml_2.3.10          timeDate_4051.111    codetools_0.2-20    
##  [58] curl_7.0.0           listenv_0.9.1        tibble_3.3.0        
##  [61] plyr_1.8.9           KEGGREST_1.50.0      withr_3.0.2         
##  [64] S7_0.2.0             evaluate_1.0.5       future_1.67.0       
##  [67] survival_3.8-3       proxy_0.4-27         BiocFileCache_3.0.0 
##  [70] ExperimentHub_3.0.0  Biostrings_2.78.0    filelock_1.0.3      
##  [73] pillar_1.11.1        BiocManager_1.30.26  foreach_1.5.2       
##  [76] BiocVersion_3.22.0   scales_1.4.0         gtools_3.9.5        
##  [79] BiocStyle_2.38.0     globals_0.18.0       class_7.3-23        
##  [82] glue_1.8.0           tools_4.5.1          AnnotationHub_4.0.0 
##  [85] data.table_1.17.8    ModelMetrics_1.2.2.2 gower_1.0.2         
##  [88] tidygraph_1.3.1      grid_4.5.1           AnnotationDbi_1.72.0
##  [91] ipred_0.9-15         nlme_3.1-168         HDF5Array_1.38.0    
##  [94] cli_3.6.5            rappdirs_0.3.3       S4Arrays_1.10.0     
##  [97] lava_1.8.1           gtable_0.3.6         sass_0.4.10         
## [100] digest_0.6.37        SparseArray_1.10.0   farver_2.1.2        
## [103] memoise_2.0.1        htmltools_0.5.8.1    lifecycle_1.0.4     
## [106] h5mread_1.2.0        httr_1.4.7           hardhat_1.4.2       
## [109] bit64_4.6.0-1        MASS_7.3-65