With compositional data analysis (CoDA), we can make sample-level comparisons of scRNAseq data based on the relative abundance celltypes that compose them.
In this vignette we show how to:
This example works with the Tabula Muris Senis lung dataset which can be downloaded directly with bioconductor It contains lung single-cell profiles from 14 donor mice (mouse_id). Each mouse is assigned to one of five age groups, from young adult to old, recorded in the age column. Standard cell-type labels (free_annotation) and additional metadata (e.g. sex) are also included.
This vignette is modular; users can update or replace the dataset-specific settings (e.g., metadata column names, reference cell type) as appropriate for their own data.
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("SETA")library(SingleCellExperiment)
library(SETA)
library(ggplot2)
library(dplyr)
library(tidyr)
library(corrplot)
library(caret)
have_ml <- all(vapply(c("caret","glmnet"), requireNamespace, logical(1), quietly = TRUE))
if (!have_ml) {
    warning("Some machine learning packages are not installed. ML chunks will be skipped.")
}
library(TabulaMurisSenisData)Quick data exploration:
table(sce$free_annotation, sce$age)##                                    
##                                       1m   3m  18m  21m  30m
##   Adventitial Fibroblast              29   17   56   43   23
##   Airway Smooth Muscle                 2    0   10    0    0
##   Alveolar Epithelial Type 2           7    2   35   18   11
##   Alveolar Fibroblast                 87   37  166  101   26
##   Alveolar Macrophage                517  116  193   91  276
##   Artery                              13   16   32   26   11
##   B                                   98  135  459  198  180
##   Basophil                             9    8   19   46   40
##   CD4+ T                             143  122   72  107  106
##   CD8+ T                              38   70  234  335  193
##   Capillary                          194  119  617  522  343
##   Capillary Aerocyte                 125   35  129  148   93
##   Ccr7+ Dendritic                      2    2    2    5    7
##   Ciliated                            10    0    9    4    3
##   Classical Monocyte                 485  111  968  242 3615
##   Club                                 2    0    4    5    3
##   Intermediate Monocyte              111   29   95   19 1495
##   Interstitial Macrophage              7   13   16   21 1098
##   Ly6g5b+ T                            0    0   53   65   42
##   Lympatic                             6    2   11   12   10
##   Myeloid Dendritic Type 1            38   20   31   26   44
##   Myeloid Dendritic Type 2             8   10   17   16   26
##   Myofibroblast                       23   13    9    9    0
##   Natural Killer                     140  201  153  224  120
##   Natural Killer T                    20    7  152   50  191
##   Neuroendocrine                       0    0    0    0    0
##   Neutrophil                          64   22  413   29   22
##   Nonclassical Monocyte              157  160  187  194  287
##   Pericyte                            11    7    8    1    8
##   Plasma                               3    1    5   16   23
##   Plasmacytoid Dendritic              21    8   15   10   14
##   Proliferating Alveolar Macrophage   60    9    7    0   14
##   Proliferating Classical Monocyte     1    0   13    0 2473
##   Proliferating Dendritic              4    3    8    4    8
##   Proliferating NK                     5    2    5    3    9
##   Proliferating T                     15    9   11   18   36
##   Regulatory T                         8   11   28   20   55
##   Vein                                19   28   93  107   73
##   Zbtb32+ B                           26   18   41  216  115# Set up a color palette for plots
# 3-group distinct categoricals
age_palette <- c(
    "1m" = "#90EE90",
    "3m" = "#4CBB17",
    "18m" = "#228B22",
    "21m" = "#355E3B",
    "30m" = "#023020")
# continuous palette of similar look
c_palette <- colorRampPalette(c("#3B9AB2", "#78B7C5",
                                "#EBCC2A", "#E1AF00",
                                "#F21A00"))(100)The setaCounts function extracts a cell-type counts matrix given that the object contains cell-level metadata. For this dataset, we want the cell-type annotations to come from Celltype and the sample identifiers from donor_id. If necessary, we reassign these columns before extraction.
df <- data.frame(colData(sce))
df$mouse.id <- gsub("/", "", df$mouse.id)
taxa_counts <- setaCounts(
    df,
    cell_type_col = "free_annotation",
    sample_col = "mouse.id",
    bc_col = "cell")## Converting factor class columns to character: age, cell_ontology_class, cell_ontology_id, free_annotation, method, sex, subtissue, tissue, tissue_free_annotation, louvain, leidentaxa_counts[1:5, 1:5]##          
##           Adventitial Fibroblast Airway Smooth Muscle
##   1-M-62                      20                    1
##   1-M-63                       9                    1
##   18-F-50                      2                    1
##   18-F-51                     23                    8
##   18-M-52                     16                    1
##          
##           Alveolar Epithelial Type 2 Alveolar Fibroblast Alveolar Macrophage
##   1-M-62                           7                  35                 148
##   1-M-63                           0                  52                 369
##   18-F-50                          0                   4                  15
##   18-F-51                          4                  75                  88
##   18-M-52                         20                  36                  49Extract sample-level metadata from a dataframe. It ensures that each metadata column contains unique values per sample. If a metadata column contains non-unique values within any sample, that column is excluded from the output, and the user is notified via a warning.
meta_df <- setaMetadata(
    df,
    sample_col = "mouse.id",
    meta_cols = c("age", "sex"))
meta_df[1:5, ]##   sample_id age    sex
## 1   18-F-50 18m female
## 2   18-F-51 18m female
## 3   18-M-52 18m   male
## 4   18-M-53 18m   male
## 5   21-F-54 21m femaleNormalize the data using Centered Log Ratio transformation (CLR)
clr_transformed <- setaTransform(taxa_counts, method = "CLR")Compute distance in the CLR space
In compositional analysis, the Euclidean distance is the preferred metric once the bounding problem is solved, in other words, once we transform the data. John Aitchison, who wrote “The Statistical Analysis of Compositional Data” preferred the Euclidean distance of CLR transformed data, and this distance is now referred to as “Aitchison Distance”
dist_df <- setaDistances(clr_transformed)
# Merge metadata
merged_dist <- dist_df |>
    left_join(meta_df, by = c("from" = "sample_id")) |>
    left_join(meta_df, by = c("to" = "sample_id"), suffix = c(".from", ".to"))
# Create a age-age category for comparison
merged_dist$age_pair <- paste(merged_dist$age.from,
                            merged_dist$age.to,
                            sep = "-")Visualize distances
ggplot(merged_dist, aes(x = age_pair, y = distance)) +
    geom_boxplot(fill = "grey90") +
    geom_jitter(width = 0.2, color = "black") +
    labs(title = "Aitchison Distances Between Age Groups",
        x = "Age Pair",
        y = "Aitchison Distance") +
    theme_minimal(base_size = 16) + 
    theme(axis.text.x = element_text(angle = 45, vjust = 0.5))clr_long <- as.data.frame(clr_transformed$counts)
colnames(clr_long) <- c("sample", "Celltype", "CLR")
clr_long <- clr_long |>
    left_join(meta_df, by = c("sample" = "sample_id"))# Apply pairwise Wilcoxon tests and plot using ggpubr
ggplot(clr_long, aes(x = age, y = CLR,
                    fill = age, color = age)) +
    geom_boxplot(position = position_dodge(0.8), alpha = 0.7) +
    geom_jitter(size = 1.5, shape = 21) +
    # stat_compare_means(method = "wilcox.test",
    #                    label = "p.signif",
    #                    comparisons = list(c("normal", "influenza"),
    #                                       c("normal", "COVID-19"),
    #                                       c("influenza", "COVID-19")),
    #                    position = position_dodge(0.8)) +
    facet_wrap(~ Celltype) +
    theme_minimal(base_size = 12) +
    scale_fill_manual(values = age_palette) +
    scale_color_manual(values = age_palette) +
    theme(axis.text.x = element_text(angle = 45, hjust = 0.5, vjust = 1)) +
    labs(title = "CLR by Celltype and Age",
        x = "Age",
        y = "CLR-transformed Composition")clr_df <- clr_transformed$counts |>
    data.frame() |> # as.data.frame converts it to long form
    pivot_wider(names_from='Var2',
                values_from = "Freq") |>
    rename(`mouse.id` = Var1)
clr_metadata <- clr_df |>
    left_join(meta_df, by = c("mouse.id" = "sample_id"))
clr_data <- clr_metadata |> select(where(is.numeric))
# One-hot encode 'age' and clean column names
oh <- model.matrix(~age - 1, data = clr_metadata)
colnames(oh) <- sub("^age", "", colnames(oh))
rownames(clr_data) <- clr_metadata$mouse.id## Warning: Setting row names on a tibble is deprecated.# Combine CLR data and metadata
combined <- cbind(clr_data, oh)
# Compute full correlation matrix
full_cor_mat <- cor(combined, method = "pearson")
p_mat <- cor.mtest(full_cor_mat)$pVisualize correlations
corrplot(full_cor_mat,
        method = "circle",
        type = "full",
        addrect = 4,
        col = c_palette,
        p.mat = p_mat,
        sig.level = c(.001, .01, .05),
        insig = "label_sig",
        tl.cex = 0.8,
        pch.cex = 1.5,
        tl.col = "black",
        order = "hclust",
        diag = FALSE)set.seed(687)
train_df <- clr_metadata |>
    select(-`mouse.id`) |>
    mutate(age = factor(age))
train_control <- trainControl(method = "cv", number = 5)
model <- train(age ~ ., data = subset(train_df),
                method = "glmnet",
                trControl = train_control)
importance <- varImp(model)plot(importance,
    main = "Cross-Validated Variable Importance",
    sub = "Caret GLMnet model"
)Compositional data provides an intuitive basis for making sample-level comparisons in scRNAseq data, as it provides a single sample x feature space within which to make comparisons. Let us know if you have more ideas for using CoDA!
sessionInfo()## R version 4.5.1 Patched (2025-08-23 r88802)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] rhdf5_2.54.0                TabulaMurisSenisData_1.15.1
##  [3] caret_7.0-1                 lattice_0.22-7             
##  [5] corrplot_0.95               tidyr_1.3.1                
##  [7] dplyr_1.1.4                 ggplot2_4.0.0              
##  [9] SETA_1.0.0                  SingleCellExperiment_1.32.0
## [11] SummarizedExperiment_1.40.0 Biobase_2.70.0             
## [13] GenomicRanges_1.62.0        Seqinfo_1.0.0              
## [15] IRanges_2.44.0              S4Vectors_0.48.0           
## [17] BiocGenerics_0.56.0         generics_0.1.4             
## [19] MatrixGenerics_1.22.0       matrixStats_1.5.0          
## 
## loaded via a namespace (and not attached):
##   [1] DBI_1.2.3            pROC_1.19.0.1        httr2_1.2.1         
##   [4] rlang_1.1.6          magrittr_2.0.4       e1071_1.7-16        
##   [7] compiler_4.5.1       RSQLite_2.4.3        gdata_3.0.1         
##  [10] png_0.1-8            vctrs_0.6.5          reshape2_1.4.4      
##  [13] stringr_1.5.2        crayon_1.5.3         pkgconfig_2.0.3     
##  [16] shape_1.4.6.1        fastmap_1.2.0        dbplyr_2.5.1        
##  [19] XVector_0.50.0       labeling_0.4.3       rmarkdown_2.30      
##  [22] prodlim_2025.04.28   bit_4.6.0            purrr_1.1.0         
##  [25] xfun_0.53            glmnet_4.1-10        cachem_1.1.0        
##  [28] jsonlite_2.0.0       blob_1.2.4           recipes_1.3.1       
##  [31] rhdf5filters_1.22.0  DelayedArray_0.36.0  Rhdf5lib_1.32.0     
##  [34] parallel_4.5.1       R6_2.6.1             bslib_0.9.0         
##  [37] stringi_1.8.7        RColorBrewer_1.1-3   parallelly_1.45.1   
##  [40] rpart_4.1.24         lubridate_1.9.4      jquerylib_0.1.4     
##  [43] Rcpp_1.1.0           iterators_1.0.14     knitr_1.50          
##  [46] future.apply_1.20.0  Matrix_1.7-4         splines_4.5.1       
##  [49] nnet_7.3-20          igraph_2.2.1         timechange_0.3.0    
##  [52] tidyselect_1.2.1     dichromat_2.0-0.1    abind_1.4-8         
##  [55] yaml_2.3.10          timeDate_4051.111    codetools_0.2-20    
##  [58] curl_7.0.0           listenv_0.9.1        tibble_3.3.0        
##  [61] plyr_1.8.9           KEGGREST_1.50.0      withr_3.0.2         
##  [64] S7_0.2.0             evaluate_1.0.5       future_1.67.0       
##  [67] survival_3.8-3       proxy_0.4-27         BiocFileCache_3.0.0 
##  [70] ExperimentHub_3.0.0  Biostrings_2.78.0    filelock_1.0.3      
##  [73] pillar_1.11.1        BiocManager_1.30.26  foreach_1.5.2       
##  [76] BiocVersion_3.22.0   scales_1.4.0         gtools_3.9.5        
##  [79] BiocStyle_2.38.0     globals_0.18.0       class_7.3-23        
##  [82] glue_1.8.0           tools_4.5.1          AnnotationHub_4.0.0 
##  [85] data.table_1.17.8    ModelMetrics_1.2.2.2 gower_1.0.2         
##  [88] tidygraph_1.3.1      grid_4.5.1           AnnotationDbi_1.72.0
##  [91] ipred_0.9-15         nlme_3.1-168         HDF5Array_1.38.0    
##  [94] cli_3.6.5            rappdirs_0.3.3       S4Arrays_1.10.0     
##  [97] lava_1.8.1           gtable_0.3.6         sass_0.4.10         
## [100] digest_0.6.37        SparseArray_1.10.0   farver_2.1.2        
## [103] memoise_2.0.1        htmltools_0.5.8.1    lifecycle_1.0.4     
## [106] h5mread_1.2.0        httr_1.4.7           hardhat_1.4.2       
## [109] bit64_4.6.0-1        MASS_7.3-65