This tutorial provides a comprehensive introduction to MeLSI (Metric Learning for Statistical Inference) for microbiome data analysis.
MeLSI is a novel machine learning method that learns optimal distance metrics to improve statistical power in detecting group differences in microbiome data. Unlike traditional distance metrics (Bray-Curtis, Euclidean, Jaccard), MeLSI adapts to the specific characteristics of your dataset. This package provides a statistically rigorous approach to beta diversity analysis that automatically identifies which microbial features drive group differences, making it particularly valuable for biomarker discovery and hypothesis generation in microbiome research.
MeLSI addresses a fundamental limitation in current microbiome beta diversity analysis: traditional distance metrics treat all microbial taxa equally, missing subtle but biologically important differences. MeLSI learns adaptive distance metrics optimized for your specific dataset while maintaining rigorous statistical validity through permutation testing.
MeLSI complements existing Bioconductor microbiome analysis packages:
phyloseq and microbiome: These packages provide comprehensive data structures and analysis pipelines for microbiome data. MeLSI integrates seamlessly with data from these packages (see “Working with Bioconductor Packages” section below) and adds a novel metric learning approach for beta diversity analysis.
vegan: MeLSI uses vegan::adonis2() for PERMANOVA calculations but extends beyond fixed distance metrics by learning optimal metrics from data.
Existing distance-based methods: Unlike packages that use fixed distance metrics (Bray-Curtis, Jaccard, UniFrac), MeLSI learns which features matter most for each specific comparison, providing both improved statistical power and biological interpretability through feature importance weights.
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("MeLSI")
# Load required packages
library(MeLSI)
library(vegan)
library(ggplot2)
# Generate synthetic microbiome data for demonstration
# Using smaller dataset for faster vignette execution
set.seed(42)
test_data <- generate_test_data(n_samples = 40, n_taxa = 50, n_signal_taxa = 5)
X <- test_data$counts
y <- test_data$metadata$Group
# Display data dimensions
cat("Data dimensions:", nrow(X), "samples x", ncol(X), "taxa\n")
## Data dimensions: 40 samples x 50 taxa
cat("Groups:", paste(unique(y), collapse = ", "), "\n")
## Groups: A, B
# CLR transformation (recommended for microbiome data)
# Use the built-in helper function for easy CLR transformation
X_clr <- clr_transform(X)
# Display preprocessing results
cat("CLR transformation completed\n")
## CLR transformation completed
cat("Data range:", round(range(X_clr), 3), "\n")
## Data range: -4.137 2.379
# Run MeLSI with default parameters
# Note: For real analyses, use n_perms = 200+ for reliable p-values
# Using reduced parameters here for faster vignette execution (<15 min for Bioconductor)
cat("Running MeLSI analysis...\n")
## Running MeLSI analysis...
melsi_results <- melsi(
X_clr, y,
n_perms = 50, # Number of permutations (use 200+ for real analysis)
B = 20, # Number of weak learners (default 30, reduced for speed)
m_frac = 0.8, # Fraction of features per learner
show_progress = TRUE
)
# Display results
cat("\nMeLSI Results:\n")
##
## MeLSI Results:
cat(sprintf("F-statistic value: %.4f\n", melsi_results$F_observed))
## F-statistic value: 1.9667
cat(sprintf("P-value: %.4f\n", melsi_results$p_value))
## P-value: 0.0196
cat(sprintf("Significant: %s\n", ifelse(melsi_results$p_value < 0.05, "Yes", "No")))
## Significant: Yes
# VIP plot is automatically generated, but you can also create it manually:
# Using top_n = 10 for faster execution
plot_vip(melsi_results, top_n = 10)
# PCoA plot using the learned MeLSI distance
plot_pcoa(melsi_results, X_clr, y)
# Compare with Euclidean distance
dist_euclidean <- dist(X_clr)
adonis_euclidean <- adonis2(dist_euclidean ~ y, permutations = 50)
# Compare with Bray-Curtis distance
dist_bray <- vegdist(X, method = "bray")
adonis_bray <- adonis2(dist_bray ~ y, permutations = 50)
# Display comparison results
cat("\nMethod Comparison:\n")
##
## Method Comparison:
cat(sprintf("MeLSI: F-statistic = %.4f, p = %.4f\n",
melsi_results$F_observed, melsi_results$p_value))
## MeLSI: F-statistic = 1.9667, p = 0.0196
cat(sprintf("Euclidean: F-statistic = %.4f, p = %.4f\n",
adonis_euclidean[["F"]][1], adonis_euclidean[["Pr(>F)"]][1]))
## Euclidean: F-statistic = 1.8405, p = 0.0196
cat(sprintf("Bray-Curtis: F-statistic = %.4f, p = %.4f\n",
adonis_bray[["F"]][1], adonis_bray[["Pr(>F)"]][1]))
## Bray-Curtis: F-statistic = 8.3403, p = 0.0196
# Calculate improvement
improvement <- (melsi_results$F_observed - adonis_euclidean[["F"]][1]) / adonis_euclidean[["F"]][1] * 100
cat(sprintf("\nMeLSI improvement over Euclidean: %.1f%%\n", improvement))
##
## MeLSI improvement over Euclidean: 6.9%
MeLSI automatically generates a VIP plot showing which features drive group differences.
You can customize it using the plot_vip() function:
# Plot VIP with directionality (default)
# Using top_n = 10 for faster execution
plot_vip(melsi_results, top_n = 10, title = "Top 10 Features by Importance")
# Plot VIP without directionality coloring
plot_vip(melsi_results, top_n = 10, directionality = FALSE)
# Extract and examine feature weights programmatically
feature_weights <- melsi_results$feature_weights
top_features <- head(sort(feature_weights, decreasing = TRUE), 10)
cat("\nTop 10 Most Weighted Features:\n")
##
## Top 10 Most Weighted Features:
for (i in 1:length(top_features)) {
cat(sprintf("%2d. %s (weight: %.4f)\n",
i, names(top_features)[i], top_features[i]))
}
## 1. Taxa1 (weight: 6.2356)
## 2. Taxa16 (weight: 5.7261)
## 3. Taxa7 (weight: 5.5360)
## 4. Taxa2 (weight: 4.7193)
## 5. Taxa38 (weight: 4.5969)
## 6. Taxa19 (weight: 4.1784)
## 7. Taxa9 (weight: 4.0455)
## 8. Taxa3 (weight: 3.9464)
## 9. Taxa22 (weight: 3.5864)
## 10. Taxa26 (weight: 3.4879)
Visualize the learned MeLSI distance using PCoA:
# Create PCoA plot using the learned MeLSI distance matrix
plot_pcoa(melsi_results, X_clr, y, title = "PCoA: MeLSI Distance")
You can customize MeLSI parameters for your specific needs:
# Run MeLSI with custom parameters
# Note: This example is set to eval=FALSE for faster vignette execution
# Uncomment and run locally for full demonstration
cat("Running MeLSI with custom parameters...\n")
custom_results <- melsi(
X_clr, y,
n_perms = 200, # More permutations for higher precision (recommended: 200+)
B = 50, # Larger ensemble for more stable results
m_frac = 0.8, # Fraction of features per learner
show_progress = TRUE
)
cat(sprintf("Custom MeLSI F-statistic value: %.4f (p = %.4f)\n",
custom_results$F_observed, custom_results$p_value))
Note: The advanced usage example above is disabled (eval=FALSE) to keep vignette execution time under 15 minutes for Bioconductor requirements. For real analyses, use the parameters shown above (n_perms = 200+, B = 30-50).
Parameter Guidelines:
- n_perms: Use 200+ for reliable p-values in real analyses (vignette uses 50 for speed)
- B: Default 30 is usually sufficient; increase to 50-100 for more stability (vignette uses 20 for speed)
- m_frac: Default 0.8 works well; lower values (0.6-0.7) can help with very high-dimensional data
MeLSI integrates seamlessly with the phyloseq package, a core Bioconductor package for microbiome data:
# Load phyloseq data
library(phyloseq)
data(GlobalPatterns) # Example phyloseq object
# Extract OTU table and sample data
X <- as(otu_table(GlobalPatterns), "matrix")
y <- sample_data(GlobalPatterns)$SampleType
# CLR transformation
X_clr <- clr_transform(X)
# Run MeLSI analysis
results <- melsi(X_clr, y, n_perms = 200, B = 30, show_progress = FALSE)
# Note: For vignette execution, use reduced parameters (n_perms=50, B=20) to stay under 15 minutes
# Results can be used with other phyloseq functions
# For example, you can add the learned distance matrix to a phyloseq object
This interoperability ensures MeLSI works within the existing Bioconductor ecosystem and can be integrated into comprehensive microbiome analysis workflows.
This tutorial demonstrated:
clr_transform()plot_vip() and plot_pcoa()clr_transform, plot_vip, plot_pcoa) simplify the workflown_perms = 200+ for reliable p-values in real analysesclr_transform() for easy CLR preprocessingplot_vip() and plot_pcoa()B, m_frac) if needed for your specific dataFor more information, see the main README.md file and the comprehensive documentation.
sessionInfo()
## R version 4.6.0 alpha (2026-03-30 r89742)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_4.0.2 vegan_2.7-3 permute_0.9-10 MeLSI_0.99.8
## [5] BiocStyle_2.39.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.57 bslib_0.10.0
## [4] Biobase_2.71.0 lattice_0.22-9 vctrs_0.7.2
## [7] tools_4.6.0 generics_0.1.4 biomformat_1.39.16
## [10] stats4_4.6.0 parallel_4.6.0 tibble_3.3.1
## [13] cluster_2.1.8.2 pkgconfig_2.0.3 Matrix_1.7-5
## [16] data.table_1.18.2.1 RColorBrewer_1.1-3 S7_0.2.1
## [19] S4Vectors_0.49.0 lifecycle_1.0.5 compiler_4.6.0
## [22] farver_2.1.2 stringr_1.6.0 Biostrings_2.79.5
## [25] Seqinfo_1.1.0 codetools_0.2-20 htmltools_0.5.9
## [28] sass_0.4.10 yaml_2.3.12 pillar_1.11.1
## [31] crayon_1.5.3 jquerylib_0.1.4 MASS_7.3-65
## [34] cachem_1.1.0 iterators_1.0.14 foreach_1.5.2
## [37] nlme_3.1-169 tidyselect_1.2.1 digest_0.6.39
## [40] stringi_1.8.7 dplyr_1.2.0 reshape2_1.4.5
## [43] bookdown_0.46 splines_4.6.0 ade4_1.7-24
## [46] fastmap_1.2.0 grid_4.6.0 cli_3.6.5
## [49] magrittr_2.0.4 dichromat_2.0-0.1 survival_3.8-6
## [52] ape_5.8-1 withr_3.0.2 scales_1.4.0
## [55] rmarkdown_2.31 XVector_0.51.0 igraph_2.2.2
## [58] multtest_2.67.0 otel_0.2.0 phyloseq_1.55.2
## [61] evaluate_1.0.5 knitr_1.51 IRanges_2.45.0
## [64] mgcv_1.9-4 rlang_1.1.7 Rcpp_1.1.1
## [67] glue_1.8.0 BiocManager_1.30.27 BiocGenerics_0.57.0
## [70] jsonlite_2.0.0 R6_2.6.1 plyr_1.8.9