---
title: "Efficient Glycan Manipulation with smap"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Efficient Glycan Manipulation with smap}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Overview

This vignette introduces the `smap` family of functions. These functions are useful when you want to apply custom `igraph`-based operations to glycan structure vectors.

This guide assumes you are comfortable with R programming and have some familiarity with graph concepts. If you are just getting started, read the "Getting Started with glyrepr" vignette first.

```{r setup}
library(glyrepr)
```

## Unique Structure Optimization

Before using `smap`, it helps to understand why these functions exist.

### The Problem

Working with glycan structures means working with graphs, and graph operations are computationally expensive. When you are analyzing thousands of glycans from a large-scale study, this becomes a real bottleneck.

### The Solution

`glyrepr` implements an optimization called **unique structure storage**. Instead of storing thousands of identical graphs, it stores only the unique ones and keeps track of which original positions they belong to.

Let's see this in action:

```{r}
# Our test data: some common glycan structures
iupacs <- c(
  "Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(b1-",  # N-glycan core
  "Gal(b1-3)GalNAc(a1-",                                   # O-glycan core 1
  "Gal(b1-3)[GlcNAc(b1-6)]GalNAc(a1-",                     # O-glycan core 2
  "Man(a1-3)[Man(a1-6)]Man(a1-3)[Man(a1-6)]Man(a1-",       # Branched mannose
  "GlcNAc6Ac(b1-4)Glc3Me(a1-"                              # With decorations
)

struc <- as_glycan_structure(iupacs)

# Now create a realistic dataset with lots of repetition.
large_struc <- rep(struc, 1000)  # 5,000 total structures
large_struc
```

Notice that the object reports only 5 unique structures. The vector has 5,000 elements, but only 5 unique graphs are stored internally.
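
The bookkeeping behind this can be sketched in base R. The snippet below is only an illustration under simplifying assumptions: plain strings stand in for glycan graphs, and the names `toy`, `uniq`, and `idx` are made up for this example rather than taken from `glyrepr` internals.

```{r}
# Toy stand-in: strings play the role of glycan graphs.
toy <- rep(c("A", "B", "C"), times = 1000)  # 3,000 elements, 3 distinct values

uniq <- unique(toy)       # keep each distinct value once
idx  <- match(toy, uniq)  # for every position, which unique value it holds

length(uniq)  # number of stored values
length(idx)   # number of positions

# The full vector can be rebuilt on demand from the two pieces.
identical(uniq[idx], toy)
```

The real `glycan_structure` vector holds graphs rather than strings, but the claim is analogous: 5 unique graphs serve 5,000 positions.
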
We can verify that directly:

```{r}
# Only 5 unique graphs are stored internally
length(attr(large_struc, "structures"))

# But we have 5,000 total elements
length(large_struc)
```

### Memory Savings

```{r}
library(lobstr)
obj_sizes(struc, large_struc)
```

The memory difference can be substantial. For repeated structures, the optimized representation can be much smaller than storing every graph independently.

## The `smap` Family

There is one important consequence of this internal representation: regular `lapply()` or `purrr::map()` functions do not operate directly on a glycan structure vector as if it were a list of graphs.

```{r}
# This will not work and will raise an error.
tryCatch(
  purrr::map_int(large_struc, ~ igraph::vcount(.x)),
  error = function(e) cat("Error:", rlang::cnd_message(e))
)
```

**Why does this fail?** Because `purrr` functions don't understand the internal structure optimization of `glycan_structure` objects.

### Structure-Aware Mapping

The `smap` functions are structure-aware alternatives to `purrr` mapping functions. They understand the unique structure optimization and work directly with the underlying graph objects.

```{r}
vertex_counts <- smap_int(large_struc, ~ igraph::vcount(.x))
vertex_counts[1:10]
```

The "s" stands for "structure": these functions operate on the underlying `igraph` objects that represent glycan structures.

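
Conceptually, a structure-aware map can be thought of as two steps: call the function once per unique structure, then expand the results back to the full length. The sketch below illustrates that idea in base R. `map_unique_int()` and `count_chars()` are hypothetical helpers written for this example, strings stand in for graphs, and none of this is the actual `glyrepr` implementation.

```{r}
# Hypothetical helper: map over unique inputs only, then expand.
map_unique_int <- function(x, f) {
  uniq <- unique(x)       # distinct inputs
  idx  <- match(x, uniq)  # position -> index of its unique input
  res  <- vapply(uniq, f, integer(1), USE.NAMES = FALSE)  # one call per unique input
  res[idx]                # expand results back to the full length
}

# Counter to record how often the mapped function really runs.
calls <- 0L
count_chars <- function(s) {
  calls <<- calls + 1L
  nchar(s)
}

toy_names <- rep(c("Man", "GlcNAc", "Gal"), times = 1000)
out <- map_unique_int(toy_names, count_chars)

length(out)  # one result per element
calls        # number of actual function calls
```

The function runs once per distinct input, while the result still has one entry per element, which is the behaviour the `smap` functions are described as providing for glycan structure vectors.
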

## The `smap` Toolkit

The `smap` family provides glycan-aware equivalents for the most commonly used `purrr` mapping functions:

| purrr | smap | purrr | smap |
|-------|------|-------|------|
| `map()` | `smap()` | `map2()` | `smap2()` |
| `map_lgl()` | `smap_lgl()` | `map2_lgl()` | `smap2_lgl()` |
| `map_int()` | `smap_int()` | `map2_int()` | `smap2_int()` |
| `map_dbl()` | `smap_dbl()` | `map2_dbl()` | `smap2_dbl()` |
| `map_chr()` | `smap_chr()` | `map2_chr()` | `smap2_chr()` |
| `some()` | `ssome()` | `pmap()` | `spmap()` |
| `every()` | `severy()` | `pmap_*()` | `spmap_*()` |
| `none()` | `snone()` | `imap()` | `simap()` |
| | | `imap_*()` | `simap_*()` |

As a simple rule, replace `map` with `smap`, `pmap` with `spmap`, and `imap` with `simap`. The function signatures are designed to feel familiar if you already use `purrr`.

### Basic Examples

**Count vertices in each structure:**

```{r}
vertex_counts <- smap_int(large_struc, igraph::vcount)
summary(vertex_counts)
```

**Find structures with more than 4 vertices:**

```{r}
has_many_vertices <- smap_lgl(large_struc, ~ igraph::vcount(.x) > 4)
sum(has_many_vertices)
```

**Get the degree sequence of each structure:**

```{r}
degree_sequences <- smap(large_struc, ~ igraph::degree(.x))
degree_sequences[1:3]
```

**Check if any structure has isolated vertices:**

```{r}
ssome(large_struc, ~ any(igraph::degree(.x) == 0))
```

**Verify all structures are connected:**

```{r}
severy(large_struc, ~ igraph::is_connected(.x))
```

### Beyond Basic `smap()`

Quick examples of the extended family:

```{r}
# smap2: map over structures and a second, parallel vector
thresholds <- c(3, 4, 5)
large_enough <- smap2_lgl(struc[1:3], thresholds, function(g, threshold) {
  igraph::vcount(g) >= threshold
})
large_enough
```

```{r}
# simap: Include position information
indexed_report <- simap_chr(large_struc[1:3], function(g, i) {
  paste0("#", i, ": ", igraph::vcount(g), " vertices")
})
indexed_report
```

**Performance note:** `simap` functions do not benefit from the
unique structure optimization. Since each element has a different index, the combination of `(structure, index)` is always unique, breaking the deduplication that makes other `smap` functions fast. Use `simap` only when you truly need position information.

## Performance

The main performance benefit of `smap` functions comes from automatic deduplication:

```{r}
# Create a large dataset with high redundancy
huge_struc <- rep(struc, 5000)  # 25,000 structures, only 5 unique

cat("Dataset size:", length(huge_struc), "structures\n")
cat("Unique structures:", length(attr(huge_struc, "structures")), "\n")
cat("Redundancy factor:", length(huge_struc) / length(attr(huge_struc, "structures")), "x\n")

library(tictoc)

# Optimized approach: smap only processes 5 unique structures
tic("smap_int (optimized)")
vertex_counts_optimized <- smap_int(huge_struc, igraph::vcount)
toc()

# Naive approach: extract all graphs and process each one
tic("Naive approach (all graphs)")
all_graphs <- get_structure_graphs(huge_struc)  # Extracts all 25,000 graphs
vertex_counts_naive <- purrr::map_int(all_graphs, igraph::vcount)
toc()

# Verify results are equivalent (though data types may differ)
all.equal(vertex_counts_optimized, vertex_counts_naive)
```

The higher the redundancy, the larger the performance gain. In real glycoproteomics datasets with repeated structures, this optimization can yield substantial speedups; the exact factor depends on how redundant the data is and on how expensive the mapped function is.

## Additional Patterns

### Working with Complex Functions

The function you pass to `smap` must accept an `igraph` object as its first argument. You can use purrr-style lambda notation:

```{r}
# Calculate clustering coefficient for each structure
clustering_coeffs <- smap_dbl(large_struc, ~ igraph::transitivity(.x, type = "global"))
summary(clustering_coeffs)
```

### Combining Multiple Metrics

```{r}
# Create a compact structure summary.
structure_analysis <- smap(large_struc, function(g) {
  list(
    vertices = igraph::vcount(g),
    edges = igraph::ecount(g),
    # Scalar condition, so plain if/else is used rather than vectorized ifelse()
    diameter = if (igraph::is_connected(g)) igraph::diameter(g) else NA_real_,
    clustering = igraph::transitivity(g, type = "global")
  )
})

# Convert to a more usable format
analysis_df <- do.call(rbind, lapply(structure_analysis, data.frame))
head(analysis_df)
```

### Memory-Efficient Filtering

```{r}
# Find only structures with exactly 5 vertices
has_five_vertices <- smap_lgl(large_struc, ~ igraph::vcount(.x) == 5)
five_vertex_structures <- large_struc[has_five_vertices]

cat("Found", sum(has_five_vertices), "structures with exactly 5 vertices\n")
```

## When to Use `smap` Functions

**Use `smap` functions when:**

- You need to apply `igraph`-based functions to glycan structures.
- You want better performance with datasets containing repeated structures.
- You are building custom glycan analysis pipelines.

**Use regular R functions when:**

- You are working with compositions.
- You are operating on string representations.

**Special note on `simap`:** While `simap` functions are convenient for position-aware operations, they do not provide performance benefits over regular `imap` functions. Including the index breaks the unique structure optimization, because every `(structure, index)` pair is unique even when the structures are identical.

## Example: Custom Motif Detection

Here's how you might build a custom glycan analysis pipeline using `smap` functions:

```{r}
# Custom motif detector: TRUE if any vertex has degree 3 or more
detect_branching <- function(g) {
  degrees <- igraph::degree(g)
  any(degrees >= 3)
}

# Apply to a large dataset using unique structure optimization.
has_branching <- smap_lgl(large_struc, detect_branching)
cat("Structures with branching:", sum(has_branching), "out of", length(large_struc), "\n")

# Use smap2 to check structures against complexity thresholds
complexity_thresholds <- rep(c(3, 4, 5, 2, 4), 1000)  # Thresholds for each structure
meets_threshold <- smap2_lgl(large_struc, complexity_thresholds, function(g, threshold) {
  igraph::vcount(g) >= threshold
})
cat("Structures meeting complexity threshold:", sum(meets_threshold), "out of", length(large_struc), "\n")
```

## Summary

The `smap` family provides structure-aware mapping functions for glycan structure vectors. It lets you write custom graph-based analyses while preserving the unique structure optimization used by `glyrepr`.

Key takeaways:

- Unique structure optimization stores repeated structures efficiently.
- `smap` functions are `purrr`-like tools that understand glycan structure vectors.
- Performance gains are strongest when datasets contain repeated structures.
- Use `smap` for structures, and use regular R or `purrr` functions for other data types.

## Session Information

```{r}
sessionInfo()
```