---
title: "Efficient Glycan Manipulation with smap"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Efficient Glycan Manipulation with smap}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Overview

This vignette introduces the `smap` family of functions.
These functions are useful when you want to apply custom `igraph`-based operations to glycan structure vectors.

This guide assumes you are comfortable with R programming and have some familiarity with graph concepts.
If you are just getting started,
read the "Getting Started with glyrepr" vignette first.

```{r setup}
library(glyrepr)
```

## Unique Structure Optimization

Before using `smap`, it helps to understand why these functions exist.

### The Problem

Working with glycan structures means working with graphs, 
and graph operations are computationally expensive. 
When you are analyzing thousands of glycans from a large-scale study,
this becomes a real bottleneck.

### The Solution

`glyrepr` implements an optimization called **unique structure storage**.
Instead of storing thousands of identical graphs,
it stores each unique graph once and records which unique graph every position in the vector refers to.
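The idea can be sketched in base R with plain strings standing in for graphs.
This is a conceptual illustration only, not `glyrepr`'s actual implementation; the `toy_*` names are made up for this sketch.

```{r}
# Conceptual sketch only: short strings stand in for glycan graphs.
toy_x <- rep(c("core", "o_glycan_1", "o_glycan_2"), times = 4)  # 12 elements

toy_unique <- unique(toy_x)            # each distinct value is stored once
toy_index <- match(toy_x, toy_unique)  # which stored value each position uses

length(toy_unique)  # number of stored values
length(toy_index)   # number of original positions

# The full vector can always be rebuilt from the two pieces.
identical(toy_unique[toy_index], toy_x)
```

The stored values plus a lightweight index carry the same information as the full vector.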

Let's see this in action:

```{r}
# Our test data: some common glycan structures
iupacs <- c(
  "Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(b1-",  # N-glycan core
  "Gal(b1-3)GalNAc(a1-",                                    # O-glycan core 1
  "Gal(b1-3)[GlcNAc(b1-6)]GalNAc(a1-",                     # O-glycan core 2
  "Man(a1-3)[Man(a1-6)]Man(a1-3)[Man(a1-6)]Man(a1-",          # Branched mannose
  "GlcNAc6Ac(b1-4)Glc3Me(a1-"                              # With decorations
)

struc <- as_glycan_structure(iupacs)

# Now create a realistic dataset with lots of repetition.
large_struc <- rep(struc, 1000)  # 5,000 total structures
large_struc
```

Notice that the object reports only 5 unique structures.
The vector has 5,000 elements,
but only 5 unique graphs are stored internally.

We can verify that directly:

```{r}
# Only 5 unique graphs are stored internally
length(attr(large_struc, "structures"))

# But we have 5,000 total elements
length(large_struc)
```

### Memory Savings

```{r}
library(lobstr)
obj_sizes(struc, large_struc)
```

Compare the two sizes:
`large_struc` has 1,000 times as many elements as `struc`,
yet its size grows far less than 1,000-fold, because each repeated structure is stored only once.

## The `smap` Family

There is one important consequence of this internal representation:
functions such as `lapply()` or `purrr::map()` cannot iterate over a glycan structure vector as if it were a list of graphs.

```{r}
# This will not work and will raise an error.
tryCatch(
  purrr::map_int(large_struc, ~ igraph::vcount(.x)),
  error = function(e) cat("Error:", rlang::cnd_message(e))
)
```

**Why does this fail?**
`purrr` functions don't understand the internal structure optimization of `glycan_structure` objects.

### Structure-Aware Mapping

The `smap` functions are structure-aware alternatives to `purrr` mapping functions.
They understand the unique structure optimization and work directly with the underlying graph objects.

```{r}
vertex_counts <- smap_int(large_struc, ~ igraph::vcount(.x))
vertex_counts[1:10]
```

The "s" stands for "structure":
these functions operate on the underlying `igraph` objects that represent glycan structures.
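To make the mechanism concrete, here is a minimal base R sketch of a deduplicating map.
It assumes strings as stand-ins for graphs, and `dedup_map_int()` is a hypothetical helper written for this illustration; it is not how `glyrepr` implements `smap_int()`.

```{r}
# Hypothetical helper for illustration; this is not glyrepr's implementation.
dedup_map_int <- function(x, f) {
  uniq <- unique(x)                                      # distinct inputs only
  res <- vapply(uniq, f, integer(1), USE.NAMES = FALSE)  # one call per distinct input
  res[match(x, uniq)]                                    # expand back to full length
}

# Strings stand in for graphs; the counter records how often the function runs.
n_calls <- 0L
counting_nchar <- function(s) {
  n_calls <<- n_calls + 1L
  nchar(s)
}

toy <- rep(c("Man", "GlcNAc", "Gal"), times = 1000)  # 3,000 elements, 3 distinct
out <- dedup_map_int(toy, counting_nchar)

length(out)                 # same length as the input
n_calls                     # how many times the function actually ran
identical(out, nchar(toy))  # same result as mapping over every element
```

The function runs once per distinct input, and the results are then spread back over every original position.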

## The `smap` Toolkit

The `smap` family provides glycan-aware equivalents for the most common `purrr` mapping and predicate functions:

| purrr | smap | purrr | smap |
|-------|------|-------|------|
| `map()` | `smap()` | `map2()` | `smap2()` |
| `map_lgl()` | `smap_lgl()` | `map2_lgl()` | `smap2_lgl()` |
| `map_int()` | `smap_int()` | `map2_int()` | `smap2_int()` |
| `map_dbl()` | `smap_dbl()` | `map2_dbl()` | `smap2_dbl()` |
| `map_chr()` | `smap_chr()` | `map2_chr()` | `smap2_chr()` |
| `some()` | `ssome()` | `pmap()` | `spmap()` |
| `every()` | `severy()` | `pmap_*()` | `spmap_*()` |
| `none()` | `snone()` | `imap()` | `simap()` |
| | | `imap_*()` | `simap_*()` |

As a simple rule, replace `map` with `smap`,
`pmap` with `spmap`,
`imap` with `simap`,
and `some()`, `every()`, and `none()` with `ssome()`, `severy()`, and `snone()`.
The function signatures are designed to feel familiar if you already use `purrr`.

### Basic Examples

**Count vertices in each structure:**
```{r}
vertex_counts <- smap_int(large_struc, igraph::vcount)
summary(vertex_counts)
```

**Find structures with more than 4 vertices:**
```{r}
has_many_vertices <- smap_lgl(large_struc, ~ igraph::vcount(.x) > 4)
sum(has_many_vertices)
```

**Get the degree sequence of each structure:**
```{r}
degree_sequences <- smap(large_struc, ~ igraph::degree(.x))
degree_sequences[1:3]
```

**Check if any structure has isolated vertices:**
```{r}
ssome(large_struc, ~ any(igraph::degree(.x) == 0))
```

**Verify all structures are connected:**
```{r}
severy(large_struc, ~ igraph::is_connected(.x))
```

### Beyond Basic `smap()`

Quick examples of the extended family:

```{r}
# smap2: Map over structures and a second, parallel vector
thresholds <- c(3, 4, 5)
large_enough <- smap2_lgl(struc[1:3], thresholds, function(g, threshold) {
  igraph::vcount(g) >= threshold
})
large_enough
```

```{r}
# simap: Include position information
indexed_report <- simap_chr(large_struc[1:3], function(g, i) {
  paste0("#", i, ": ", igraph::vcount(g), " vertices")
})
indexed_report
```

**Performance note:**
`simap` functions do not benefit from the unique structure optimization.
Since each element has a different index,
the combination of `(structure, index)` is always unique, 
breaking the deduplication that makes other `smap` functions fast. 
Use `simap` only when you truly need position information.
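A small base R sketch shows why pairing each element with its position defeats deduplication.
Strings again stand in for graphs, and the `toy_*` names are illustrative only.

```{r}
toy_s <- rep(c("Man", "Gal"), times = 3)  # 6 elements, 2 distinct values

# Without positions, only the distinct values need processing.
length(unique(toy_s))

# Pair each element with its position, as an index-aware mapper must.
toy_pairs <- paste(toy_s, seq_along(toy_s), sep = "@")

# Every (value, position) pair is distinct, so nothing can be skipped.
length(unique(toy_pairs))
```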

## Performance

The main performance benefit of `smap` functions comes from automatic deduplication:

```{r}
# Create a large dataset with high redundancy
huge_struc <- rep(struc, 5000)  # 25,000 structures, only 5 unique

cat("Dataset size:", length(huge_struc), "structures\n")
cat("Unique structures:", length(attr(huge_struc, "structures")), "\n")
cat("Redundancy factor:", length(huge_struc) / length(attr(huge_struc, "structures")), "x\n")

library(tictoc)

# Optimized approach: smap only processes 5 unique structures
tic("smap_int (optimized)")
vertex_counts_optimized <- smap_int(huge_struc, igraph::vcount)
toc()

# Naive approach: extract all graphs and process each one
tic("Naive approach (all graphs)")
all_graphs <- get_structure_graphs(huge_struc)  # Extracts all 25,000 graphs
vertex_counts_naive <- purrr::map_int(all_graphs, igraph::vcount)
toc()

# Verify results are equivalent (though data types may differ)
all.equal(vertex_counts_optimized, vertex_counts_naive)
```

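The higher the redundancy, the larger the performance gain.
In real glycoproteomics datasets with repeated structures,
this optimization can yield speedups of roughly an order of magnitude,
depending on how often structures repeat and how expensive the mapped function is.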

## Additional Patterns

### Working with Complex Functions

The function you pass to `smap` must accept an `igraph` object as its first argument. 
You can use purrr-style lambda notation:

```{r}
# Calculate clustering coefficient for each structure
clustering_coeffs <- smap_dbl(large_struc, ~ igraph::transitivity(.x, type = "global"))
summary(clustering_coeffs)
```

### Combining Multiple Metrics

```{r}
# Create a compact structure summary.
structure_analysis <- smap(large_struc, function(g) {
  list(
    vertices = igraph::vcount(g),
    edges = igraph::ecount(g),
    diameter = if (igraph::is_connected(g)) igraph::diameter(g) else NA_real_,
    clustering = igraph::transitivity(g, type = "global")
  )
})

# Convert to a more usable format
analysis_df <- do.call(rbind, lapply(structure_analysis, data.frame))
head(analysis_df)
```

### Memory-Efficient Filtering

```{r}
# Find only structures with exactly 5 vertices
has_five_vertices <- smap_lgl(large_struc, ~ igraph::vcount(.x) == 5)
five_vertex_structures <- large_struc[has_five_vertices]

cat("Found", sum(has_five_vertices), "structures with exactly 5 vertices\n")
```

## When to Use `smap` Functions

**Use `smap` functions when:**

- You need to apply `igraph`-based functions to glycan structures.
- You want better performance with datasets containing repeated structures.
- You are building custom glycan analysis pipelines.

**Use regular R functions when:**

- You are working with compositions.
- You are operating on string representations.

**Special note on `simap`:**

While `simap` functions are convenient for position-aware operations, 
they do not provide performance benefits over regular `imap` functions.
The inclusion of index information breaks the unique structure optimization, 
making each `(structure, index)` pair unique even when structures are identical.

## Example: Custom Motif Detection

Here's how you might build a custom glycan analysis pipeline using `smap` functions:

```{r}
# Custom motif detector
detect_branching <- function(g) {
  degrees <- igraph::degree(g)
  any(degrees >= 3)
}

# Apply to a large dataset using unique structure optimization.
has_branching <- smap_lgl(large_struc, detect_branching)
cat("Structures with branching:", sum(has_branching), "out of", length(large_struc), "\n")

# Use smap2 to check structures against complexity thresholds
complexity_thresholds <- rep(c(3, 4, 5, 2, 4), 1000)  # Thresholds for each structure
meets_threshold <- smap2_lgl(large_struc, complexity_thresholds, function(g, threshold) {
  igraph::vcount(g) >= threshold
})
cat("Structures meeting complexity threshold:", sum(meets_threshold), "out of", length(large_struc), "\n")
```
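Because the function passed to `smap_lgl()` receives a plain `igraph` object,
a detector like this can be sanity-checked on small hand-built graphs before mapping it over real data.
The toy graphs below are not glycans; they are assumptions chosen only to exercise both outcomes.

```{r}
# Same detector as in the example above, repeated so this chunk runs on its own.
detect_branching <- function(g) {
  any(igraph::degree(g) >= 3)
}

# A linear chain of 4 vertices: no vertex has 3 or more neighbours.
toy_path <- igraph::make_ring(4, circular = FALSE)

# A star with one centre and 3 leaves: the centre has degree 3.
toy_star <- igraph::make_star(4, mode = "undirected")

detect_branching(toy_path)
detect_branching(toy_star)
```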

## Summary

The `smap` family provides structure-aware mapping functions for glycan structure vectors.
It lets you write custom graph-based analyses while preserving the unique structure optimization used by `glyrepr`.

Key takeaways:

- Unique structure optimization stores repeated structures efficiently.
- `smap` functions are `purrr`-like tools that understand glycan structure vectors.
- Performance gains are strongest when datasets contain repeated structures.
- Use `smap` for structures, and use regular R or `purrr` functions for other data types.

## Session Information

```{r}
sessionInfo()
```