---
title: "sparseMatrixStats"
author: Constantin Ahlmann-Eltze
date: "`r Sys.Date()`"
output: BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{sparseMatrixStats}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Installation

You can install the release version of `r BiocStyle::Biocpkg("sparseMatrixStats")` from Bioconductor:

``` r
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("sparseMatrixStats")
```

# Introduction

The sparseMatrixStats package implements a number of summary functions for sparse matrices from the `r BiocStyle::CRANpkg("Matrix")` package.

Let us load the package and create a synthetic sparse matrix.

```{r}
library(sparseMatrixStats)
# Matrix defines the sparse matrix class
# dgCMatrix that we will use
library(Matrix)
# For reproducibility
set.seed(1)
```

Create a synthetic table of customers, items, and how often each customer bought each item.

```{r}
customer_ids <- seq_len(100)
item_ids <- seq_len(30)
n_transactions <- 1000
customer <- sample(customer_ids, size = n_transactions, replace = TRUE,
                   prob = runif(100))
item <- sample(item_ids, size = n_transactions, replace = TRUE,
               prob = runif(30))

tmp <- table(paste0(customer, "-", item))
tmp2 <- strsplit(names(tmp), "-")
purchase_table <- data.frame(
  customer = as.numeric(sapply(tmp2, function(x) x[1])),
  item = as.numeric(sapply(tmp2, function(x) x[2])),
  n = as.numeric(tmp)
)

head(purchase_table, n = 10)
```

Let us turn the table into a matrix to simplify the analysis:

```{r}
purchase_matrix <- sparseMatrix(purchase_table$customer, purchase_table$item,
                x = purchase_table$n,
                dims = c(100, 30),
                dimnames = list(customer = paste0("Customer_", customer_ids),
                                item = paste0("Item_", item_ids)))
purchase_matrix[1:10, 1:15]
```

We can see that some customers did not buy anything, whereas some bought a lot.
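To make the triplet-style construction used above more concrete, here is a minimal sketch on a toy table. The names `toy_table` and `toy_matrix` are made up for illustration and are not part of the package or of the dataset above.

```{r}
library(Matrix)

# Hypothetical toy purchase table: one row per (customer, item) pair
toy_table <- data.frame(
  customer = c(1, 1, 3),
  item     = c(2, 4, 1),
  n        = c(5, 1, 2)
)

# i = row indices, j = column indices, x = values of the non-zero entries;
# every position that is not listed is an implicit zero
toy_matrix <- sparseMatrix(
  i = toy_table$customer,
  j = toy_table$item,
  x = toy_table$n,
  dims = c(3, 4)
)
toy_matrix

# Customer 2 does not appear in the table, so its row contains only zeros
Matrix::rowSums(toy_matrix)
```

Customers that are missing from the table still get a row because `dims` fixes the shape of the matrix up front, which is why customers without purchases show up as all-zero rows.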
`sparseMatrixStats` can help us identify interesting patterns in this data:

```{r}
# How often was each item bought in total?
colSums2(purchase_matrix)

# What is the range of number of items each
# customer bought?
head(rowRanges(purchase_matrix))

# What is the variance in the number of items
# each customer bought?
head(rowVars(purchase_matrix))

# How many items did a customer not buy at all, one time, 2 times,
# or exactly 4 times?
head(rowTabulates(purchase_matrix, values = c(0, 1, 2, 4)))
```

## Alternative Matrix Creation

In the previous section, I demonstrated how to create a sparse matrix from scratch using the `sparseMatrix()` function.
However, often you already have an existing matrix and want to convert it to a sparse representation.

```{r}
mat <- matrix(0, nrow=10, ncol=6)
mat[sample(seq_len(60), 4)] <- 1:4

# Convert dense matrix to sparse matrix
sparse_mat <- as(mat, "dgCMatrix")
sparse_mat
```

The *sparseMatrixStats* package is a derivative of the `r BiocStyle::CRANpkg("matrixStats")` package and implements its API for sparse matrices.
For example, to calculate the variance of each column of `mat` you can do

```{r}
apply(mat, 2, var)
```

However, this is quite inefficient, and *matrixStats* provides a dedicated function

```{r}
matrixStats::colVars(mat)
```

For sparse matrices, you can likewise just call

```{r}
sparseMatrixStats::colVars(sparse_mat)
```

# Benchmark

If you have a large matrix with many exact zeros, working on the sparse representation can considerably speed up the computations.
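To give a rough idea of where such a speed-up can come from: a `dgCMatrix` stores only the non-zero values (slot `x`) together with column pointers (slot `p`), so a column statistic can be computed from the stored values plus the number of implicit zeros. The following is only a sketch of that idea and not the package's actual implementation; `sparse_col_vars`, `toy`, and `toy_sparse` are hypothetical names introduced for illustration.

```{r}
library(Matrix)

# Hypothetical helper: column variances computed from the stored
# entries of a dgCMatrix only
sparse_col_vars <- function(m) {
  n <- nrow(m)
  vapply(seq_len(ncol(m)), function(j) {
    # m@p holds 0-based offsets into m@x that mark where each column starts
    idx <- seq_len(m@p[j + 1] - m@p[j]) + m@p[j]
    vals <- m@x[idx]
    # The n - length(vals) implicit zeros contribute nothing to either sum
    col_mean <- sum(vals) / n
    (sum(vals^2) - n * col_mean^2) / (n - 1)
  }, numeric(1))
}

set.seed(1)
toy <- matrix(0, nrow = 10, ncol = 6)
toy[sample(seq_len(60), 4)] <- 1:4
toy_sparse <- as(toy, "CsparseMatrix")

# Compare against the dense reference computation
all.equal(sparse_col_vars(toy_sparse), apply(toy, 2, var))
```

The work per column is proportional to the number of stored entries rather than the number of rows, which is the property that pays off when most entries are zero. The sum-of-squares formula is used here for brevity; it is less numerically robust than a two-pass computation.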
I generate a dataset with 10,000 rows and 50 columns that is 99% empty:

```{r}
big_mat <- matrix(0, nrow=1e4, ncol=50)
big_mat[sample(seq_len(1e4 * 50), 5000)] <- rnorm(5000)

# Convert dense matrix to sparse matrix
big_sparse_mat <- as(big_mat, "dgCMatrix")
```

I use the bench package to benchmark the performance difference:

```{r}
bench::mark(
  sparseMatrixStats=sparseMatrixStats::colMedians(big_sparse_mat),
  matrixStats=matrixStats::colMedians(big_mat),
  apply=apply(big_mat, 2, median)
)
```

As you can see, *sparseMatrixStats* is more than 60 times faster than *matrixStats*, which in turn is 7 times faster than the `apply()` version.

# Session Info

```{r}
sessionInfo()
```