--- title: "Normalizing scRNA-seq data with Sanity" author: - name: Teo Sakel email: teo@intelligentbiodata.com date: "`r Sys.Date()`" package: SanityR output: BiocStyle::html_document: toc: true toc_float: true bibliography: ref.bib vignette: > %\VignetteIndexEntry{Normalizing scRNA-seq data with Sanity} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse=TRUE, comment="#>", warning=FALSE, message=FALSE ) ``` # Introduction Sanity (SAmpling-Noise-corrected Inference of Transcription activitY) provides a biophysics-inspired normalization method for single-cell RNA-seq data. It uses a Bayesian framework with minimal number of assumptions in order to model physically meaningful quantities and their corresponding uncertainties, facilitating comparisons between cells and experiments. The full details of the method are described in the original publication and its supplementary [@sanity]. # Installation To install this package, start R (version "`r paste0(R.version$major, ".", strsplit(R.version$minor, "\\.")[[1]][1])`") and enter: ```r if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install("SanityR") ``` # Quick start The basic functionality of the `r Biocpkg("SanityR")` package is to compute the `logcounts` assay from UMI `counts` of a `r Biocpkg("SingleCellExperiment")` object. It thus acts as a replacement for the `logNormCounts` from `r Biocpkg("scuttle")` on any other function with similar functionality. ```{r quick_start, results='hide'} library(scuttle) library(SanityR) sce <- mockSCE() sce <- computeLibraryFactors(sce) # or any other method to compute size factors sce <- Sanity(sce) logcounts(sce) ``` # Simulate single-cell data To demonstrate the functionality of `r Biocpkg("SanityR")`, we first simulate a single-cell dataset using the method described in the Sanity paper. In short, the data are generated by sampling from the Bayesian prior of the model and the correlation between the cells' gene expression is generated by simulating a random walk process in the gene expression space so that the mean value of transcriptional activity in one cell is close to that of its predecessor in the walk. The number of cells, genes and the other prior parameters are tuned to match the dataset generated in [@baron]. Here we down-sample the number of genes and cells for speed. `r Biocpkg("SanityR")` is designed to work with the `r Biocpkg("SingleCellExperiment")` class and thus the simulation methods return a `SingleCellExperiment` object. The latent ("true") log normalized counts are stored in the `logFC` assay so as to not interfere with the `logcounts` assay. ```{r simulate} library(SingleCellExperiment) library(SanityR) set.seed(42) sce <- simulate_branched_random_walk( N_gene=1000, N_path=10, length_path=12 ) sce ``` Genes with no UMI in any of the cells are removed that is why the number of rows may be less than 1000. Besides the `logFC` assay, the following parameters are stored in the `sce` to fully specify the model: - `ltq_mean`: the mean of the *log transcriptional activity*, defined as the quotient of the number of molecules transcribed from each gene versus the total number of molecules in the cell. - `ltq_var`: the variance of the log transcriptional activity. - `cell_size`: the total number of molecules in the cell. - `predecessor`: the index of the cell that precedes of the current cell in the random walk. Thus if `predecessor[n] == m`, then $m \to n$ in the random walk. # Running Sanity normalization The `Sanity` function is the main function of the package. When called on a `SingleCellExperiment` object, it will perform normalization for each gene and independently and store the results in the `logcounts` assay. ```{r run_sanity} sf <- colSums(counts(sce)) sizeFactors(sce) <- sf / mean(sf) sce <- Sanity(sce) sce ``` Below we visualize the accuracy of the `Sanity` normalization method by partially reproducing Figure 3A of the original publication. First, at estimating the overall mean and variance of a gene expression. ```{r sanity_accuracy, fig.wide=TRUE, fig.cap="Gene Level Parameters. Comparison of Sanity estimates with latent values used to generate the data."} par(mfrow=c(1, 2)) plot( sanity_log_activity_mean ~ ltq_mean, data=rowData(sce), frame=FALSE, pch=16, col="#00000044", xlab="Mean UMI counts", ylab="Sanity Estimate", main="Mean Log Transcriptional Activity" ) abline(0, 1, col=2, lwd=2, lty=2) legend("top", legend="y=x", col=2, lwd=2, lty=2, bty="n") plot( sanity_activity_sd^2 ~ ltq_var, data=rowData(sce), log="xy", frame=FALSE, pch=16, col="#00000044", xlab="True LTQ Variance", ylab="Sanity Estimate", main="Variance of Log Transcriptional Activity" ) abline(0, 1, col=2, lwd=2, lty=2) legend("top", legend="y=x", col=2, lwd=2, lty=2, bty="n") ``` Second, at a cell-level by calculating the correlation of latent and inferred `logcounts` using different using genes of different expression level. ```{r sanity_correlation, fig.cap="Cell Level Estimates. Correlation of Sanity inferred logcounts with latent logcounts using genes at different quantiles of expression."} umi_mean <- rowMeans(counts(sce)) umi_quantiles <- quantile(umi_mean, seq(0, 1, .2)) umi_quantiles <- cut( umi_mean, umi_quantiles, include.lowest=TRUE, labels=paste0("Q", 1:5) ) rho <- sapply( split(1:nrow(sce), umi_quantiles), function(ix) { sce_q <- sce[ix,] diag(cor(assay(sce_q, "logFC"), logcounts(sce_q))) } ) boxplot( rho, frame=FALSE, main="Correlation between latent and inferred logcounts", xlab="Quantiles of Mean UMI Count", ylab="Pearson Correlation" ) ``` # Compute cell distances The `logcounts` calculated by `Sanity` can be used for downstream analyses. Typically, these analyses involve computing distances between cells which can be done by using the functions that `r Biocpkg("BiocNeighbors")` provides. However, `r Biocpkg("SanityR")` provides a more principled approach that takes into account the uncertainty of the estimated log normalized counts. This is implemented in the `calculateSanityDistance` function. Below is an example of using the distances calculated by `calculateSanityDistance` as input to `r CRANpkg("Rtsne")` to embed the cells in a 2D space which captures the underlying random walk structure that generated the data. ```{r calculate_distance, fig.cap="t-SNE visualization using the Sanity calculated distances. Pseudotime is computed based the predecessor relations between cells.", fig.wide=TRUE} cell_dist <- calculateSanityDistance(sce) reducedDim(sce, "TSNE") <- Rtsne::Rtsne(cell_dist, is_distance=TRUE)$Y # Visualize the t-SNE embedding sce$pseudotime <- 0 for (i in 2:ncol(sce)) # cell 1 is the stem cell sce$pseudotime[i] <- sce$pseudotime[sce$predecessor[i]] + 1 scater::plotTSNE(sce, color_by="pseudotime") ``` Unlike many functions of `r Biocpkg("BiocNeighbors")`^[which compute only the `k` nearest neighbors], `calculateSanityDistance` computes *all* the pairwise distances and thus is more resource intensive, both in terms of memory and time. To ease the computational burden, the user can limit the number of genes involved in the calculations by filtering only the highly variable ones. `calculateSanityDistance` uses the signal-to-noise ratio (`snr_cutoff`) internally to filter the genes. # References {.unnumbered} # Session info ```{r session_info} sessionInfo() ```