scater 1.16.2
This document gives an introduction to and overview of the quality control functionality of the scater package.
scater contains tools to help with the analysis of single-cell transcriptomic data,
focusing on low-level steps such as quality control, normalization and visualization.
It is based on the SingleCellExperiment
class (from the SingleCellExperiment package),
and thus is interoperable with many other Bioconductor packages such as scran, batchelor and iSEE.
Note: A more comprehensive description of the use of scater (along with other packages) in a scRNA-seq analysis workflow is available at https://osca.bioconductor.org.
SingleCellExperiment
objectWe assume that you have a matrix containing expression count data summarised at the level of some features (gene, exon, region, etc.).
First, we create a SingleCellExperiment
object containing the data, as demonstrated below with a famous brain dataset.
Rows of the object correspond to features, while columns correspond to samples, i.e., cells in the context of single-cell ’omics data.
library(scRNAseq)
example_sce <- ZeiselBrainData()
example_sce
## class: SingleCellExperiment
## dim: 20006 3005
## metadata(0):
## assays(1): counts
## rownames(20006): Tspan12 Tshz1 ... mt-Rnr1 mt-Nd4l
## rowData names(1): featureType
## colnames(3005): 1772071015_C02 1772071017_G12 ... 1772066098_A12
## 1772058148_F03
## colData names(10): tissue group # ... level1class level2class
## reducedDimNames(0):
## altExpNames(2): ERCC repeat
We usually expect (raw) count data to be labelled as "counts"
in the assays, which can be easily retrieved with the counts
accessor.
Getters and setters are also provided for exprs
, tpm
, cpm
, fpkm
and versions of these with the prefix norm_
.
str(counts(example_sce))
Row and column-level metadata are easily accessed (or modified) as shown below.
There are also dedicated getters and setters for size factor values (sizeFactors()
); reduced dimensionality results (reducedDim()
); and alternative experimental features (altExp()
).
example_sce$whee <- sample(LETTERS, ncol(example_sce), replace=TRUE)
colData(example_sce)
## DataFrame with 3005 rows and 11 columns
## tissue group # total mRNA mol well sex
## <character> <numeric> <numeric> <numeric> <numeric>
## 1772071015_C02 sscortex 1 1221 3 3
## 1772071017_G12 sscortex 1 1231 95 1
## 1772071017_A05 sscortex 1 1652 27 1
## 1772071014_B06 sscortex 1 1696 37 3
## 1772067065_H06 sscortex 1 1219 43 3
## ... ... ... ... ... ...
## 1772067059_B04 ca1hippocampus 9 1997 19 1
## 1772066097_D04 ca1hippocampus 9 1415 21 1
## 1772063068_D01 sscortex 9 1876 34 3
## 1772066098_A12 ca1hippocampus 9 1546 88 1
## 1772058148_F03 sscortex 9 1970 15 3
## age diameter cell_id level1class level2class
## <numeric> <numeric> <character> <character> <character>
## 1772071015_C02 2 1 1772071015_C02 interneurons Int10
## 1772071017_G12 1 353 1772071017_G12 interneurons Int10
## 1772071017_A05 1 13 1772071017_A05 interneurons Int6
## 1772071014_B06 2 19 1772071014_B06 interneurons Int10
## 1772067065_H06 6 12 1772067065_H06 interneurons Int9
## ... ... ... ... ... ...
## 1772067059_B04 4 382 1772067059_B04 endothelial-mural Peric
## 1772066097_D04 7 12 1772066097_D04 endothelial-mural Vsmc
## 1772063068_D01 7 268 1772063068_D01 endothelial-mural Vsmc
## 1772066098_A12 7 324 1772066098_A12 endothelial-mural Vsmc
## 1772058148_F03 7 6 1772058148_F03 endothelial-mural Vsmc
## whee
## <character>
## 1772071015_C02 B
## 1772071017_G12 F
## 1772071017_A05 A
## 1772071014_B06 H
## 1772067065_H06 X
## ... ...
## 1772067059_B04 P
## 1772066097_D04 T
## 1772063068_D01 H
## 1772066098_A12 K
## 1772058148_F03 E
rowData(example_sce)$stuff <- runif(nrow(example_sce))
rowData(example_sce)
## DataFrame with 20006 rows and 2 columns
## featureType stuff
## <character> <numeric>
## Tspan12 endogenous 0.776738
## Tshz1 endogenous 0.531341
## Fnbp1l endogenous 0.245747
## Adamts15 endogenous 0.841682
## Cldn12 endogenous 0.476325
## ... ... ...
## mt-Co2 mito 0.390712
## mt-Co1 mito 0.542127
## mt-Rnr2 mito 0.915390
## mt-Rnr1 mito 0.665484
## mt-Nd4l mito 0.612729
Subsetting is very convenient with this class, as both data and metadata are processed in a synchronized manner.
More details about the SingleCellExperiment
class can be found in the documentation for SingleCellExperiment package.
Count matrices stored as CSV files or equivalent can be easily read into R session using read.table()
from utils or fread()
from the data.table package.
It is advisable to coerce the resulting object into a matrix before storing it in a SingleCellExperiment
object.
For large data sets, the matrix can be read in chunk-by-chunk with progressive coercion into a sparse matrix from the Matrix package.
This is performed using the readSparseCounts()
function and reduces memory usage by not explicitly storing zeroes in memory.
Data from 10X Genomics experiments can be read in using the read10xCounts
function from the DropletUtils package.
This will automatically generate a SingleCellExperiment
with a sparse matrix, see the documentation for more details.
Transcript abundances from the kallisto
and Salmon
pseudo-aligners can be imported using methods from the tximeta package.
This produces a SummarizedExperiment
object that can be coerced into a SingleCellExperiment
simply with as(se, "SingleCellExperiment")
.
scater provides functionality for three levels of quality control (QC):
Cell-level metrics are computed by the perCellQCMetrics()
function and include:
sum
: total number of counts for the cell (i.e., the library size).detected
: the number of features for the cell that have counts above the detection limit (default of zero).subsets_X_percent
: percentage of all counts that come from the feature control set named X
.library(scater)
per.cell <- perCellQCMetrics(example_sce,
subsets=list(Mito=grep("mt-", rownames(example_sce))))
summary(per.cell$sum)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2574 8130 12913 14954 19284 63505
summary(per.cell$detected)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 785 2484 3656 3777 4929 8167
summary(per.cell$subsets_Mito_percent)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 3.992 6.653 7.956 10.290 56.955
It is often convenient to store this in the colData()
of our SingleCellExperiment
object for future reference.
(In fact, the addPerCellQC()
function will do this automatically.)
colData(example_sce) <- cbind(colData(example_sce), per.cell)
Metadata variables can be plotted against each other using the plotColData()
function, as shown below.
We expect to see an increasing number of detected genes with increasing total count.
Each point represents a cell that is coloured according to its tissue of origin.
plotColData(example_sce, x = "sum", y="detected", colour_by="tissue")
Here, we have plotted the total count for each cell against the mitochondrial content. Well-behaved cells should have a large number of expressed features and and a low percentage of expression from feature controls. High percentage expression from feature controls and few expressed features are indicative of blank and failed cells. For some variety, we have faceted by the tissue of origin.
plotColData(example_sce, x = "sum", y="subsets_Mito_percent",
other_fields="tissue") + facet_wrap(~tissue)
Column subsetting of the SingeCellExperiment
object will only retain the selected cells, thus removing low-quality or otherwise unwanted cells.
We can identify high-quality cells to retain by setting a fixed threshold on particular metrics.
For example, we could retain only cells that have at least 100,000 total counts and at least 500 expressed features:
keep.total <- example_sce$sum > 1e5
keep.n <- example_sce$detected > 500
filtered <- example_sce[,keep.total & keep.n]
dim(filtered)
## [1] 20006 0
The isOutlier
function provides a more data-adaptive way of choosing these thresholds.
This defines the threshold at a certain number of median absolute deviations (MADs) away from the median.
Values beyond this threshold are considered outliers and can be filtered out, assuming that they correspond to low-quality cells.
Here, we define small outliers (using type="lower"
) for the log-total counts at 3 MADs from the median.
keep.total <- isOutlier(per.cell$sum, type="lower", log=TRUE)
filtered <- example_sce[,keep.total]
Detection of outliers can be achieved more conveniently for several common metrics using the quickPerCellQC()
function.
This uses the total count, number of detected features and the percentage of counts in gene sets of diagnostic value
(e.g., mitochondrial genes, spike-in transcripts) to identify which cells to discard and for what reason.
qc.stats <- quickPerCellQC(per.cell, percent_subsets="subsets_Mito_percent")
colSums(as.matrix(qc.stats))
## low_lib_size low_n_features high_subsets_Mito_percent
## 0 3 128
## discard
## 131
filtered <- example_sce[,!qc.stats$discard]
The isOutlier
approach adjusts to experiment-specific aspects of the data, e.g., sequencing depth, amount of spike-in RNA added, cell type.
In contrast, a fixed threshold would require manual adjustment to account for changes to the experimental protocol or system.
We refer readers to the simpleSingleCell workflow for more details.
Feature-level metrics are computed by the perFeatureQCMetrics()
function and include:
mean
: the mean count of the gene/feature across all cells.detected
: the percentage of cells with non-zero counts for each gene.subsets_Y_ratio
: ratio of mean counts between the cell control set named Y and all cells.# Pretending that the first 10 cells are empty wells, for demonstration.
per.feat <- perFeatureQCMetrics(example_sce, subsets=list(Empty=1:10))
summary(per.feat$mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0007 0.0097 0.1338 0.7475 0.5763 732.1524
summary(per.feat$detected)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03328 0.76539 9.01830 18.87800 31.24792 99.96672
summary(per.feat$subsets_Empty_ratio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.601 1.872 2.016 300.500
A more refined calculation of the average is provided by the calculateAverage()
function,
which adjusts the counts by the relative library size (or size factor) prior to taking the mean.
ave <- calculateAverage(example_sce)
summary(ave)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0002 0.0109 0.1443 0.7475 0.5674 850.6880
We can also compute the number of cells expressing a gene directly.
summary(nexprs(example_sce, byrow=TRUE))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 23.0 271.0 567.3 939.0 3004.0
We look at a plot that shows the top 50 (by default) most-expressed features.
Each row in the plot below corresponds to a gene, and each bar corresponds to the expression of a gene in a single cell.
The circle indicates the median expression of each gene, with which genes are sorted.
By default, “expression” is defined using the feature counts (if available), but other expression values can be used instead by changing exprs_values
.
plotHighestExprs(example_sce, exprs_values = "counts")
We expect to see the “usual suspects”, i.e., mitochondrial genes, actin, ribosomal protein, MALAT1. A few spike-in transcripts may also be present here, though if all of the spike-ins are in the top 50, it suggests that too much spike-in RNA was added. A large number of pseudo-genes or predicted genes may indicate problems with alignment.
Genes can be removed by row subsetting of the SingleCellExperiment
object.
For example, we can filter out features (genes) that are not expressed in any cells:
keep_feature <- nexprs(example_sce, byrow=TRUE) > 0
example_sce <- example_sce[keep_feature,]
dim(example_sce)
## [1] 20006 3005
Other filtering can be done using existing annotation. For example, ribosomal protein genes and predicted genes can be identified (and removed) using regular expressions or biotype information. Such genes are often uninteresting when the aim is to characterize population heterogeneity.
Variable-level metrics are computed by the getVarianceExplained()
function (after normalization, see below).
This calculates the percentage of variance of each gene’s expression that is explained by each variable in the colData
of the SingleCellExperiment
object.
example_sce <- logNormCounts(example_sce) # see below.
vars <- getVarianceExplained(example_sce,
variables=c("tissue", "total mRNA mol", "sex", "age"))
head(vars)
## tissue total mRNA mol sex age
## Tspan12 0.02207262 0.074086504 0.146344996 0.09472155
## Tshz1 3.36083014 0.003846487 0.001079356 0.31262288
## Fnbp1l 0.43597185 0.421086301 0.003071630 0.64964174
## Adamts15 0.54233888 0.005348505 0.030821621 0.01393787
## Cldn12 0.03506751 0.309128294 0.008341408 0.02363737
## Rxfp1 0.18559637 0.016290703 0.055646799 0.02128006
We can then use this to determine which experimental factors are contributing most to the variance in expression. This is useful for diagnosing batch effects or to quickly verify that a treatment has an effect.
plotExplanatoryVariables(vars)
The most commonly used function is logNormCounts()
, which calculates log2-transformed normalized expression values.
This is done by dividing each count by its size factor, adding a pseudo-count and log-transforming.
The resulting values can be interpreted on the same scale as log-transformed counts, and are stored in "logcounts"
.
example_sce <- logNormCounts(example_sce)
assayNames(example_sce)
## [1] "counts" "logcounts"
By default, the size factor is automatically computed from the library size of each cell using the librarySizeFactors()
function.
This calculation simply involves scaling the library sizes so that they have a mean of 1 across all cells.
However, if size factors are explicitly provided in the SingleCellExperiment
, they will be used by the normalization functions.
summary(librarySizeFactors(example_sce))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1721 0.5437 0.8635 1.0000 1.2895 4.2466
Alternatively, we can calculate counts-per-million using the aptly-named calculateCPM()
function.
The output is most appropriately stored as an assay named "cpm"
in the assays of the SingleCellExperiment
object.
Related functions include calculateTPM()
and calculateFPKM()
, which do pretty much as advertised.
cpm(example_sce) <- calculateCPM(example_sce)
Of course, users can construct any arbitrary matrix of the same dimensions as the count matrix and store it as an assay.
assay(example_sce, "normed") <- normalizeCounts(example_sce,
size_factors=runif(ncol(example_sce)), pseudo_count=1.5)
The aggregateAcrossCells()
function is helpful for aggregating expression values across groups of cells.
For example, we might wish to sum together counts for all cells in the same cluster,
possibly to use as a summary statistic for downstream analyses (e.g., for differential expression with edgeR).
This will also perform the courtesy of sensibly aggregating the column metadata for downstream use.
agg_sce <- aggregateAcrossCells(example_sce, ids=example_sce$level1class)
head(assay(agg_sce))
## astrocytes_ependymal endothelial-mural interneurons microglia
## Tspan12 74 138 186 14
## Tshz1 96 105 335 43
## Fnbp1l 89 169 689 33
## Adamts15 2 23 64 0
## Cldn12 50 27 160 5
## Rxfp1 0 3 74 0
## oligodendrocytes pyramidal CA1 pyramidal SS
## Tspan12 99 269 58
## Tshz1 297 65 332
## Fnbp1l 239 859 1056
## Adamts15 19 11 15
## Cldn12 214 411 273
## Rxfp1 13 48 30
colData(agg_sce)[,c("ids", "ncells")]
## DataFrame with 7 rows and 2 columns
## ids ncells
## <character> <integer>
## astrocytes_ependymal astrocytes_ependymal 224
## endothelial-mural endothelial-mural 235
## interneurons interneurons 290
## microglia microglia 98
## oligodendrocytes oligodendrocytes 820
## pyramidal CA1 pyramidal CA1 939
## pyramidal SS pyramidal SS 399
It is similarly possible to sum across multiple factors, as shown below for the cell type and the tissue of origin. This yields one column per combination of cell type and tissue, which allows us to conveniently perform downstream analyses with both factors.
agg_sce <- aggregateAcrossCells(example_sce,
ids=colData(example_sce)[,c("level1class", "tissue")])
head(assay(agg_sce))
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## Tspan12 23 51 10 128 52 134 0 14 10 89 269 58
## Tshz1 38 58 10 95 90 245 1 42 50 247 65 332
## Fnbp1l 23 66 9 160 227 462 3 30 18 221 859 1056
## Adamts15 0 2 2 21 15 49 0 0 0 19 11 15
## Cldn12 5 45 0 27 61 99 0 5 22 192 411 273
## Rxfp1 0 0 0 3 3 71 0 0 1 12 48 30
colData(agg_sce)[,c("level1class", "tissue", "ncells")]
## DataFrame with 12 rows and 3 columns
## level1class tissue ncells
## <character> <character> <integer>
## 1 astrocytes_ependymal ca1hippocampus 81
## 2 astrocytes_ependymal sscortex 143
## 3 endothelial-mural ca1hippocampus 33
## 4 endothelial-mural sscortex 202
## 5 interneurons ca1hippocampus 126
## ... ... ... ...
## 8 microglia sscortex 84
## 9 oligodendrocytes ca1hippocampus 121
## 10 oligodendrocytes sscortex 699
## 11 pyramidal CA1 ca1hippocampus 939
## 12 pyramidal SS sscortex 399
Summation across rows may occasionally be useful for obtaining a measure of the activity of a gene set, e.g., in a pathway.
Given a list of gene sets, we can use the sumCountsAcrossFeatures()
function to aggregate expression values across features.
This is usually best done by averaging the log-expression values as shown below.
agg_feat <- sumCountsAcrossFeatures(example_sce,
ids=list(GeneSet1=1:10, GeneSet2=11:50, GeneSet3=1:100),
average=TRUE, exprs_values="logcounts")
agg_feat[,1:10]
## 1772071015_C02 1772071017_G12 1772071017_A05 1772071014_B06
## GeneSet1 0.7717764 0.2177634 0.7577241 0.5380195
## GeneSet2 1.0602489 0.9955551 1.2558629 1.1482845
## GeneSet3 1.1410273 1.0018648 1.2727650 1.1995992
## 1772067065_H06 1772071017_E02 1772067065_B07 1772067060_B09
## GeneSet1 0.4987205 0.3528216 0.5262005 0.0749138
## GeneSet2 1.2010854 1.2855797 1.2889492 1.1806065
## GeneSet3 1.1244981 1.1567887 1.1057770 1.1353704
## 1772071014_E04 1772071015_D04
## GeneSet1 0.7267285 0.7924899
## GeneSet2 1.1728326 1.2871506
## GeneSet3 1.0098231 1.1846439
Similar functions are available to compute the number or proportion of cells with detectable expression in each group.
The plotExpression()
function makes it easy to plot expression values for a subset of genes or features.
This can be particularly useful for further examination of features identified from differential expression testing, pseudotime analysis or other analyses.
By default, it uses expression values in the "logcounts"
assay, but this can be changed through the exprs_values
argument.
plotExpression(example_sce, rownames(example_sce)[1:6], x = "level1class")
Setting x
will determine the covariate to be shown on the x-axis.
This can be a field in the column metadata or the name of a feature (to obtain the expression profile across cells).
Categorical covariates will yield grouped violins as shown above, with one panel per feature.
By comparison, continuous covariates will generate a scatter plot in each panel, as shown below.
plotExpression(example_sce, rownames(example_sce)[1:6],
x = rownames(example_sce)[10])
The points can also be coloured, shaped or resized by the column metadata or expression values.
plotExpression(example_sce, rownames(example_sce)[1:6],
x = "level1class", colour_by="tissue")
Directly plotting the gene expression without any x
or other visual parameters will generate a set of grouped violin plots, coloured in an aesthetically pleasing manner.
plotExpression(example_sce, rownames(example_sce)[1:6])
Principal components analysis (PCA) is often performed to denoise and compact the data prior to downstream analyses.
The runPCA()
function provides a simple wrapper around the base machinery in BiocSingular for computing PCs from log-transformed expression values.
This stores the output in the reducedDims
slot of the SingleCellExperiment
, which can be easily retrieved (along with the percentage of variance explained by each PC) as shown below:
example_sce <- runPCA(example_sce)
str(reducedDim(example_sce, "PCA"))
## num [1:3005, 1:50] 15.4 15 17.2 16.9 18.4 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:3005] "1772071015_C02" "1772071017_G12" "1772071017_A05" "1772071014_B06" ...
## ..$ : chr [1:50] "PC1" "PC2" "PC3" "PC4" ...
## - attr(*, "percentVar")= num [1:50] 39.72 9.38 4.25 3.9 2.76 ...
## - attr(*, "rotation")= num [1:500, 1:50] -0.1471 -0.1146 -0.1084 -0.0958 -0.0953 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:500] "Plp1" "Trf" "Mal" "Apod" ...
## .. ..$ : chr [1:50] "PC1" "PC2" "PC3" "PC4" ...
By default, runPCA()
uses the top 500 genes with the highest variances to compute the first PCs.
This can be tuned by specifying subset_row
to pass in an explicit set of genes of interest,
and by using ncomponents
to determine the number of components to compute.
The name
argument can also be used to change the name of the result in the reducedDims
slot.
example_sce <- runPCA(example_sce, name="PCA2",
subset_row=rownames(example_sce)[1:1000],
ncomponents=25)
str(reducedDim(example_sce, "PCA2"))
## num [1:3005, 1:25] 20 21 23 23.7 21.5 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:3005] "1772071015_C02" "1772071017_G12" "1772071017_A05" "1772071014_B06" ...
## ..$ : chr [1:25] "PC1" "PC2" "PC3" "PC4" ...
## - attr(*, "percentVar")= num [1:25] 22.3 5.11 3.42 1.69 1.58 ...
## - attr(*, "rotation")= num [1:1000, 1:25] 2.24e-04 8.52e-05 2.43e-02 5.92e-04 6.35e-03 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:1000] "Tspan12" "Tshz1" "Fnbp1l" "Adamts15" ...
## .. ..$ : chr [1:25] "PC1" "PC2" "PC3" "PC4" ...
\(t\)-distributed stochastic neighbour embedding (\(t\)-SNE) is widely used for visualizing complex single-cell data sets.
The same procedure described for PCA plots can be applied to generate \(t\)-SNE plots using plotTSNE
, with coordinates obtained using runTSNE
via the Rtsne package.
We strongly recommend generating plots with different random seeds and perplexity values, to ensure that any conclusions are robus
t to different visualizations.
# Perplexity of 10 just chosen here arbitrarily.
set.seed(1000)
example_sce <- runTSNE(example_sce, perplexity=10)
head(reducedDim(example_sce, "TSNE"))
## [,1] [,2]
## 1772071015_C02 -49.70928 -12.34376
## 1772071017_G12 -52.60643 -10.83418
## 1772071017_A05 -49.34541 -12.32723
## 1772071014_B06 -52.17084 -10.59969
## 1772067065_H06 -51.75725 -10.20202
## 1772071017_E02 -52.71614 -10.29159
A more common pattern involves using the pre-existing PCA results as input into the \(t\)-SNE algorithm. This is useful as it improves speed by using a low-rank approximation of the expression matrix; and reduces random noise, by focusing on the major factors of variation. The code below uses the first 10 dimensions of the previously computed PCA result to perform the \(t\)-SNE.
set.seed(1000)
example_sce <- runTSNE(example_sce, perplexity=50,
dimred="PCA", n_dimred=10)
head(reducedDim(example_sce, "TSNE"))
## [,1] [,2]
## 1772071015_C02 -38.05331 0.9025655
## 1772071017_G12 -36.02916 -0.1065038
## 1772071017_A05 -38.45340 1.3981213
## 1772071014_B06 -38.64621 -0.2184595
## 1772067065_H06 -41.02368 -0.8125667
## 1772071017_E02 -38.60500 2.6343767
The same can be done for uniform manifold with approximate projection (UMAP) via the runUMAP()
function, itself based on the uwot package.
example_sce <- runUMAP(example_sce)
head(reducedDim(example_sce, "UMAP"))
## [,1] [,2]
## 1772071015_C02 -12.04324 -2.072196
## 1772071017_G12 -12.11942 -2.135925
## 1772071017_A05 -11.93046 -2.005134
## 1772071014_B06 -12.08696 -2.123133
## 1772067065_H06 -12.11039 -2.148298
## 1772071017_E02 -12.10755 -2.138869
Any dimensionality reduction result can be plotted using the plotReducedDim
function.
Here, each point represents a cell and is coloured according to its cell type label.
plotReducedDim(example_sce, dimred = "PCA", colour_by = "level1class")
Some result types have dedicated wrappers for convenience, e.g., plotTSNE()
for \(t\)-SNE results:
plotTSNE(example_sce, colour_by = "Snap25")
The dedicated plotPCA()
function also adds the percentage of variance explained to the axes:
plotPCA(example_sce, colour_by="Mog")
Multiple components can be plotted in a series of pairwise plots. When more than two components are plotted, the diagonal boxes in the scatter plot matrix show the density for each component.
example_sce <- runPCA(example_sce, ncomponents=20)
plotPCA(example_sce, ncomponents = 4, colour_by = "level1class")
We separate the execution of these functions from the plotting to enable the same coordinates to be re-used across multiple plots. This avoids repeatedly recomputing those coordinates just to change an aesthetic across plots.
We provide some helper functions to easily convert from a SingleCellExperiment
object to a data.frame
for use in, say, ggplot2 functions.
This allows users to create highly customized plots that are not covered by the existing scater functions.
The ggcells()
function will intelligently retrieve fields from the colData()
, assays()
, altExps()
or reducedDims()
to create a single data.frame
for immediate use.
In the example below, we create boxplots of Snap25 expression stratified by cell type and tissue of origin:
ggcells(example_sce, mapping=aes(x=level1class, y=Snap25)) +
geom_boxplot() +
facet_wrap(~tissue)
Reduced dimension results are easily pulled in to create customized equivalents of the plotReducedDim()
output.
In this example, we create a \(t\)-SNE plot faceted by tissue and coloured by Snap25 expression:
ggcells(example_sce, mapping=aes(x=TSNE.1, y=TSNE.2, colour=Snap25)) +
geom_point() +
stat_density_2d() +
facet_wrap(~tissue) +
scale_colour_distiller(direction=1)
It is also straightforward to examine the relationship between the size factors on the normalized gene expression:
ggcells(example_sce, mapping=aes(x=sizeFactor, y=Actb)) +
geom_point() +
geom_smooth()
Similar operations can be performed on the features using the ggfeatures()
function,
which will retrieve values either from rowData
or from the columns of the assays
.
Here, we examine the relationship between the feature type and the expression within a given cell;
note the renaming of the otherwise syntactically invalid cell name.
colnames(example_sce) <- make.names(colnames(example_sce))
ggfeatures(example_sce, mapping=aes(x=featureType, y=X1772062111_E06)) +
geom_violin()
SCESet
classAs of July 2017, scater
has switched from the SCESet
class previously defined within the package to the more widely applicable SingleCellExperiment
class.
From Bioconductor 3.6 (October 2017), the release version of scater
will use SingleCellExperiment
.
SingleCellExperiment
is a more modern and robust class that provides a common data structure used by many single-cell Bioconductor packages.
Advantages include support for sparse data matrices and the capability for on-disk storage of data to minimise memory usage for large single-cell datasets.
It should be straight-forward to convert existing scripts based on SCESet
objects to SingleCellExperiment
objects, with key changes outlined immediately below.
toSingleCellExperiment
and updateSCESet
(for backwards compatibility) can be used to convert an old SCESet
object to a SingleCellExperiment
object;SingleCellExperiment
object with the function SingleCellExperiment
(actually less fiddly than creating a new SCESet
);scater
functions have been refactored to take SingleCellExperiment
objects, so once data is in a SingleCellExperiment
object, the user experience is almost identical to that with the SCESet
class.Users may need to be aware of the following when updating their own scripts:
colnames
function (instead of sampleNames
or cellNames
for an SCESet
object);rownames
function (instead of featureNames
);phenoData
in an SCESet
, corresponds to colData
in a SingleCellExperiment
object and is accessed/assigned with the colData
function (this replaces the pData
function);$
operator (e.g. sce$sum
);featureData
in an SCESet
, corresponds to rowData
in a SingleCellExperiment
object and is accessed/assigned with the rowData
function (this replaces the fData
function);plotScater
, which produces a cumulative expression, overview plot, replaces
the generic plot
function for SCESet
objects.sessionInfo()
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.11-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.11-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] scater_1.16.2 ggplot2_3.3.2
## [3] scRNAseq_2.2.0 SingleCellExperiment_1.10.1
## [5] SummarizedExperiment_1.18.1 DelayedArray_0.14.0
## [7] matrixStats_0.56.0 Biobase_2.48.0
## [9] GenomicRanges_1.40.0 GenomeInfoDb_1.24.2
## [11] IRanges_2.22.2 S4Vectors_0.26.1
## [13] BiocGenerics_0.34.0 BiocStyle_2.16.0
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-148 bitops_1.0-6
## [3] bit64_0.9-7 RColorBrewer_1.1-2
## [5] httr_1.4.1 tools_4.0.0
## [7] R6_2.4.1 irlba_2.3.3
## [9] vipor_0.4.5 mgcv_1.8-31
## [11] uwot_0.1.8 DBI_1.1.0
## [13] colorspace_1.4-1 withr_2.2.0
## [15] tidyselect_1.1.0 gridExtra_2.3
## [17] bit_1.1-15.2 curl_4.3
## [19] compiler_4.0.0 BiocNeighbors_1.6.0
## [21] isoband_0.2.2 labeling_0.3
## [23] bookdown_0.20 scales_1.1.1
## [25] rappdirs_0.3.1 stringr_1.4.0
## [27] digest_0.6.25 rmarkdown_2.3
## [29] XVector_0.28.0 pkgconfig_2.0.3
## [31] htmltools_0.5.0 dbplyr_1.4.4
## [33] fastmap_1.0.1 rlang_0.4.6
## [35] RSQLite_2.2.0 FNN_1.1.3
## [37] shiny_1.5.0 DelayedMatrixStats_1.10.0
## [39] generics_0.0.2 farver_2.0.3
## [41] BiocParallel_1.22.0 dplyr_1.0.0
## [43] RCurl_1.98-1.2 magrittr_1.5
## [45] BiocSingular_1.4.0 GenomeInfoDbData_1.2.3
## [47] Matrix_1.2-18 Rcpp_1.0.4.6
## [49] ggbeeswarm_0.6.0 munsell_0.5.0
## [51] viridis_0.5.1 lifecycle_0.2.0
## [53] stringi_1.4.6 yaml_2.2.1
## [55] MASS_7.3-51.6 zlibbioc_1.34.0
## [57] Rtsne_0.15 BiocFileCache_1.12.0
## [59] AnnotationHub_2.20.0 grid_4.0.0
## [61] blob_1.2.1 promises_1.1.1
## [63] ExperimentHub_1.14.0 crayon_1.3.4
## [65] lattice_0.20-41 splines_4.0.0
## [67] cowplot_1.0.0 magick_2.4.0
## [69] knitr_1.29 pillar_1.4.4
## [71] glue_1.4.1 BiocVersion_3.11.1
## [73] evaluate_0.14 BiocManager_1.30.10
## [75] vctrs_0.3.1 httpuv_1.5.4
## [77] gtable_0.3.0 purrr_0.3.4
## [79] assertthat_0.2.1 xfun_0.15
## [81] rsvd_1.0.3 mime_0.9
## [83] xtable_1.8-4 RSpectra_0.16-0
## [85] later_1.1.0.1 viridisLite_0.3.0
## [87] tibble_3.0.1 AnnotationDbi_1.50.0
## [89] beeswarm_0.2.3 memoise_1.1.0
## [91] ellipsis_0.3.1 interactiveDisplayBase_1.26.3