Chapter 4 DE analyses between conditions

4.1 Motivation

A powerful use of scRNA-seq technology lies in the design of replicated multi-condition experiments to detect population-specific changes in expression between conditions. For example, a researcher could use this strategy to detect gene expression changes for each cell type after drug treatment (Richard et al. 2018) or genetic modifications (Scialdone et al. 2016). This provides more biological insight than conventional scRNA-seq experiments involving only one biological condition, especially if we can relate population changes to specific experimental perturbations. Here, we will focus on differential expression analyses of replicated multi-condition scRNA-seq experiments. Our aim is to find significant changes in expression between conditions for cells of the same type that are present in both conditions.

4.2 Setting up the data

Our demonstration scRNA-seq dataset was generated from chimeric mouse embryos at the E8.5 developmental stage (Pijuan-Sala et al. 2019). Each chimeric embryo was generated by injecting td-Tomato-positive embryonic stem cells (ESCs) into a wild-type (WT) blastocyst. Unlike in previous experiments (Scialdone et al. 2016), there is no genetic difference between the injected and background cells other than the expression of td-Tomato in the former. Instead, the aim of this “wild-type chimera” study is to determine whether the injection procedure itself introduces differences in lineage commitment compared to the background cells.

The experiment used a paired design with three replicate batches of two samples each. Specifically, each batch contains one sample consisting of td-Tomato positive cells and another consisting of negative cells, obtained by fluorescence-activated cell sorting from a single pool of dissociated cells from 6-7 chimeric embryos. For each sample, scRNA-seq data was generated using the 10X Genomics protocol (Zheng et al. 2017) to obtain 2000-7000 cells.

#--- loading ---#
library(MouseGastrulationData)
sce.chimera <- WTChimeraData(samples=5:10)
sce.chimera

#--- feature-annotation ---#
library(scater)
rownames(sce.chimera) <- uniquifyFeatureNames(
    rowData(sce.chimera)$ENSEMBL, rowData(sce.chimera)$SYMBOL)

#--- quality-control ---#
drop <- sce.chimera$celltype.mapped %in% c("stripped", "Doublet")
sce.chimera <- sce.chimera[,!drop]

#--- normalization ---#
sce.chimera <- logNormCounts(sce.chimera)

#--- variance-modelling ---#
library(scran)
dec.chimera <- modelGeneVar(sce.chimera, block=sce.chimera$sample)
chosen.hvgs <- dec.chimera$bio > 0

#--- merging ---#
library(batchelor)
set.seed(01001001)
merged <- correctExperiments(sce.chimera, 
    batch=sce.chimera$sample, 
    subset.row=chosen.hvgs,
    PARAM=FastMnnParam(
        merge.order=list(
            list(1,3,5), # WT (3 replicates)
            list(2,4,6)  # td-Tomato (3 replicates)
        )
    )
)

#--- clustering ---#
g <- buildSNNGraph(merged, use.dimred="corrected")
clusters <- igraph::cluster_louvain(g)
colLabels(merged) <- factor(clusters$membership)

#--- dimensionality-reduction ---#
merged <- runTSNE(merged, dimred="corrected", external_neighbors=TRUE)
merged <- runUMAP(merged, dimred="corrected", external_neighbors=TRUE)
merged
## class: SingleCellExperiment 
## dim: 14700 19426 
## metadata(2): merge.info pca.info
## assays(3): reconstructed counts logcounts
## rownames(14700): Xkr4 Rp1 ... Vmn2r122 CAAA01147332.1
## rowData names(3): rotation ENSEMBL SYMBOL
## colnames(19426): cell_9769 cell_9770 ... cell_30701 cell_30702
## colData names(13): batch cell ... sizeFactor label
## reducedDimNames(5): corrected pca.corrected.E7.5 pca.corrected.E8.5
##   TSNE UMAP
## mainExpName: NULL
## altExpNames(0):

The differential analyses in this chapter will be predicated on many of the pre-processing steps covered previously. For brevity, we will not explicitly repeat them here, only noting that we have already merged cells from all samples into the same coordinate system (Chapter 1) and clustered the merged dataset to obtain a common partitioning across all samples (Basic Chapter 5). A brief inspection of the results indicates that clusters contain similar contributions from all batches with only modest differences associated with td-Tomato expression (Figure 4.1).

library(scater)
table(colLabels(merged), merged$tomato)
##     
##      FALSE TRUE
##   1    147  299
##   2    364  509
##   3    419  634
##   4   1034  724
##   5    645  558
##   6    915 1131
##   7    535  397
##   8    473  403
##   9    256  321
##   10    72  183
##   11   226  197
##   12   256  247
##   13   586  421
##   14   421  311
##   15  1018  443
##   16   473  211
##   17   583  573
##   18   650  596
##   19   200  220
##   20   215  216
##   21   155    1
##   22    61   50
##   23   297  231
##   24    47   57
##   25    83   78
##   26   142   84
##   27    58    0
table(colLabels(merged), merged$pool)
##     
##         3    4    5
##   1   106  115  225
##   2   184  243  446
##   3   115  292  646
##   4   245  657  856
##   5   132  668  403
##   6   269  562 1215
##   7   222  170  540
##   8   228  176  472
##   9   102  107  368
##   10  116   54   85
##   11  151   85  187
##   12  187  113  203
##   13  236  233  538
##   14  179  169  384
##   15  131  510  820
##   16   78  163  443
##   17  209  382  565
##   18  121  307  818
##   19   69  134  217
##   20   79   99  253
##   21    6   84   66
##   22   25   30   56
##   23   86  170  272
##   24   16   31   57
##   25   27   29  105
##   26    3   10  213
##   27    2   51    5
gridExtra::grid.arrange(
    plotTSNE(merged, colour_by="tomato", text_by="label"),
    plotTSNE(merged, colour_by=data.frame(pool=factor(merged$pool))),
    ncol=2
)
$t$-SNE plot of the WT chimeric dataset, where each point represents a cell and is colored according to td-Tomato expression (left) or batch of origin (right). Cluster numbers are superimposed based on the median coordinate of cells assigned to that cluster.

Figure 4.1: \(t\)-SNE plot of the WT chimeric dataset, where each point represents a cell and is colored according to td-Tomato expression (left) or batch of origin (right). Cluster numbers are superimposed based on the median coordinate of cells assigned to that cluster.

Ordinarily, we would be obliged to perform marker detection to assign biological meaning to these clusters. For simplicity, we will skip this step by directly using the existing cell type labels provided by Pijuan-Sala et al. (2019). These were obtained by mapping the cells in this dataset to a larger, pre-annotated “atlas” of mouse early embryonic development. While there are obvious similarities, we see that many of our clusters map to multiple labels and vice versa (Figure 4.2), which reflects the difficulties in unambiguously resolving cell types undergoing differentiation.

library(bluster)
pairwiseRand(colLabels(merged), merged$celltype.mapped, "index")
## [1] 0.5727
by.label <- table(colLabels(merged), merged$celltype.mapped)
pheatmap::pheatmap(log2(by.label+1), color=viridis::viridis(101))
Heatmap showing the abundance of cells with each combination of cluster (row) and cell type label (column). The color scale represents the log~2~-count for each combination.

Figure 4.2: Heatmap showing the abundance of cells with each combination of cluster (row) and cell type label (column). The color scale represents the log2-count for each combination.

4.3 Creating pseudo-bulk samples

The most obvious differential analysis is to look for changes in expression between conditions. We perform the DE analysis separately for each label to identify cell type-specific transcriptional effects of injection. The actual DE testing is performed on “pseudo-bulk” expression profiles (Tung et al. 2017), generated by summing counts together for all cells with the same combination of label and sample. This leverages the resolution offered by single-cell technologies to define the labels, and combines it with the statistical rigor of existing methods for DE analyses involving a small number of samples.

# Using 'label' and 'sample' as our two factors; each column of the output
# corresponds to one unique combination of these two factors.
summed <- aggregateAcrossCells(merged, 
    id=colData(merged)[,c("celltype.mapped", "sample")])
summed
## class: SingleCellExperiment 
## dim: 14700 186 
## metadata(2): merge.info pca.info
## assays(1): counts
## rownames(14700): Xkr4 Rp1 ... Vmn2r122 CAAA01147332.1
## rowData names(3): rotation ENSEMBL SYMBOL
## colnames: NULL
## colData names(16): batch cell ... sample ncells
## reducedDimNames(5): corrected pca.corrected.E7.5 pca.corrected.E8.5
##   TSNE UMAP
## mainExpName: NULL
## altExpNames(0):

At this point, it is worth reflecting on the motivations behind the use of pseudo-bulking:

  • Larger counts are more amenable to standard DE analysis pipelines designed for bulk RNA-seq data. Normalization is more straightforward and certain statistical approximations are more accurate e.g., the saddlepoint approximation for quasi-likelihood methods or normality for linear models.
  • Collapsing cells into samples reflects the fact that our biological replication occurs at the sample level (Lun and Marioni 2017). Each sample is represented no more than once for each condition, avoiding problems from unmodelled correlations between samples. Supplying the per-cell counts directly to a DE analysis pipeline would imply that each cell is an independent biological replicate, which is not true from an experimental perspective. (A mixed effects model can handle this variance structure but involves extra statistical and computational complexity for little benefit, see Crowell et al. (2019).)
  • Variance between cells within each sample is masked, provided it does not affect variance across (replicate) samples. This avoids penalizing DEGs that are not uniformly up- or down-regulated for all cells in all samples of one condition. Masking is generally desirable as DEGs - unlike marker genes - do not need to have low within-sample variance to be interesting, e.g., if the treatment effect is consistent across replicate populations but heterogeneous on a per-cell basis. Of course, high per-cell variability will still result in weaker DE if it affects the variability across populations, while homogeneous per-cell responses will result in stronger DE due to a larger population-level log-fold change. These effects are also largely desirable.

4.4 Performing the DE analysis

Our DE analysis will be performed using quasi-likelihood (QL) methods from the edgeR package (Robinson, McCarthy, and Smyth 2010; Chen, Lun, and Smyth 2016). This uses a negative binomial generalized linear model (NB GLM) to handle overdispersed count data in experiments with limited replication. In our case, we have biological variation with three paired replicates per condition, so edgeR or its contemporaries is a natural choice for the analysis.

We do not use all labels for GLM fitting as the strong DE between labels makes it difficult to compute a sensible average abundance to model the mean-dispersion trend. Moreover, label-specific batch effects would not be easily handled with a single additive term in the design matrix for the batch. Instead, we arbitrarily pick one of the labels to use for this demonstration.

label <- "Mesenchyme"
current <- summed[,label==summed$celltype.mapped]

# Creating up a DGEList object for use in edgeR:
library(edgeR)
y <- DGEList(counts(current), samples=colData(current))
y
## An object of class "DGEList"
## $counts
##        Sample1 Sample2 Sample3 Sample4 Sample5 Sample6
## Xkr4         2       0       0       0       3       0
## Rp1          0       0       1       0       0       0
## Sox17        7       0       3       0      14       9
## Mrpl15    1420     271    1009     379    1578     749
## Rgs20        3       0       1       1       0       0
## 14695 more rows ...
## 
## $samples
##         group lib.size norm.factors batch cell barcode sample stage tomato pool
## Sample1     1  4607113            1     5 <NA>    <NA>      5  E8.5   TRUE    3
## Sample2     1  1064981            1     6 <NA>    <NA>      6  E8.5  FALSE    3
## Sample3     1  2494039            1     7 <NA>    <NA>      7  E8.5   TRUE    4
## Sample4     1  1028679            1     8 <NA>    <NA>      8  E8.5  FALSE    4
## Sample5     1  4290259            1     9 <NA>    <NA>      9  E8.5   TRUE    5
## Sample6     1  1950858            1    10 <NA>    <NA>     10  E8.5  FALSE    5
##         stage.mapped celltype.mapped closest.cell doub.density sizeFactor label
## Sample1         <NA>      Mesenchyme         <NA>           NA         NA  <NA>
## Sample2         <NA>      Mesenchyme         <NA>           NA         NA  <NA>
## Sample3         <NA>      Mesenchyme         <NA>           NA         NA  <NA>
## Sample4         <NA>      Mesenchyme         <NA>           NA         NA  <NA>
## Sample5         <NA>      Mesenchyme         <NA>           NA         NA  <NA>
## Sample6         <NA>      Mesenchyme         <NA>           NA         NA  <NA>
##         celltype.mapped.1 sample.1 ncells
## Sample1        Mesenchyme        5    286
## Sample2        Mesenchyme        6     55
## Sample3        Mesenchyme        7    243
## Sample4        Mesenchyme        8    134
## Sample5        Mesenchyme        9    478
## Sample6        Mesenchyme       10    299

A typical step in bulk RNA-seq data analyses is to remove samples with very low library sizes due to failed library preparation or sequencing. The very low counts in these samples can be troublesome in downstream steps such as normalization (Basic Chapter 2) or for some statistical approximations used in the DE analysis. In our situation, this is equivalent to removing label-sample combinations that have very few or lowly-sequenced cells. The exact definition of “very low” will vary, but in this case, we remove combinations containing fewer than 10 cells (Crowell et al. 2019). Alternatively, we could apply the outlier-based strategy described in Basic Chapter 1, but this makes the strong assumption that all label-sample combinations have similar numbers of cells that are sequenced to similar depth. We defer to the usual diagnostics for bulk DE analyses to decide whether a particular pseudo-bulk profile should be removed.

discarded <- current$ncells < 10
y <- y[,!discarded]
summary(discarded)
##    Mode   FALSE 
## logical       6

Another typical step in bulk RNA-seq analyses is to remove genes that are lowly expressed. This reduces computational work, improves the accuracy of mean-variance trend modelling and decreases the severity of the multiple testing correction. Here, we use the filterByExpr() function from edgeR to remove genes that are not expressed above a log-CPM threshold in a minimum number of samples (determined from the size of the smallest treatment group in the experimental design).

keep <- filterByExpr(y, group=current$tomato)
y <- y[keep,]
summary(keep)
##    Mode   FALSE    TRUE 
## logical    9011    5689

Finally, we correct for composition biases by computing normalization factors with the trimmed mean of M-values method (Robinson and Oshlack 2010). We do not need the bespoke single-cell methods described in Basic Chapter 2, as the counts for our pseudo-bulk samples are large enough to apply bulk normalization methods. (Note that edgeR normalization factors are closely related but not the same as the size factors described elsewhere in this book. Size factors are proportional to the product of the normalization factors and the library sizes.)

y <- calcNormFactors(y)
y$samples
##         group lib.size norm.factors batch cell barcode sample stage tomato pool
## Sample1     1  4607113       1.0684     5 <NA>    <NA>      5  E8.5   TRUE    3
## Sample2     1  1064981       1.0487     6 <NA>    <NA>      6  E8.5  FALSE    3
## Sample3     1  2494039       0.9582     7 <NA>    <NA>      7  E8.5   TRUE    4
## Sample4     1  1028679       0.9774     8 <NA>    <NA>      8  E8.5  FALSE    4
## Sample5     1  4290259       0.9707     9 <NA>    <NA>      9  E8.5   TRUE    5
## Sample6     1  1950858       0.9817    10 <NA>    <NA>     10  E8.5  FALSE    5
##         stage.mapped celltype.mapped closest.cell doub.density sizeFactor label
## Sample1         <NA>      Mesenchyme         <NA>           NA         NA  <NA>
## Sample2         <NA>      Mesenchyme         <NA>           NA         NA  <NA>
## Sample3         <NA>      Mesenchyme         <NA>           NA         NA  <NA>
## Sample4         <NA>      Mesenchyme         <NA>           NA         NA  <NA>
## Sample5         <NA>      Mesenchyme         <NA>           NA         NA  <NA>
## Sample6         <NA>      Mesenchyme         <NA>           NA         NA  <NA>
##         celltype.mapped.1 sample.1 ncells
## Sample1        Mesenchyme        5    286
## Sample2        Mesenchyme        6     55
## Sample3        Mesenchyme        7    243
## Sample4        Mesenchyme        8    134
## Sample5        Mesenchyme        9    478
## Sample6        Mesenchyme       10    299

As part of the usual diagnostics for a bulk RNA-seq DE analysis, we generate a mean-difference (MD) plot for each normalized pseudo-bulk profile (Figure 4.3). This should exhibit a trumpet shape centered at zero indicating that the normalization successfully removed systematic bias between profiles. Lack of zero-centering or dominant discrete patterns at low abundances may be symptomatic of deeper problems with normalization, possibly due to insufficient cells/reads/UMIs composing a particular pseudo-bulk profile.

par(mfrow=c(2,3))
for (i in seq_len(ncol(y))) {
    plotMD(y, column=i)
}
Mean-difference plots of the normalized expression values for each pseudo-bulk sample against the average of all other samples.

Figure 4.3: Mean-difference plots of the normalized expression values for each pseudo-bulk sample against the average of all other samples.

We also generate a multi-dimensional scaling (MDS) plot for the pseudo-bulk profiles (Figure 4.4). This is closely related to PCA and allows us to visualize the structure of the data in a manner similar to that described in Basic Chapter 4 (though we rarely have enough pseudo-bulk profiles to make use of techniques like \(t\)-SNE). Here, the aim is to check whether samples separate by our known factors of interest - in this case, injection status. Strong separation foreshadows a large number of DEGs in the subsequent analysis.

plotMDS(cpm(y, log=TRUE), 
    col=ifelse(y$samples$tomato, "red", "blue"))
MDS plot of the pseudo-bulk log-normalized CPMs, where each point represents a sample and is colored by the tomato status.

Figure 4.4: MDS plot of the pseudo-bulk log-normalized CPMs, where each point represents a sample and is colored by the tomato status.

We set up the design matrix to block on the batch-to-batch differences across different embryo pools, while retaining an additive term that represents the effect of injection. The latter is represented in our model as the log-fold change in gene expression in td-Tomato-positive cells over their negative counterparts within the same label. Our aim is to test whether this log-fold change is significantly different from zero.

design <- model.matrix(~factor(pool) + factor(tomato), y$samples)
design
##         (Intercept) factor(pool)4 factor(pool)5 factor(tomato)TRUE
## Sample1           1             0             0                  1
## Sample2           1             0             0                  0
## Sample3           1             1             0                  1
## Sample4           1             1             0                  0
## Sample5           1             0             1                  1
## Sample6           1             0             1                  0
## attr(,"assign")
## [1] 0 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$`factor(pool)`
## [1] "contr.treatment"
## 
## attr(,"contrasts")$`factor(tomato)`
## [1] "contr.treatment"

We estimate the negative binomial (NB) dispersions with estimateDisp(). The role of the NB dispersion is to model the mean-variance trend (Figure 4.5), which is not easily accommodated by QL dispersions alone due to the quadratic nature of the NB mean-variance trend.

y <- estimateDisp(y, design)
summary(y$trended.dispersion)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0103  0.0167  0.0213  0.0202  0.0234  0.0266
plotBCV(y)
Biological coefficient of variation (BCV) for each gene as a function of the average abundance. The BCV is computed as the square root of the NB dispersion after empirical Bayes shrinkage towards the trend. Trended and common BCV estimates are shown in blue and red, respectively.

Figure 4.5: Biological coefficient of variation (BCV) for each gene as a function of the average abundance. The BCV is computed as the square root of the NB dispersion after empirical Bayes shrinkage towards the trend. Trended and common BCV estimates are shown in blue and red, respectively.

We also estimate the quasi-likelihood dispersions with glmQLFit() (Chen, Lun, and Smyth 2016). This fits a GLM to the counts for each gene and estimates the QL dispersion from the GLM deviance. We set robust=TRUE to avoid distortions from highly variable clusters (Phipson et al. 2016). The QL dispersion models the uncertainty and variability of the per-gene variance (Figure 4.6) - which is not well handled by the NB dispersions, so the two dispersion types complement each other in the final analysis.

fit <- glmQLFit(y, design, robust=TRUE)
summary(fit$var.prior)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.303   1.001   1.095   1.026   1.149   1.177
summary(fit$df.prior)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.242  12.252  12.252  11.956  12.252  12.252
plotQLDisp(fit)
QL dispersion estimates for each gene as a function of abundance. Raw estimates (black) are shrunk towards the trend (blue) to yield squeezed estimates (red).

Figure 4.6: QL dispersion estimates for each gene as a function of abundance. Raw estimates (black) are shrunk towards the trend (blue) to yield squeezed estimates (red).

We test for differences in expression due to injection using glmQLFTest(). DEGs are defined as those with non-zero log-fold changes at a false discovery rate of 5%. Very few genes are significantly DE, indicating that injection has little effect on the transcriptome of mesenchyme cells. (Note that this logic is somewhat circular, as a large transcriptional effect may have caused cells of this type to be re-assigned to a different label. We discuss this in more detail in Section 6.4.1 below.)

res <- glmQLFTest(fit, coef=ncol(design))
summary(decideTests(res))
##        factor(tomato)TRUE
## Down                    8
## NotSig               5673
## Up                      8
topTags(res)
## Coefficient:  factor(tomato)TRUE 
##            logFC logCPM       F    PValue       FDR
## Phlda2   -4.3876  9.934 1608.60 4.624e-16 2.631e-12
## Erdr1     2.0662  8.833  346.36 2.118e-11 6.026e-08
## Mid1      1.5165  6.931  119.13 2.618e-08 4.964e-05
## Akr1e1   -1.7266  5.128   84.65 2.247e-07 2.961e-04
## Kcnq1ot1  1.3825  7.242   83.08 2.722e-07 2.961e-04
## H13      -1.0599  7.540   80.22 3.123e-07 2.961e-04
## Zdbf2     1.8021  6.797   83.18 7.582e-07 6.162e-04
## Xist     -5.5784  7.522  295.93 2.305e-06 1.639e-03
## Asb4     -0.9203  7.341   53.01 3.630e-06 2.294e-03
## Impact    0.8509  7.353   49.69 5.237e-06 2.979e-03

4.5 Putting it all together

4.5.1 Looping across labels

Now that we have laid out the theory underlying the DE analysis, we repeat this process for each of the labels to identify injection-induced DE in each cell type. This is conveniently done using the pseudoBulkDGE() function from scran, which will loop over all labels and apply the exact analysis described above to each label. Users can also set method="voom" to perform an equivalent analysis using the voom() pipeline from limma - see Workflow Section 8.9 for the full set of function calls.

# Removing all pseudo-bulk samples with 'insufficient' cells.
summed.filt <- summed[,summed$ncells >= 10]

library(scran)
de.results <- pseudoBulkDGE(summed.filt, 
    label=summed.filt$celltype.mapped,
    design=~factor(pool) + tomato,
    coef="tomatoTRUE",
    condition=summed.filt$tomato 
)

The function returns a list of DataFrames containing the DE results for each label. Each DataFrame also contains the intermediate edgeR objects used in the DE analyses, which can be used to generate any of previously described diagnostic plots (Figure 4.7). It is often wise to generate these plots to ensure that any interesting results are not compromised by technical issues.

cur.results <- de.results[["Allantois"]]
cur.results[order(cur.results$PValue),]
## DataFrame with 14700 rows and 5 columns
##                               logFC    logCPM         F      PValue         FDR
##                           <numeric> <numeric> <numeric>   <numeric>   <numeric>
## Phlda2                     -2.49017  12.58170  1239.233 6.72351e-23 3.23670e-19
## Xist                       -7.98184   8.00187  1196.966 1.71686e-21 4.13248e-18
## Erdr1                       1.93844   9.07313   293.076 2.90970e-15 4.66910e-12
## Slc22a18                   -4.33835   4.04375   122.660 3.71575e-11 4.47190e-08
## Slc38a4                     0.89182  10.24084   112.383 9.23191e-11 8.88849e-08
## ...                             ...       ...       ...         ...         ...
## Ccl27a_ENSMUSG00000095247        NA        NA        NA          NA          NA
## CR974586.5                       NA        NA        NA          NA          NA
## AC132444.6                       NA        NA        NA          NA          NA
## Vmn2r122                         NA        NA        NA          NA          NA
## CAAA01147332.1                   NA        NA        NA          NA          NA
y.allantois <- metadata(cur.results)$y
plotBCV(y.allantois)
Biological coefficient of variation (BCV) for each gene as a function of the average abundance for the allantois pseudo-bulk analysis. Trended and common BCV estimates are shown in blue and red, respectively.

Figure 4.7: Biological coefficient of variation (BCV) for each gene as a function of the average abundance for the allantois pseudo-bulk analysis. Trended and common BCV estimates are shown in blue and red, respectively.

We list the labels that were skipped due to the absence of replicates or contrasts. If it is necessary to extract statistics in the absence of replicates, several strategies can be applied such as reducing the complexity of the model or using a predefined value for the NB dispersion. We refer readers to the edgeR user’s guide for more details.

metadata(de.results)$failed
## [1] "Blood progenitors 1" "Caudal epiblast"     "Caudal neurectoderm"
## [4] "ExE ectoderm"        "Parietal endoderm"   "Stripped"

4.5.2 Cross-label meta-analyses

We examine the numbers of DEGs at a FDR of 5% for each label using the decideTestsPerLabel() function. In general, there seems to be very little differential expression that is introduced by injection. Note that genes listed as NA were either filtered out as low-abundance genes for a given label’s analysis, or the comparison of interest was not possible for a particular label, e.g., due to lack of residual degrees of freedom or an absence of samples from both conditions.

is.de <- decideTestsPerLabel(de.results, threshold=0.05)
summarizeTestsPerLabel(is.de)
##                                -1    0  1    NA
## Allantois                      24 4765 25  9886
## Blood progenitors 2             0 2474  1 12225
## Cardiomyocytes                  5 4363  5 10327
## Caudal Mesoderm                 2 1742  0 12956
## Def. endoderm                   6 1393  2 13299
## Endothelium                     3 3221  7 11469
## Erythroid1                      9 2779 16 11896
## Erythroid2                      5 3390  8 11297
## Erythroid3                     11 5054 13  9622
## ExE mesoderm                    4 5094 12  9590
## Forebrain/Midbrain/Hindbrain    9 6224 13  8454
## Gut                             5 4484  5 10206
## Haematoendothelial progenitors  7 4098 12 10583
## Intermediate mesoderm           4 3071  5 11620
## Mesenchyme                      8 5673  8  9011
## NMP                             7 4104 13 10576
## Neural crest                    6 3311  8 11375
## Paraxial mesoderm               4 4756  6  9934
## Pharyngeal mesoderm             2 5083  9  9606
## Rostral neurectoderm            4 3336  3 11357
## Somitic mesoderm                8 2939 21 11732
## Spinal cord                     7 4590  9 10094
## Surface ectoderm               10 5556  8  9126

For each gene, we compute the percentage of cell types in which that gene is upregulated or downregulated upon injection. Here, we consider a gene to be non-DE if it is not retained after filtering. We see that Xist is consistently downregulated in the injected cells; this is consistent with the fact that the injected cells are male while the background cells are derived from pools of male and female embryos, due to experimental difficulties with resolving sex at this stage. The consistent downregulation of Phlda2 and Cdkn1c in the injected cells is also interesting given that both are imprinted genes. However, some of these commonalities may be driven by shared contamination from ambient RNA - we discuss this further in Section 5.

# Upregulated across most cell types.
up.de <- is.de > 0 & !is.na(is.de)
head(sort(rowMeans(up.de), decreasing=TRUE), 10)
##     Mid1    Erdr1    Mcts2   Impact     Nnat Kcnq1ot1  Slc38a4     Hopx 
##   0.8696   0.8696   0.6087   0.6087   0.5217   0.5217   0.4348   0.3913 
##    Zdbf2     Peg3 
##   0.3478   0.2609
# Downregulated across cell types.
down.de <- is.de < 0 & !is.na(is.de)
head(sort(rowMeans(down.de), decreasing=TRUE), 10)
##          Xist        Phlda2        Cdkn1c        Akr1e1           H13 
##       0.73913       0.73913       0.69565       0.69565       0.60870 
##         Wfdc2 B230312C02Rik B930036N10Rik       S100a11         Pink1 
##       0.21739       0.13043       0.08696       0.08696       0.08696

To identify label-specific DE, we use the pseudoBulkSpecific() function to test for significant differences from the average log-fold change over all other labels. More specifically, the null hypothesis for each label and gene is that the log-fold change lies between zero and the average log-fold change of the other labels. If a gene rejects this null for our label of interest, we can conclude that it exhibits DE that is more extreme or of the opposite sign compared to that in the majority of other labels (Figure 4.8). This approach is effectively a poor man’s interaction model that sacrifices the uncertainty of the average for an easier compute. We note that, while the difference from the average is a good heuristic, there is no guarantee that the top genes are truly label-specific; comparable DE in a subset of the other labels may be offset by weaker effects when computing the average.

de.specific <- pseudoBulkSpecific(summed.filt,
    label=summed.filt$celltype.mapped,
    design=~factor(pool) + tomato,
    coef="tomatoTRUE",
    condition=summed.filt$tomato
)

cur.specific <- de.specific[["Allantois"]]
cur.specific <- cur.specific[order(cur.specific$PValue),]
cur.specific
## DataFrame with 14700 rows and 6 columns
##                               logFC    logCPM         F      PValue         FDR
##                           <numeric> <numeric> <numeric>   <numeric>   <numeric>
## Slc22a18                  -4.338349   4.04375  122.6597 3.71575e-11 1.78876e-07
## Acta2                     -0.834920   9.12478   56.7378 2.08796e-07 5.02572e-04
## Mxd4                      -1.421750   5.64602   52.6483 8.49144e-07 1.36259e-03
## Rbp4                       1.891342   4.35443   31.5479 7.36549e-06 8.86437e-03
## Myl9                      -0.988264   6.24832   32.2410 2.42493e-05 2.31751e-02
## ...                             ...       ...       ...         ...         ...
## Ccl27a_ENSMUSG00000095247        NA        NA        NA          NA          NA
## CR974586.5                       NA        NA        NA          NA          NA
## AC132444.6                       NA        NA        NA          NA          NA
## Vmn2r122                         NA        NA        NA          NA          NA
## CAAA01147332.1                   NA        NA        NA          NA          NA
##                           OtherAverage
##                              <numeric>
## Slc22a18                            NA
## Acta2                       -0.0231938
## Mxd4                        -0.1529783
## Rbp4                        -0.0853716
## Myl9                        -0.0908101
## ...                                ...
## Ccl27a_ENSMUSG00000095247           NA
## CR974586.5                          NA
## AC132444.6                          NA
## Vmn2r122                            NA
## CAAA01147332.1                      NA
sizeFactors(summed.filt) <- NULL
plotExpression(logNormCounts(summed.filt),
    features="Rbp4",
    x="tomato", colour_by="tomato",
    other_fields="celltype.mapped") +
    facet_wrap(~celltype.mapped)
Distribution of summed log-expression values for _Rbp4_ in each label of the chimeric embryo dataset. Each facet represents a label with distributions stratified by injection status.

Figure 4.8: Distribution of summed log-expression values for Rbp4 in each label of the chimeric embryo dataset. Each facet represents a label with distributions stratified by injection status.

For greater control over the identification of label-specific DE, we can use the output of decideTestsPerLabel() to identify genes that are significant in our label of interest yet not DE in any other label. As hypothesis tests are not typically geared towards identifying genes that are not DE, we use an ad hoc approach where we consider a gene to be consistent with the null hypothesis for a label if it fails to be detected at a generous FDR threshold of 50%. We demonstrate this approach below by identifying injection-induced DE genes that are unique to the allantois. It is straightforward to tune the selection, e.g., to genes that are DE in no more than 90% of other labels by simply relaxing the threshold used to construct not.de.other, or to genes that are DE across multiple labels of interest but not in the rest, and so on.

# Finding all genes that are not remotely DE in all other labels.
remotely.de <- decideTestsPerLabel(de.results, threshold=0.5)
not.de <- remotely.de==0 | is.na(remotely.de)
not.de.other <- rowMeans(not.de[,colnames(not.de)!="Allantois"])==1

# Intersecting with genes that are DE inthe allantois.
unique.degs <- is.de[,"Allantois"]!=0 & not.de.other
unique.degs <- names(which(unique.degs))

# Inspecting the results.
de.allantois <- de.results$Allantois
de.allantois <- de.allantois[unique.degs,]
de.allantois <- de.allantois[order(de.allantois$PValue),]
de.allantois
## DataFrame with 6 rows and 5 columns
##              logFC    logCPM         F      PValue         FDR
##          <numeric> <numeric> <numeric>   <numeric>   <numeric>
## Slc22a18 -4.338349   4.04375  122.6597 3.71575e-11 4.47190e-08
## Rbp4      1.891342   4.35443   31.5479 7.36549e-06 1.61170e-03
## Cfc1     -0.949326   5.74759   24.2460 4.48304e-05 6.64097e-03
## H3f3b     0.320930  12.04010   21.4501 9.58675e-05 1.30397e-02
## Cryab    -0.983425   5.28419   21.0006 1.08837e-04 1.37880e-02
## Alcam     1.279990   4.70533   17.3437 3.22500e-04 3.30323e-02

The main caveat is that differences in power between labels require some caution when interpreting label specificity. For example, Figure 4.9 shows that the top-ranked allantois-specific gene exhibits some evidence of DE in other labels but was not detected for various reasons like low abundance or insufficient replicates. A more correct but complex approach would be to fit a interaction model to the pseudo-bulk profiles for each pair of labels, where the interaction is between the coefficient of interest and the label identity; this is left as an exercise for the reader.

plotExpression(logNormCounts(summed.filt), 
    features="Slc22a18",
    x="tomato", colour_by="tomato", 
    other_fields="celltype.mapped") + 
    facet_wrap(~celltype.mapped)
Distribution of summed log-expression values for each label in the chimeric embryo dataset. Each facet represents a label with distributions stratified by injection status.

Figure 4.9: Distribution of summed log-expression values for each label in the chimeric embryo dataset. Each facet represents a label with distributions stratified by injection status.

4.6 Testing for between-label differences

The above examples focus on testing for differences in expression between conditions for the same cell type or label. However, the same methodology can be applied to test for differences between cell types across samples. This kind of DE analysis overcomes the lack of suitable replication discussed in Advanced Section 6.4.2. To demonstrate, say we want to test for DEGs between the neural crest and notochord samples. We subset our summed counts to those two cell types and we run the edgeR workflow via pseudoBulkDGE().

summed.sub <- summed[,summed$celltype.mapped %in% c("Neural crest", "Notochord")]

# Using a dummy value for the label to allow us to include multiple cell types
# in the fitted model; otherwise, each cell type will be processed separately.
between.res <- pseudoBulkDGE(summed.sub,
    label=rep("dummy", ncol(summed.sub)),
    design=~factor(sample) + celltype.mapped,
    coef="celltype.mappedNotochord")[[1]]

table(Sig=between.res$FDR <= 0.05, Sign=sign(between.res$logFC))
##        Sign
## Sig       -1    1
##   FALSE 2177 1609
##   TRUE   757  234
between.res[order(between.res$PValue),]
## DataFrame with 14700 rows and 5 columns
##                               logFC    logCPM         F      PValue         FDR
##                           <numeric> <numeric> <numeric>   <numeric>   <numeric>
## T                          10.88649   7.07425   699.549 2.71654e-19 1.29769e-15
## Krt19                       8.17990   6.16367   487.402 3.05471e-17 7.29617e-14
## Krt8                        4.41066   8.43123   399.071 1.18421e-15 1.88566e-12
## Krt18                       4.71635   7.65059   387.371 1.61956e-15 1.93416e-12
## Ifitm1                      4.42639   7.48683   300.309 2.29971e-14 2.19714e-11
## ...                             ...       ...       ...         ...         ...
## Ccl27a_ENSMUSG00000095247        NA        NA        NA          NA          NA
## CR974586.5                       NA        NA        NA          NA          NA
## AC132444.6                       NA        NA        NA          NA          NA
## Vmn2r122                         NA        NA        NA          NA          NA
## CAAA01147332.1                   NA        NA        NA          NA          NA

We inspect some of the top hits in more detail (Figure 4.10). As one might expect, these two cell types are quite different.

summed.sub <- logNormCounts(summed.sub, size.factors=NULL)
plotExpression(summed.sub, 
    features=head(rownames(between.res)[order(between.res$PValue)]),
    x="celltype.mapped", 
    colour_by=I(factor(summed.sub$sample)))
Distribution of the log-expression values for the top DEGs between the neural crest and notochord. Each point represents a pseudo-bulk profile and is colored by the sample of origin.

Figure 4.10: Distribution of the log-expression values for the top DEGs between the neural crest and notochord. Each point represents a pseudo-bulk profile and is colored by the sample of origin.

Whether or not this is a scientifically meaningful comparison depends on the nature of the labels. These particular labels were defined by clustering, which means that the presence of DEGs is a foregone conclusion (Advanced Section 6.4). Nonetheless, it may have some utility for applications where the labels are defined using independent information, e.g., from FACS.

The same approach can also be used to test whether the log-fold changes between two labels are significantly different between conditions. This is equivalent to testing for a significant interaction between each cell’s label and the condition of its sample of origin. The \(p\)-values are likely to be more sensible here; any artificial differences induced by clustering should cancel out between conditions, leaving behind real (and interesting) differences. Some extra effort is usually required to obtain a full-rank design matrix - this is demonstrated below to test for a significant interaction between the notochord/neural crest separation and injection status (tomato).

inter.res <- pseudoBulkDGE(summed.sub,
    label=rep("dummy", ncol(summed.sub)),
    design=function(df) {
        combined <- with(df, paste0(tomato, ".", celltype.mapped))
        combined <- make.names(combined)
        design <- model.matrix(~0 + factor(sample) + combined, df)
        design[,!grepl("Notochord", colnames(design))]
    },
    coef="combinedTRUE.Neural.crest"
)[[1]]

table(Sig=inter.res$FDR <= 0.05, Sign=sign(inter.res$logFC))
##        Sign
## Sig       -1    0    1
##   FALSE 1443   13 3321
inter.res[order(inter.res$PValue),]
## DataFrame with 14700 rows and 5 columns
##                               logFC    logCPM         F     PValue       FDR
##                           <numeric> <numeric> <numeric>  <numeric> <numeric>
## Tppp3                      -7.90184   7.58955  15.89945 0.00042930  0.939424
## Lysmd2                     -3.79154   6.16785   9.13590 0.00488994  0.939424
## Hoxaas3                    -5.18306   4.37341   8.86842 0.00586773  0.939424
## Epha2                      -8.87571   4.24067   8.36143 0.00730976  0.939424
## Ddr1                       -4.78487   4.58006   7.97487 0.00850541  0.939424
## ...                             ...       ...       ...        ...       ...
## Ccl27a_ENSMUSG00000095247        NA        NA        NA         NA        NA
## CR974586.5                       NA        NA        NA         NA        NA
## AC132444.6                       NA        NA        NA         NA        NA
## Vmn2r122                         NA        NA        NA         NA        NA
## CAAA01147332.1                   NA        NA        NA         NA        NA

Session Info

R version 4.4.0 RC (2024-04-16 r86468)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] scran_1.31.3                edgeR_4.1.33               
 [3] limma_3.59.10               bluster_1.13.0             
 [5] scater_1.31.2               ggplot2_3.5.1              
 [7] scuttle_1.13.1              BiocSingular_1.19.0        
 [9] SingleCellExperiment_1.25.1 SummarizedExperiment_1.33.3
[11] Biobase_2.63.1              GenomicRanges_1.55.4       
[13] GenomeInfoDb_1.39.14        IRanges_2.37.1             
[15] S4Vectors_0.41.7            BiocGenerics_0.49.1        
[17] MatrixGenerics_1.15.1       matrixStats_1.3.0          
[19] BiocStyle_2.31.0            rebook_1.13.0              

loaded via a namespace (and not attached):
 [1] gridExtra_2.3             CodeDepends_0.6.6        
 [3] rlang_1.1.3               magrittr_2.0.3           
 [5] compiler_4.4.0            dir.expiry_1.11.0        
 [7] DelayedMatrixStats_1.25.4 vctrs_0.6.5              
 [9] pkgconfig_2.0.3           crayon_1.5.2             
[11] fastmap_1.1.1             XVector_0.43.1           
[13] labeling_0.4.3            utf8_1.2.4               
[15] rmarkdown_2.26            graph_1.81.1             
[17] UCSC.utils_0.99.7         ggbeeswarm_0.7.2         
[19] xfun_0.43                 zlibbioc_1.49.3          
[21] cachem_1.0.8              beachmat_2.19.4          
[23] jsonlite_1.8.8            highr_0.10               
[25] DelayedArray_0.29.9       BiocParallel_1.37.1      
[27] irlba_2.3.5.1             parallel_4.4.0           
[29] cluster_2.1.6             R6_2.5.1                 
[31] bslib_0.7.0               RColorBrewer_1.1-3       
[33] jquerylib_0.1.4           Rcpp_1.0.12              
[35] bookdown_0.39             knitr_1.46               
[37] Matrix_1.7-0              splines_4.4.0            
[39] igraph_2.0.3              tidyselect_1.2.1         
[41] abind_1.4-5               yaml_2.3.8               
[43] viridis_0.6.5             codetools_0.2-20         
[45] lattice_0.22-6            tibble_3.2.1             
[47] withr_3.0.0               evaluate_0.23            
[49] pillar_1.9.0              BiocManager_1.30.22      
[51] filelock_1.0.3            generics_0.1.3           
[53] sparseMatrixStats_1.15.1  munsell_0.5.1            
[55] scales_1.3.0              glue_1.7.0               
[57] metapod_1.11.1            pheatmap_1.0.12          
[59] tools_4.4.0               BiocNeighbors_1.21.2     
[61] ScaledMatrix_1.11.1       locfit_1.5-9.9           
[63] XML_3.99-0.16.1           cowplot_1.1.3            
[65] grid_4.4.0                colorspace_2.1-0         
[67] GenomeInfoDbData_1.2.12   beeswarm_0.4.0           
[69] vipor_0.4.7               cli_3.6.2                
[71] rsvd_1.0.5                fansi_1.0.6              
[73] S4Arrays_1.3.7            viridisLite_0.4.2        
[75] dplyr_1.1.4               gtable_0.3.5             
[77] sass_0.4.9                digest_0.6.35            
[79] dqrng_0.3.2               SparseArray_1.3.7        
[81] ggrepel_0.9.5             farver_2.1.1             
[83] htmltools_0.5.8.1         lifecycle_1.0.4          
[85] httr_1.4.7                statmod_1.5.0            

References

Chen, Y., A. T. Lun, and G. K. Smyth. 2016. “From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline.” F1000Res 5: 1438.

Crowell, H. L., C. Soneson, P.-L. Germain, D. Calini, L. Collin, C. Raposo, D. Malhotra, and M. D. Robinson. 2019. “On the Discovery of Population-Specific State Transitions from Multi-Sample Multi-Condition Single-Cell Rna Sequencing Data.” bioRxiv. https://doi.org/10.1101/713412.

Lun, A. T. L., and J. C. Marioni. 2017. “Overcoming confounding plate effects in differential expression analyses of single-cell RNA-seq data.” Biostatistics 18 (3): 451–64.

Phipson, B., S. Lee, I. J. Majewski, W. S. Alexander, and G. K. Smyth. 2016. “Robust Hyperparameter Estimation Protects Against Hypervariable Genes and Improves Power to Detect Differential Expression.” Ann. Appl. Stat. 10 (2): 946–63.

Pijuan-Sala, B., J. A. Griffiths, C. Guibentif, T. W. Hiscock, W. Jawaid, F. J. Calero-Nieto, C. Mulas, et al. 2019. “A Single-Cell Molecular Map of Mouse Gastrulation and Early Organogenesis.” Nature 566 (7745): 490–95.

Richard, A. C., A. T. L. Lun, W. W. Y. Lau, B. Gottgens, J. C. Marioni, and G. M. Griffiths. 2018. “T cell cytolytic capacity is independent of initial stimulation strength.” Nat. Immunol. 19 (8): 849–58.

Robinson, M. D., D. J. McCarthy, and G. K. Smyth. 2010. “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics 26 (1): 139–40.

Robinson, M. D., and A. Oshlack. 2010. “A scaling normalization method for differential expression analysis of RNA-seq data.” Genome Biol. 11 (3): R25.

Scialdone, A., Y. Tanaka, W. Jawaid, V. Moignard, N. K. Wilson, I. C. Macaulay, J. C. Marioni, and B. Gottgens. 2016. “Resolving early mesoderm diversification through single-cell expression profiling.” Nature 535 (7611): 289–93.

Tung, P. Y., J. D. Blischak, C. J. Hsiao, D. A. Knowles, J. E. Burnett, J. K. Pritchard, and Y. Gilad. 2017. “Batch effects and the effective design of single-cell gene expression studies.” Sci. Rep. 7 (January): 39921.

Zheng, G. X., J. M. Terry, P. Belgrader, P. Ryvkin, Z. W. Bent, R. Wilson, S. B. Ziraldo, et al. 2017. “Massively parallel digital transcriptional profiling of single cells.” Nat Commun 8 (January): 14049.