scDiagnostics 0.99.6
Annotation transfer from a reference dataset for the cell type annotation of a new query single-cell RNA-sequencing (scRNA-seq) experiment is an integral component of the typical analysis workflow. The approach provides a fast, automated, and reproducible alternative to the manual annotation of cell clusters based on marker gene expression. However, dataset imbalance and undiagnosed incompatibilities between the query and reference datasets can lead to erroneous annotation and distort downstream applications.
The scDiagnostics package provides functionality for the systematic evaluation of cell type assignments in scRNA-seq data. scDiagnostics offers a suite of diagnostic functions to assess whether both (query and reference) datasets are aligned, ensuring that annotations can be transferred reliably. scDiagnostics also provides functionality to assess annotation ambiguity, cluster heterogeneity, and marker gene alignment. The implemented functionality helps researchers to determine how accurately cells from a new scRNA-seq experiment can be assigned to known cell types.
To install the development version of the package from GitHub, use the following command:
BiocManager::install("ccb-hms/scDiagnostics")
NOTE: you will need the remotes package to install from GitHub.
To build the package vignettes upon installation use:
BiocManager::install("ccb-hms/scDiagnostics",
build_vignettes = TRUE,
dependencies = TRUE)
To explore the capabilities of the scDiagnostics package, you can load your own data or utilize publicly available datasets obtained from the scRNAseq R package. In this guide, we will demonstrate how to use scDiagnostics with such datasets, which serve as valuable resources for exploring the package and assessing the appropriateness of cell type assignments.
library(scDiagnostics)
library(scRNAseq)
library(scater)
library(scran)
library(scuttle)
library(SingleR)
library(AUCell)
library(celldex)
Here, we use the Human Primary Cell Atlas (Mabbott et al. 2013) as the reference dataset; the query dataset consists of haematopoietic stem and progenitor cells from Bunis DG et al. (2021).
In scRNA-seq studies, assessing the quality of cells is important for accurate downstream analyses. At the same time, assigning accurate cell type labels based on gene expression profiles is an integral aspect of scRNA-seq data interpretation. Generally, these two steps are performed independently of each other. The rationale behind the plotQCvsAnnotation function is to inspect whether certain QC (Quality Control) criteria impact the confidence level of cell type annotations.
For instance, it is reasonable to hypothesize that higher library sizes could contribute to increased annotation confidence due to enhanced statistical power for identifying cell type-specific gene expression patterns, as evident in the scatter plot below.
# load reference dataset
ref_data <- fetchReference("hpca", "2024-02-26")
# Load query dataset (Bunis haematopoietic stem and progenitor cell
# data) from Bunis DG et al. (2021). Single-Cell Mapping of
# Progressive Fetal-to-Adult Transition in Human Naive T Cells. Cell
# Rep. 34(1): 108573
query_data <- BunisHSPCData()
rownames(query_data) <- rowData(query_data)$Symbol
# Add QC metrics to query data
query_data <- addPerCellQCMetrics(query_data)
# Log transform query dataset
query_data <- logNormCounts(query_data)
# Run SingleR to predict cell types
pred <- SingleR(query_data, ref_data, labels = ref_data$label.main)
# Assign predicted labels to query data
colData(query_data)$pred.labels <- pred$labels
# Get annotation scores
scores <- apply(pred$scores, 1, max)
# Assign scores to query data
colData(query_data)$cell_scores <- scores
# Create a scatter plot between library size and annotation scores
p1 <- plotQCvsAnnotation(query_data = query_data,
qc_col = "total",
label_col = "pred.labels",
score_col = "cell_scores",
label = NULL)
p1 + xlab("Library Size")
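The visual impression from the scatter plot can be complemented with a rank correlation between the QC metric and the annotation score. The following is a minimal sketch on simulated values; `library_size` and `annotation_score` are hypothetical stand-ins for the `total` and `cell_scores` columns, and the numbers are not results from the Bunis data.

```r
# Simulated stand-ins for the "total" and "cell_scores" columns (toy data only)
set.seed(1)
library_size <- round(rlnorm(200, meanlog = 8, sdlog = 0.5))
annotation_score <- 0.2 + 0.05 * log(library_size) + rnorm(200, sd = 0.03)

# Spearman correlation is rank-based, so it is not dominated by the
# right-skewed distribution of library sizes
qc_score_cor <- cor(library_size, annotation_score, method = "spearman")
qc_score_cor
```

With the objects created above, the same call would presumably take `query_data$total` and `query_data$cell_scores` as its two arguments.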
However, certain QC metrics, such as the proportion of mitochondrial genes, may require careful consideration as they can sometimes be associated with cellular states or functions rather than noise. The interpretation of mitochondrial content should be context-specific and informed by biological knowledge.
In the next analysis, we investigate the relationship between mitochondrial percentage and cell type annotation scores using liver tissue data from He S et al. (2020). Notably, we observe high annotation scores for macrophages and monocytes. These findings align with the established biological characteristic of high mitochondrial activity in macrophages and monocytes, adding biological context to our results.
# load query dataset
query_data <- HeOrganAtlasData(tissue = c("Liver"), ensembl = FALSE, location = TRUE)
# Add QC metrics to query data
mito_genes <- rownames(query_data)[grep("^MT-", rownames(query_data))]
query_data <- unfiltered <- addPerCellQC(query_data, subsets = list(mt = mito_genes))
qc <- quickPerCellQC(colData(query_data), sub.fields = "subsets_mt_percent")
query_data <- query_data[,!qc$discard]
# Log transform query dataset
query_data <- logNormCounts(query_data)
# Run SingleR to predict cell types
pred <- SingleR(query_data, ref_data, labels = ref_data$label.main)
# Assign predicted labels to query data
colData(query_data)$pred.labels <- pred$labels
# Get annotation scores
scores <- apply(pred$scores, 1, max)
# Assign scores to query data
colData(query_data)$cell_scores <- scores
# Create a new column for the labels so it is easy to distinguish
# between macrophages, monocytes and other cells
query_data$label_category <-
ifelse(query_data$pred.labels %in% c("Macrophage", "Monocyte"),
query_data$pred.labels,
"Other Cells")
# Define custom colors for cell type labels
cols <- c("Other Cells" = "grey", "Macrophage" = "green", "Monocyte" = "red")
# Generate scatter plot for all cell types
p2 <- plotQCvsAnnotation(query_data = query_data,
qc_col = "subsets_mt_percent",
label_col = "label_category",
score_col = "cell_scores",
label = NULL)
p2 + scale_color_manual(values = cols) +
xlab("Subsets Mitochondrial Percentage (%)")
In addition to the scatter plot, we can gain further insights into the gene expression profiles by visualizing the distribution of user-defined QC statistics and annotation scores for all cell types or for specific cell types. This allows us to examine the variation and patterns in expression levels and scores across cells assigned to the cell type of interest.
To accomplish this, we create two separate histograms. The first histogram displays the distribution of the annotation scores.
The second histogram visualizes the distribution of the QC statistic. This provides insights into the overall gene expression levels for the specific cell type. In this example, we examine the percentage of mitochondrial genes.
By examining the histograms, we can observe the range, shape, and potential outliers in the distribution of both annotation scores and QC stats. This allows us to assess the appropriateness of the cell type assignments and identify any potential discrepancies or patterns in the gene expression profiles for the specific cell type.
# Generate histogram
histQCvsAnnotation(query_data = query_data, qc_col = "subsets_mt_percent",
label_col = "pred.labels",
score_col = "cell_scores",
label = NULL)
The right-skewed distribution of mitochondrial percentages and the left-skewed distribution of annotation scores in the histograms above suggest that most cells have lower mitochondrial contamination and higher confidence in their assigned cell types.
The plotMarkerExpression function helps users explore the distribution of expression values for a specific gene of interest across all cells in both the reference and query datasets, as well as within specific cell types. This helps to evaluate whether the distributions are similar or aligned between the datasets. Discrepancies in distribution patterns may indicate potential incompatibilities or differences between the datasets.
The function also allows users to narrow down their analysis to specific cell types of interest. This enables investigation of whether alignment between the query and reference datasets is consistent not only at a global level but also within specific cell types.
# Load data
sce <- HeOrganAtlasData(tissue = c("Marrow"), ensembl = FALSE)
# Divide the data into reference and query datasets
set.seed(100)
indices <- sample(ncol(assay(sce)), size = floor(0.7 * ncol(assay(sce))), replace = FALSE)
ref_data <- sce[, indices]
query_data <- sce[, -indices]
# Log-transform datasets
ref_data <- logNormCounts(ref_data)
query_data <- logNormCounts(query_data)
# Get cell type scores using SingleR
pred <- SingleR(query_data, ref_data, labels = ref_data$reclustered.broad)
# Assign labels to query data
colData(query_data)$labels <- pred$labels
# Generate density plots
plotMarkerExpression(reference_data = ref_data,
query_data = query_data,
ref_cell_type_col = "reclustered.broad",
query_cell_type_col = "labels",
gene_name = "MS4A1",
label = "B_and_plasma")
In the provided example, we examined the distribution of expression values for the gene MS4A1, a marker for naive B cells, in both the query and reference datasets. We also looked at the distribution of MS4A1 expression in the B_and_plasma cell type. We observed overlapping distributions in both cases, suggesting alignment between the reference and query datasets.
To gain insights into the gene expression patterns and their representation in a dimensional reduction space, we can utilize the plotGeneExpressionDimred function. This function allows us to plot the gene expression values of a specific gene on a dimensional reduction plot generated using methods like t-SNE, UMAP, or PCA. Each single cell is color-coded based on its expression level of the gene of interest.
In the provided example, we are visualizing the gene expression values of the gene “VPREB3” on a PCA plot. The PCA plot represents the cells in a lower-dimensional space, where the x-axis corresponds to the first principal component (Dimension 1) and the y-axis corresponds to the second principal component (Dimension 2). Each cell is represented as a point on the plot, and its color reflects the expression level of the gene “VPREB3,” ranging from low (lighter color) to high (darker color).
# Run PCA on the query data
query_data <- runPCA(query_data)
# Generate dimension reduction plot color-coded by gene expression
# plotGeneExpressionDimred(se_object = query_data,
# method = "PCA",
# pc_subset = c(1:5),
# feature = "VPREB3")
The dimensional reduction plot allows us to observe how the gene expression of VPREB3 is distributed across the cells and whether any clusters or patterns emerge in the data.
In addition to examining individual gene expression patterns, it is often useful to assess the collective activity of gene sets or pathways within single cells. This can provide insights into the functional states or biological processes associated with specific cell types or conditions. To facilitate this analysis, the scDiagnostics package includes a function called plotGeneSetScores that enables the visualization of gene set or pathway scores on a dimensional reduction plot.
The plotGeneSetScores function allows you to plot gene set or pathway scores on a dimensional reduction plot generated using methods such as PCA, t-SNE, or UMAP. Each single cell is color-coded based on its scores for specific gene sets or pathways. This visualization helps identify the heterogeneity and patterns of gene set or pathway activity within the dataset, potentially revealing subpopulations with distinct functional characteristics.
# Compute scores using AUCell
expression_matrix <- assay(query_data, "logcounts")
cells_rankings <- AUCell_buildRankings(expression_matrix, plotStats = FALSE)
# Generate gene sets
gene_set1 <- sample(rownames(expression_matrix), 10)
gene_set2 <- sample(rownames(expression_matrix), 20)
gene_sets <- list(geneSet1 = gene_set1,
geneSet2 = gene_set2)
# Calculate AUC scores for gene sets
cells_AUC <- AUCell_calcAUC(gene_sets, cells_rankings)
# Assign scores to colData
colData(query_data)$geneSetScores <- assay(cells_AUC)["geneSet1", ]
# Plot gene set scores on PCA
plotGeneSetScores(se_object = query_data,
method = "PCA",
feature = "geneSetScores",
pc_subset = c(1:6))
In the provided example, we demonstrate the usage of the plotGeneSetScores function using the AUCell package to compute gene set or pathway scores. Custom gene sets are generated for demonstration purposes, but users can provide their own gene set scores using any method of their choice. It is important to ensure that the scores are assigned to the colData of the reference or query object and specify the correct feature name for visualization.
By visualizing gene set or pathway scores on a dimensional reduction plot, you can gain a comprehensive understanding of the functional landscape within your single-cell gene expression dataset and explore the relationships between gene set activities and cellular phenotypes.
Here we assess the alignment between the reference and query datasets in terms of highly variable genes (HVGs). We calculate the overlap coefficient between the sets of highly variable genes in the reference and query datasets. The overlap coefficient quantifies the degree of overlap between these two sets of genes: a value closer to 1 indicates a higher degree of overlap, while a value closer to 0 suggests less overlap. The computed overlap coefficient is printed, providing a numerical measure of how well the highly variable genes in the reference and query datasets align. In this case, the overlap coefficient is 0.63, indicating a moderate level of overlap.
# Get top HVG genes on each dataset
ref_var <- getTopHVGs(ref_data, n = 2000)
query_var <- getTopHVGs(query_data, n = 2000)
# Compute the overlap coefficient
overlap_coefficient <- calculateHVGOverlap(reference_genes = ref_var,
query_genes = query_var)
overlap_coefficient
#> [1] 0.63
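For readers unfamiliar with the metric, the overlap coefficient can be sketched in a few lines of base R. This assumes the standard Szymkiewicz-Simpson definition (size of the intersection divided by the size of the smaller set); the text above does not spell out the formula used by calculateHVGOverlap, so `overlap_coef` and the toy gene sets below are hypothetical illustrations rather than the package's implementation.

```r
# Overlap (Szymkiewicz-Simpson) coefficient between two vectors of gene names
overlap_coef <- function(genes_a, genes_b) {
  genes_a <- unique(genes_a)
  genes_b <- unique(genes_b)
  length(intersect(genes_a, genes_b)) / min(length(genes_a), length(genes_b))
}

# Toy gene sets (hypothetical, for illustration only)
ref_genes <- c("CD3E", "CD4", "CD8A", "MS4A1", "NKG7")
query_genes <- c("CD3E", "CD8A", "MS4A1", "GNLY")

# Three shared genes, smaller set has four genes
overlap_toy <- overlap_coef(ref_genes, query_genes)
overlap_toy
#> [1] 0.75
```

Dividing by the smaller set means that a gene list fully contained in the other yields a coefficient of 1, regardless of how large the other list is.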
The calculateVarImpOverlap function in this package is designed to identify and compare the most important genes for differentiating cell types between a query dataset and a reference dataset using the Random Forest algorithm. This comparison helps in understanding the overlap and consistency of key gene markers across different datasets.
This function applies the Random Forest algorithm to compute the importance of genes in differentiating between cell types within both a reference dataset and a query dataset. It then compares the top genes identified in both datasets to assess the overlap in their importance scores. This can be particularly useful for researchers aiming to validate their findings across different single-cell RNA-seq datasets, ensuring the robustness of identified markers and improving the reliability of subsequent analyses and annotations.
# Intersect the gene symbols to obtain common genes
common_genes <- intersect(ref_var, query_var)
# Select desired cell types
selected_cell_types <- c("CD4", "CD8", "B_and_plasma")
ref_data_subset <- ref_data[common_genes, ref_data$reclustered.broad %in% selected_cell_types]
query_data_subset <- query_data[common_genes, query_data$labels %in% selected_cell_types]
# Compute important variables for all pairwise cell comparisons
var_imp_overlap <- calculateVarImpOverlap(ref_data_subset,
query_data_subset,
ref_cell_type_col = "reclustered.broad",
query_cell_type_col = "labels",
n_top = 50)
# Comparison table
var_imp_overlap$var_imp_comparison
#> B_and_plasma-CD8 B_and_plasma-CD4 CD8-CD4
#> 0.80 0.78 0.76
The resulting overlap in the top genes’ importance scores for differentiating CD4+ T cells from CD8+ T cells between the reference and query datasets is 0.76. This implies that 76% of the top 50 genes identified as important in distinguishing these two cell types in the reference dataset are also identified as important in the query dataset.
This moderate overlap indicates some level of consistency in the gene markers identified across the datasets, but it also highlights potential differences in the biological or technical conditions between the datasets. Factors such as batch effects, differences in sample preparation, or inherent biological variability could contribute to these differences. Consequently, while some key markers for CD4+ and CD8+ T cells are reliably detected in both datasets, additional validation and possibly further investigation into the causes of variability are warranted to ensure robust and accurate cell type differentiation. This finding underscores the importance of cross-dataset comparisons to validate and refine marker genes in single-cell transcriptomic studies.
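The comparison step can be illustrated without fitting any model. The sketch below assumes the overlap is the fraction of top-ranked genes shared between the two datasets, which matches the interpretation given above; the importance values are made up and `top_n_overlap` is a hypothetical helper, not a function of the package.

```r
# Toy importance scores per gene (made-up numbers, not Random Forest output)
ref_importance <- c(CD8A = 0.9, CD4 = 0.8, CD8B = 0.7,
                    IL7R = 0.4, CCR7 = 0.2, LTB = 0.1)
query_importance <- c(CD8A = 0.8, CD8B = 0.75, CD4 = 0.3,
                      IL7R = 0.6, CCR7 = 0.1, LTB = 0.5)

# Fraction of the top-n genes shared between the two rankings
top_n_overlap <- function(imp_a, imp_b, n_top) {
  top_a <- names(sort(imp_a, decreasing = TRUE))[seq_len(n_top)]
  top_b <- names(sort(imp_b, decreasing = TRUE))[seq_len(n_top)]
  length(intersect(top_a, top_b)) / n_top
}

# Reference top 3: CD8A, CD4, CD8B; query top 3: CD8A, CD8B, IL7R
toy_overlap <- top_n_overlap(ref_importance, query_importance, n_top = 3)
toy_overlap
```

In this toy case two of the three top genes coincide, so the overlap is two thirds.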
This function performs Multidimensional Scaling (MDS) analysis on the query and reference datasets to examine their similarity. The dissimilarity matrix is calculated based on the correlation between the datasets, representing the distances between cells in terms of gene expression patterns. MDS is then applied to derive low-dimensional coordinates for each cell. Subsequently, a scatter plot is generated, where each data point represents a cell, and cell types are color-coded using custom colors provided by the user. This visualization enables the comparison of cell type distributions between the query and reference datasets in a reduced-dimensional space.
The rationale behind this function is to visually assess the alignment and relationships between cell types in the query and reference datasets.
# Generate the MDS scatter plot with cell type coloring
visualizeCellTypeMDS(query_data = query_data_subset,
reference_data = ref_data_subset,
query_cell_type_col = "labels",
ref_cell_type_col = "reclustered.broad")
Upon examining the MDS scatter plot, we observe that the CD4 and CD8 cell types overlap to some extent. By observing the proximity or overlap of different cell types, one can gain insights into their potential relationships or shared characteristics.
The selection of custom genes and desired cell types depends on the user’s research interests and goals. It allows for flexibility in focusing on specific genes and examining particular cell types of interest in the visualization.
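The two steps described for the MDS analysis (a correlation-based dissimilarity followed by classical scaling) can be sketched with base R. The `1 - correlation` transform and the use of Spearman correlation are assumptions made for this illustration, and the expression matrix is simulated.

```r
set.seed(3)
# Toy expression matrix: 20 genes x 12 cells (simulated values)
expr <- matrix(rnorm(20 * 12), nrow = 20,
               dimnames = list(paste0("gene", 1:20), paste0("cell", 1:12)))

# Correlation-based dissimilarity between cells (columns of the matrix)
dissim <- as.dist(1 - cor(expr, method = "spearman"))

# Classical MDS down to two dimensions, one row of coordinates per cell
mds_coords <- cmdscale(dissim, k = 2)
dim(mds_coords)
#> [1] 12  2
```

The two columns of `mds_coords` are what would be drawn on the x- and y-axes, with points colored by cell type and dataset of origin.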
The visualizeCellTypePCA function is designed to provide a clear visual representation of the principal components for different cell types in both query and reference datasets. This visualization helps in comparing the cell type distributions and identifying potential differences or similarities between datasets.
This function projects the query dataset onto the principal component space of the reference dataset and uses ggplot2 to generate scatter plots for pairs of the specified principal components (e.g., PC1 vs. PC2, PC1 vs. PC3, etc.) and cell types. By default, the function considers the first ten principal components and allows the user to select specific components for detailed visualization.
When interpreting these plots, one might observe distinct clustering of cell types, indicating clear differentiation in the principal component space. For instance, if CD4+ and CD8+ T cells form separate clusters, it suggests that the PCA effectively captures the differences between these cell types. Conversely, if the clusters overlap significantly, it could imply that the chosen principal components do not fully capture the distinctions between these cell types.
# Run PCA on the reference data
ref_data_subset <- runPCA(ref_data_subset)
# Plot the PCs data
visualizeCellTypePCA(query_data = query_data_subset, reference_data = ref_data_subset,
query_cell_type_col = "labels",
ref_cell_type_col = "reclustered.broad",
pc_subset = c(1:6))
The B and plasma cell types from the reference and query datasets do not overlap well in the PCA space. This lack of overlap could indicate discrepancies between the datasets, such as poor classification by SingleR, differences in data quality, technical artifacts, or biological variations that were not accounted for.
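Projecting a query dataset onto a reference PCA space amounts to centering the query with the reference gene means and multiplying by the reference loadings. The sketch below shows this with `prcomp` on simulated matrices; it illustrates the general technique and is not taken from the package's source code.

```r
set.seed(4)
gene_names <- paste0("gene", 1:10)
# Toy log-expression matrices with cells in rows and genes in columns (simulated)
ref_mat <- matrix(rnorm(50 * 10), nrow = 50, dimnames = list(NULL, gene_names))
query_mat <- matrix(rnorm(30 * 10), nrow = 30, dimnames = list(NULL, gene_names))

# Fit PCA on the reference only
ref_pca <- prcomp(ref_mat, center = TRUE, scale. = FALSE)

# Project the query: center with the reference gene means,
# then apply the reference loadings for the first three PCs
query_centered <- sweep(query_mat, 2, ref_pca$center, "-")
query_proj <- query_centered %*% ref_pca$rotation[, 1:3]

# Sanity check: projecting the reference itself reproduces its PCA scores
ref_proj <- sweep(ref_mat, 2, ref_pca$center, "-") %*% ref_pca$rotation[, 1:3]
all.equal(unname(ref_proj), unname(ref_pca$x[, 1:3]))
```

Because the query is centered with the reference means rather than its own, a global shift between the datasets shows up as an offset between reference and query cells in the projected space.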
The boxplotPCA function provides a detailed visualization of principal component analysis (PCA) results for different cell types across two datasets: query and reference. This function generates boxplots of principal components (PCs), allowing for comparative analysis of the distributions of the PCs across various cell types and datasets.
These boxplots allow us to observe the distribution of each principal component for each cell type across the reference and query datasets. If the boxplots for a given cell type overlap significantly between the reference and query datasets, it suggests that the PCA results are consistent across these datasets. This overlap indicates that the cell type characteristics are well captured and comparable between the datasets.
# Run PCA on the reference data
ref_data_subset <- runPCA(ref_data_subset)
# Boxplots of PCs
boxplotPCA(query_data = query_data_subset, reference_data = ref_data_subset,
query_cell_type_col = "labels",
ref_cell_type_col = "reclustered.broad",
pc_subset = c(1:6))
The boxplots for the different cell types from the reference and query datasets do not overlap well. This lack of overlap could indicate discrepancies between the datasets, such as poor classification by SingleR, differences in data quality, technical artifacts, or biological variations that were not accounted for. This outcome might prompt further investigation into the preprocessing steps or the inherent variability within these cell types, potentially leading to refinements in data normalization or correction for batch effects to achieve better alignment.
p_values <- calculateHotellingPValue(query_data = query_data_subset,
reference_data = ref_data_subset,
query_cell_type_col = "labels",
ref_cell_type_col = "reclustered.broad",
pc_subset = c(1:6))
round(p_values, 5)
#> B_and_plasma CD8 CD4
#> 0.12555 0.00000 0.00000
Performing PC regression analysis on a SingleCellExperiment object enables users to examine the relationship between a principal component (PC) from the dimension reduction slot and an independent variable of interest. By specifying the desired dependent variable as one of the principal components (e.g., “PC1”, “PC2”, etc.) and providing the corresponding independent variable from the colData slot (e.g. “cell_type”), users can explore the associations between linear structure in the single-cell gene expression dataset (reference and query) and an independent variable of interest (e.g. cell type or batch).
The function prints two diagnostic plots by default:
# Specify the dependent variables (principal components) and
# independent variable (e.g., "labels")
dep.vars <- paste0("PC", 1:12)
indep.var <- "labels"
# Perform linear regression on multiple principal components
result <- regressPC(sce = query_data,
dep.vars = dep.vars,
indep.var = indep.var)
# Print the summaries of the linear regression models and R-squared
# values
# Summaries of the linear regression models
result$regression.summaries[[1]]
#>
#> Call:
#> lm(formula = f, data = df)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -8.8247 -2.3014 -0.5169 2.0392 11.8830
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -9.2304 0.2700 -34.18 <2e-16 ***
#> IndependentCD4 6.0102 0.3242 18.54 <2e-16 ***
#> IndependentCD8 15.0112 0.3133 47.91 <2e-16 ***
#> IndependentMyeloid 8.5967 0.5901 14.57 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 3.318 on 965 degrees of freedom
#> Multiple R-squared: 0.7447, Adjusted R-squared: 0.7439
#> F-statistic: 938.4 on 3 and 965 DF, p-value: < 2.2e-16
# R-squared values
result$rsquared
#> PC1 PC2 PC3 PC4 PC5 PC6
#> 7.447307e-01 5.960960e-01 3.981486e-01 1.310660e-01 4.643732e-01 6.531101e-02
#> PC7 PC8 PC9 PC10 PC11 PC12
#> 1.442437e-01 1.570605e-03 5.569864e-03 3.624826e-03 8.987037e-05 1.299829e-03
# Variance contributions for each principal component
result$var.contributions
#> PC1 PC2 PC3 PC4 PC5 PC6
#> 8.3570722799 3.0668661511 1.1828628537 0.2984864031 0.9357801781 0.0908593046
#> PC7 PC8 PC9 PC10 PC11 PC12
#> 0.1570425702 0.0011816408 0.0040858661 0.0025194952 0.0000551812 0.0007429236
# Total variance explained
result$total.variance.explained
#> [1] 14.09755
This analysis helps uncover whether there is a systematic variation in PC values across different cell types. In the example above, we can see that the four cell types are spread out across both PC1 and PC2. Digging into the genes with high loadings on these PCs can help explain the biological or technical factors driving cellular heterogeneity. It can help identify PC dimensions that capture variation specific to certain cell types or distinguish different cellular states.
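The core of the PC regression can be sketched with `lm` on simulated data. The weighting of R-squared by the percentage of variance carried by each PC is an assumption inferred from the printed output above (for PC1, 8.357 divided by 0.7447 is roughly 11.2), not a documented formula, and `pc1_percent_var` is an assumed value.

```r
set.seed(5)
# Toy data: 90 cells from three hypothetical cell types shifted along PC1
cell_type <- factor(rep(c("B_and_plasma", "CD4", "CD8"), each = 30))
pc1 <- c(-5, 0, 5)[as.integer(cell_type)] + rnorm(90, sd = 2)

# Regress the PC on cell type; R-squared is the share of the PC's
# variance that is explained by the labels
fit <- lm(pc1 ~ cell_type)
r_squared <- summary(fit)$r.squared

# Weight by the percentage of total variance carried by this PC
# (assumed value, chosen for illustration)
pc1_percent_var <- 11.2
var_contribution <- r_squared * pc1_percent_var
c(r_squared = r_squared, var_contribution = var_contribution)
```

Summing such contributions over all regressed PCs would give a single figure comparable to the total variance explained reported above.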
Let’s look at the genes driving PC1 by ordering the rotation matrix by the absolute gene loadings for PC1:
pc_df <- attr(reducedDims(query_data)$PCA, "rotation")[, 1:5] |>
as.data.frame()
pc_df[order(abs(pc_df$PC1)), "PC1", drop = FALSE] |>
tail()
#> PC1
#> GZMA 0.1517186
#> GNLY 0.1552411
#> CST7 0.1665665
#> CCL4 0.1686632
#> CCL5 0.1815465
#> NKG7 0.2286197
PC1 is mostly driven by NKG7 (Natural Killer Cell Granule Protein 7). This gene is important in CD8+ T cells, so it makes sense that it distinguishes the cell types shown.
This analysis aims to explore the correlation patterns between different cell types in a single-cell gene expression dataset. The goal is to compare the gene expression profiles of cells from a reference dataset and a query dataset to understand the relationships and similarities between various cell types.
To perform the analysis, we start by computing the pairwise correlations between the query and reference cells for selected cell types (“CD4”, “CD8”, “B_and_plasma”). The Spearman correlation method is used here; users can also choose the Pearson correlation coefficient.
cor_matrix_avg <- calculateAveragePairwiseCorrelation(query_data = query_data_subset,
reference_data = ref_data_subset,
query_cell_type_col = "labels",
ref_cell_type_col = "reclustered.broad",
cell_types = selected_cell_types,
correlation_method = "spearman")
# Visualize the output
plot(cor_matrix_avg)
In this case, users have the flexibility to extract the gene expression profiles of specific cell types from the reference and query datasets and provide these profiles as input to the function. Additionally, they can select their own set of genes that they consider relevant for computing the pairwise correlations. For demonstration we have used common highly variable genes from reference and query dataset.
By providing their own gene expression profiles and choosing specific genes, users can focus the analysis on the cell types and genes of interest to their research question.
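As a rough illustration of the computation, the sketch below correlates every query cell with every reference cell and then averages the correlations within each pair of cell types. The data are simulated and the block-averaging scheme is an assumption about how such a matrix could be obtained; it is not the package's code.

```r
set.seed(6)
genes <- paste0("gene", 1:25)
# Toy expression: genes in rows, cells in columns (simulated values)
query_expr <- matrix(rnorm(25 * 8), nrow = 25,
                     dimnames = list(genes, paste0("q", 1:8)))
ref_expr <- matrix(rnorm(25 * 10), nrow = 25,
                   dimnames = list(genes, paste0("r", 1:10)))
query_labels <- rep(c("CD4", "CD8"), each = 4)
ref_labels <- rep(c("CD4", "CD8"), each = 5)

# Cell-by-cell Spearman correlations (query cells in rows, reference in columns)
cell_cor <- cor(query_expr, ref_expr, method = "spearman")

# Average within each (query cell type, reference cell type) block
cell_types <- c("CD4", "CD8")
avg_cor <- sapply(cell_types, function(rt) {
  sapply(cell_types, function(qt) {
    mean(cell_cor[query_labels == qt, ref_labels == rt])
  })
})
avg_cor
```

Rows of `avg_cor` correspond to query cell types and columns to reference cell types; with real data, high values on the diagonal would indicate that matching cell types resemble each other across datasets.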
This function serves to conduct an analysis of pairwise distances or correlations between cells of specific cell types within a single-cell gene expression dataset. By calculating these distances or correlations, users can gain insights into the relationships and differences in gene expression profiles between different cell types. The function facilitates this analysis by generating density plots, allowing users to visualize the distribution of distances or correlations for various pairwise comparisons.
The analysis offers the flexibility to select a particular cell type for examination, and users can choose between different distance metrics, such as “euclidean” or “manhattan,” to calculate pairwise distances.
To illustrate, the function is applied to the cell type CD4 using the Euclidean distance metric in the example below.
plotPairwiseDistancesDensity(query_data = query_data_subset,
reference_data = ref_data_subset,
query_cell_type_col = "labels",
ref_cell_type_col = "reclustered.broad",
cell_type_query = "CD4",
cell_type_reference = "CD4",
distance_metric = "euclidean")
Alternatively, users can opt for the “correlation” distance metric, which measures the similarity in gene expression profiles between cells.
To illustrate, the function is applied to the cell type CD4 using the correlation distance metric in the example below. By selecting either the “pearson” or “spearman” correlation method, users can emphasize either linear or rank-based associations, respectively.
plotPairwiseDistancesDensity(query_data = query_data_subset,
reference_data = ref_data_subset,
query_cell_type_col = "labels",
ref_cell_type_col = "reclustered.broad",
cell_type_query = "CD4",
cell_type_reference = "CD4",
distance_metric = "correlation",
correlation_method = "spearman")
By utilizing this function, users can explore the pairwise distances between query and reference cells of a specific cell type and gain insights into the distribution of distances through density plots. This analysis aids in understanding the similarities and differences in gene expression profiles for the selected cell type within the query and reference datasets.
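The quantities behind such density plots can be sketched with `dist` and `density` from base R. Splitting the distances into query-query, reference-reference, and query-reference comparisons is an assumption made for this illustration, and the expression values are simulated.

```r
set.seed(7)
# Toy expression for one cell type: cells in rows, genes in columns (simulated)
query_cells <- matrix(rnorm(6 * 15), nrow = 6)
ref_cells <- matrix(rnorm(9 * 15, mean = 0.3), nrow = 9)

# Euclidean distances between all cells of the pooled matrix
all_dist <- as.matrix(dist(rbind(query_cells, ref_cells), method = "euclidean"))
q_idx <- 1:6
r_idx <- 7:15

# Split into the three comparisons (upper triangle only within a dataset)
query_query <- all_dist[q_idx, q_idx][upper.tri(diag(6))]
ref_ref <- all_dist[r_idx, r_idx][upper.tri(diag(9))]
query_ref <- as.vector(all_dist[q_idx, r_idx])

# Kernel density estimates that could be overlaid in a single plot
densities <- lapply(list(query_query = query_query,
                         ref_ref = ref_ref,
                         query_ref = query_ref), density)
sapply(densities, function(d) round(d$bw, 3))
```

If the query-reference distances follow roughly the same distribution as the within-dataset distances, the two datasets are similar for that cell type.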
The calculateNearestNeighborProbabilities function is designed to compute the probabilities that each sample in a query dataset belongs to either the reference or the query dataset for each cell type using nearest neighbor analysis. This function is particularly useful for evaluating the consistency of cell type classification across datasets.
This function projects the query data onto the PCA space of the reference data, balances the sample sizes between the reference and query datasets through data augmentation if necessary, and calculates the probability of each cell in the query dataset belonging to the query dataset for each cell type using a nearest neighbor search.
Ideally, the probability distributions for each cell type would center around 0.5, indicating an equal likelihood of belonging to either the reference or query dataset. This balanced probability suggests that the query and reference datasets are well-matched for that cell type. Deviations from this ideal, such as bimodal distributions, may indicate that the query dataset contains additional, unaccounted-for cell types or subpopulations within that particular cell type. Such patterns can highlight the need for further investigation into the composition and quality of the query dataset, possibly revealing the presence of unexpected or mislabeled cells.
nn_output <- calculateNearestNeighborProbabilities(query_data = query_data_subset,
reference_data = ref_data_subset,
query_cell_type_col = "labels",
ref_cell_type_col = "reclustered.broad")
# Plot output
plot(nn_output)
The nearest neighbor diagnostics for CD4+ T cells yield a bimodal probability distribution, which suggests that the query dataset contains distinct subpopulations or potentially mislabeled cells within the CD4+ T cell category. Ideally, the probability distribution should center around 0.5, indicating an equal likelihood of belonging to either the reference or query dataset. A bimodal distribution, however, indicates that some CD4+ T cells in the query dataset are very similar to those in the reference, while others are distinctly different.
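The idea behind the nearest neighbor probabilities can be sketched in base R: pool reference and query cells of one cell type in PCA space and, for each query cell, record the share of its nearest neighbors that come from the query. The coordinates below are simulated, `k` is an arbitrary choice, and equal sample sizes are used so that no balancing step is needed; none of this is taken from the package's implementation.

```r
set.seed(8)
# Toy PCA coordinates for one cell type: 40 reference and 40 query cells
ref_pcs <- matrix(rnorm(40 * 3), ncol = 3)
query_pcs <- matrix(rnorm(40 * 3), ncol = 3)
pooled <- rbind(ref_pcs, query_pcs)
is_query <- rep(c(FALSE, TRUE), each = 40)

k <- 10
dist_mat <- as.matrix(dist(pooled))
diag(dist_mat) <- Inf  # a cell is not counted as its own neighbor

# For each query cell, the share of its k nearest neighbors from the query
prob_query <- vapply(which(is_query), function(i) {
  nn <- order(dist_mat[i, ])[seq_len(k)]
  mean(is_query[nn])
}, numeric(1))

summary(prob_query)
```

Since both toy datasets are drawn from the same distribution, the values scatter around 0.5; a shifted or bimodal query population would pull some of them toward 1.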
The detectAnomaly function is designed to identify anomalies in single-cell RNA sequencing (scRNA-seq) data by leveraging principal component analysis (PCA) and the isolation forest algorithm. The main purpose of this function is to project both reference and query datasets onto a common PCA space, build an isolation forest on the reference data, and use it to detect anomalies in the query data based on their PCA projections. If a query dataset is not provided, the function computes anomaly scores for the reference data itself. This process helps in distinguishing cells that deviate significantly from the expected patterns within a given cell type, thereby highlighting potential errors, rare cell types, or other interesting biological variations.
The function returns a comprehensive output, including anomaly scores for each cell, logical vectors indicating whether a cell is classified as an anomaly, and the PCA projections of both reference and query datasets. Additionally, it provides the proportion of variance explained by the retained principal components. To interpret the results, anomaly scores closer to the specified threshold (default 0.5) suggest cells that are considered borderline anomalies. Scores significantly above this threshold indicate strong anomalies. By examining these scores and their distributions, users can identify and investigate outlier cells, ensuring data quality and uncovering potential biological insights.
# Store PCA anomaly data
anomaly_output <- detectAnomaly(reference_data = ref_data_subset,
query_data = query_data_subset,
ref_cell_type_col = "reclustered.broad",
query_cell_type_col = "labels")
# Plot the anomaly output for a cell type
# plot(anomaly_output, cell_type = "CD4", pc_subset = c(1:5), data_type = "query")
The high frequency of anomalies within CD4+ T cells could suggest the presence of annotation errors or rare CD4+ T cell subpopulations. This capability allows researchers not only to assess the quality of their datasets but also to uncover potential biological insights that might otherwise go unnoticed.
The calculateSampleDistances function is designed to analyze the relationships within and between single-cell RNA sequencing (scRNA-seq) datasets by computing distances within a reference dataset and from query samples to reference samples. It uses PCA for dimensionality reduction and Euclidean distance for the distance calculations, which helps in understanding how closely the query samples resemble the reference data in the reduced PCA space.
When the resulting distance metrics are visualized, particularly if a significant spread or clustering of distances is observed, it can provide insights into the similarity or dissimilarity between the datasets. For instance, large distances from query samples to reference samples could indicate that the query data contains different or novel cell types not well-represented in the reference dataset. Conversely, small distances suggest that the query samples closely match the reference cell types, implying a good overlap between the datasets. This functionality is crucial for assessing annotation quality, quality control, identification of outliers, and further biological interpretation, particularly when analyzing heterogeneous cell populations.
# Compute sample distance data
distance_data <- calculateSampleDistances(query_data = query_data_subset,
                                          reference_data = ref_data_subset,
                                          query_cell_type_col = "labels",
                                          ref_cell_type_col = "reclustered.broad")
# Store top six anomalies for CD4
cd4_top6_anomalies <- names(sort(anomaly_output$CD4$query_anomaly_scores, decreasing = TRUE)[1:6])
# Plot the densities of the distances from the anomalies to the CD4 reference cells
plot(distance_data, ref_cell_type = "CD4", sample_names = cd4_top6_anomalies)
# Plot the densities of the distances from the anomalies to the CD8 reference cells
plot(distance_data, ref_cell_type = "CD8", sample_names = cd4_top6_anomalies)
The top six CD4+ T cell anomalies identified by detectAnomaly show minimal overlap with the reference CD4+ T cell data in the distance density plots. Instead, they overlap substantially more with the reference data for CD8+ T cells. This pattern suggests that the anomalous CD4+ T cells may have characteristics closer to those of CD8+ T cells, possibly indicating a misclassification or a transitional cell state. This finding could have important biological implications, such as uncovering a subset of CD4+ T cells undergoing differentiation into CD8+ T cells, or identifying a distinct cell population that shares features of both T cell types. Such insights matter for understanding cell identity and plasticity in immune responses, and they illustrate the utility of anomaly detection in single-cell RNA sequencing data analysis.
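As a rough illustration of the idea behind calculateSampleDistances, the following base R sketch fits a PCA on simulated reference data, projects simulated query cells into that space, and computes Euclidean distances. The simulated matrices, the choice of five PCs, and the helper `dist_to_ref` are assumptions made for illustration; this is not the package's implementation.

```r
set.seed(42)

# Simulated expression matrices (cells x genes); purely illustrative data
n_genes <- 20
gene_names <- paste0("gene", seq_len(n_genes))
ref <- matrix(rnorm(100 * n_genes), nrow = 100,
              dimnames = list(NULL, gene_names))
query <- matrix(rnorm(10 * n_genes, mean = 0.5), nrow = 10,
                dimnames = list(NULL, gene_names))

# Fit the PCA on the reference only
ref_pca <- prcomp(ref, center = TRUE, scale. = FALSE)
n_pcs <- 5
ref_scores <- ref_pca$x[, seq_len(n_pcs)]

# Project the query cells into the reference PCA space
query_scores <- predict(ref_pca, newdata = query)[, seq_len(n_pcs)]

# Hypothetical helper: Euclidean distance from one cell to every reference cell
dist_to_ref <- function(cell, ref_scores) {
  sqrt(rowSums(sweep(ref_scores, 2, cell)^2))
}
query_to_ref <- dist_to_ref(query_scores[1, ], ref_scores)

# Pairwise distances within the reference, for comparison
ref_to_ref <- as.vector(dist(ref_scores))

summary(query_to_ref)
summary(ref_to_ref)
```

Comparing the two summaries (or their densities) gives a feel for whether a query cell sits inside or outside the range of distances typically seen among reference cells.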
The calculateSampleDistancesSimilarity function calculates Bhattacharyya coefficients and Hellinger distances to quantify the similarity between density distributions of query samples and reference data.
# Get overlap measures
overlap_measures <- calculateSampleDistancesSimilarity(query_data = query_data_subset, reference_data = ref_data_subset,
sample_names = cd4_top6_anomalies,
query_cell_type_col = "labels",
ref_cell_type_col = "reclustered.broad")
overlap_measures$bhattacharyya_coef
#> Sample B_and_plasma CD8 CD4
#> 1 GCATGCGCATCACGAT-1 0.5059811 0.2791036 0.2780796
#> 2 CTCATTAAGATGTGTA-1 0.4673593 0.4402312 0.3126375
#> 3 TCTATTGCAATAAGCA-1 0.4226011 0.4742580 0.4541647
#> 4 GTTACAGCAGCTGTGC-1 0.4980591 0.1517150 0.1516096
#> 5 GGACATTGTTGAACTC-1 0.4744637 0.4765461 0.3657990
#> 6 GACTAACGTCGAACAG-1 0.3016021 0.7238643 0.8293373
overlap_measures$hellinger_dist
#> Sample B_and_plasma CD8 CD4
#> 1 GCATGCGCATCACGAT-1 0.7028648 0.8490562 0.8496590
#> 2 CTCATTAAGATGTGTA-1 0.7298224 0.7481770 0.8290733
#> 3 TCTATTGCAATAAGCA-1 0.7598677 0.7250807 0.7388067
#> 4 GTTACAGCAGCTGTGC-1 0.7084779 0.9210239 0.9210811
#> 5 GGACATTGTTGAACTC-1 0.7249388 0.7235012 0.7963674
#> 6 GACTAACGTCGAACAG-1 0.8357020 0.5254861 0.4131135
The calculateSampleDistancesSimilarity function summarizes numerically the overlap for the anomalous CD4+ T cells identified using detectAnomaly. A Bhattacharyya coefficient close to 1, or equivalently a Hellinger distance close to 0, indicates that the density distribution for a query cell closely matches that of a reference cell type. In the output above, five of the six anomalous CD4+ T cells have a Bhattacharyya coefficient against the reference CD8+ T cells that is at least as high as against the reference CD4+ T cells; the exception is GACTAACGTCGAACAG-1, which overlaps most with CD4. Three of the cells (GCATGCGCATCACGAT-1, CTCATTAAGATGTGTA-1, and GTTACAGCAGCTGTGC-1) overlap most with the B and plasma cell reference.
The Hellinger distances show the same ordering, with lower values wherever the Bhattacharyya coefficient is higher. Such findings suggest that these CD4+ T cells might be misclassified, potentially representing a transitional cell state or a subset with characteristics that align more closely with another cell type. This analysis underscores the value of combining anomaly detection with robust similarity measures when examining single-cell RNA sequencing data.
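To make the two measures concrete, here is a minimal base R sketch that estimates a Bhattacharyya coefficient from kernel density estimates and converts it to a Hellinger distance. The simulated samples and the helper names `bhattacharyya_coef` and `hellinger_dist` are assumptions for illustration, and the density estimation details in the package may differ. The relation H = sqrt(1 - BC) used here is consistent with the values printed above.

```r
set.seed(1)

# Illustrative distance samples (not taken from the vignette data)
query_dist <- rnorm(500, mean = 5, sd = 1)
ref_dist_similar <- rnorm(500, mean = 5.2, sd = 1)
ref_dist_distant <- rnorm(500, mean = 12, sd = 1)

# Hypothetical helper: Bhattacharyya coefficient via kernel density estimates
bhattacharyya_coef <- function(x, y, n_grid = 512) {
  rng <- range(c(x, y))
  pad <- 0.1 * diff(rng)
  # Evaluate both densities on the same grid
  dx <- density(x, from = rng[1] - pad, to = rng[2] + pad, n = n_grid)
  dy <- density(y, from = rng[1] - pad, to = rng[2] + pad, n = n_grid)
  step <- dx$x[2] - dx$x[1]
  # Normalise so that each density integrates to 1 on the grid
  p <- dx$y / sum(dx$y * step)
  q <- dy$y / sum(dy$y * step)
  sum(sqrt(p * q)) * step
}

# Hellinger distance derived from the Bhattacharyya coefficient
hellinger_dist <- function(bc) sqrt(pmax(0, 1 - bc))

bc_similar <- bhattacharyya_coef(query_dist, ref_dist_similar)
bc_distant <- bhattacharyya_coef(query_dist, ref_dist_distant)

c(bc_similar = bc_similar, bc_distant = bc_distant)
hellinger_dist(c(bc_similar, bc_distant))
```

Overlapping samples give a coefficient near 1 and a distance near 0, while well-separated samples give the opposite, which is the reading applied to the tables above.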
The calculateSampleSimilarityPCA function computes the cosine similarity between samples and principal components (PCs), based on the PCA loadings. Anomalous cells, such as those flagged by an anomaly detection method, may have substantial effects on PCs with small eigenvalues. The function makes it possible to examine how anomalous samples align with individual PCs, potentially revealing aberrant expression patterns and aiding in the identification of outlier cells in scRNA-seq datasets.
# Store PCA anomaly data and plots
anomaly_output <- detectAnomaly(reference_data = ref_data_subset,
query_data = query_data_subset,
ref_cell_type_col = "reclustered.broad",
query_cell_type_col = "labels")
top6_anomalies <- names(sort(anomaly_output$Combined$reference_anomaly_scores, decreasing = TRUE)[1:6])
# Compute cosine similarity between anomalies and PCs
cosine_similarities <- calculateSampleSimilarityPCA(ref_data_subset, samples = top6_anomalies,
pc_subset = c(1:50), n_top_vars = 50)
# Plot similarities
plot(cosine_similarities, pc_subset = c(15:25))
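The underlying calculation can be sketched in base R as the cosine of the angle between a centered expression profile and each PC loading vector. The simulated data and the helper `cosine_to_pcs` are assumptions for illustration; unlike the package function, this sketch uses all genes rather than only the top loading variables.

```r
set.seed(7)

n_cells <- 60
n_genes <- 8
expr <- matrix(rnorm(n_cells * n_genes), nrow = n_cells,
               dimnames = list(paste0("cell", seq_len(n_cells)),
                               paste0("gene", seq_len(n_genes))))
# Turn one cell into an artificial outlier along a single gene
expr["cell1", "gene8"] <- 10

pca <- prcomp(expr, center = TRUE, scale. = FALSE)
loadings <- pca$rotation  # genes x PCs; each column has unit length

# Hypothetical helper: cosine similarity between one cell and every PC
cosine_to_pcs <- function(cell_name) {
  v <- expr[cell_name, ] - pca$center
  as.vector(crossprod(loadings, v)) / sqrt(sum(v^2))
}

cos_sim <- t(sapply(c("cell1", "cell2"), cosine_to_pcs))
colnames(cos_sim) <- colnames(loadings)
round(cos_sim, 2)
```

Because the loading vectors form an orthonormal basis, the squared cosines of each cell sum to 1 across all PCs, so a large absolute value on one PC means the cell's deviation from the mean is concentrated along that component.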
The calculateDiscriminantSpace function projects query single-cell RNA-seq data onto the discriminant space defined by the reference data. It then assesses the similarity between the query and reference projections using cosine similarity and Mahalanobis distance.
The procedure is carried out separately for each pairwise combination of cell types.
disc_output <- calculateDiscriminantSpace(reference_data = ref_data_subset,
query_data = query_data_subset,
query_cell_type_col = "labels",
ref_cell_type_col = "reclustered.broad")
# Generate scatterplot
plot(disc_output, plot_type = "scatterplot")
# Generate boxplot for the CD4-CD8 comparison
plot(disc_output, cell_types = "CD4-CD8", plot_type = "boxplot")
The scatterplot provides compelling evidence of an overlap between CD4+ and CD8+ cells, indicating a potential misannotation. Furthermore, the boxplot reveals a noticeable incongruity between the query and reference data in the discriminant space specifically for CD4+ T cells.
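For readers who want to see the general idea of a pairwise discriminant projection, the sketch below uses `MASS::lda` as a stand-in. This is a swapped-in technique chosen for brevity and not necessarily how calculateDiscriminantSpace builds its discriminant space. The simulated data, the labels `typeA` and `typeB`, and the helper `maha_to_type` are all assumptions for illustration.

```r
library(MASS)  # recommended package shipped with R; provides lda()
set.seed(3)

# Simulated reference cells from two cell types in a 3-dimensional PCA space
n <- 100
ref <- rbind(matrix(rnorm(n * 3, mean = 0), ncol = 3),
             matrix(rnorm(n * 3, mean = 3), ncol = 3))
colnames(ref) <- paste0("PC", 1:3)
ref_labels <- factor(rep(c("typeA", "typeB"), each = n))

# Simulated query cells that resemble typeB
query <- matrix(rnorm(20 * 3, mean = 3), ncol = 3)
colnames(query) <- colnames(ref)

# Discriminant axis learned from the reference pair only
fit <- lda(ref, grouping = ref_labels)
ref_ld <- predict(fit, ref)$x[, 1]
query_ld <- predict(fit, query)$x[, 1]

# Hypothetical helper: squared Mahalanobis distance of the projected query
# cells to one reference cell type along the discriminant axis
maha_to_type <- function(type) {
  ref_type <- ref_ld[ref_labels == type]
  mahalanobis(matrix(query_ld, ncol = 1),
              center = mean(ref_type),
              cov = matrix(var(ref_type)))
}
maha <- sapply(c("typeA", "typeB"), maha_to_type)
colMeans(maha)
```

Query cells that land much closer to the other cell type of a pair than to their assigned type are candidates for the kind of misannotation discussed above.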
The comparePCA function is designed to compare the principal components (PCs) obtained from separate PCA analyses on reference and query datasets for a specific cell type. It allows users to assess the similarity between the PCs using either cosine similarity or correlation metrics.
This function computes cosine similarities or correlations between the loadings of top variables for each pair of principal components from the reference and query datasets. It first extracts the PCA rotation matrices from both datasets and identifies the top variables with the highest loadings for each PC. Then, it computes the similarity values between the loadings of top variables for each pair of PCs. The resulting matrix contains similarity values, where rows represent reference PCs and columns represent query PCs.
# Selecting highly variable genes
ref_var <- getTopHVGs(ref_data, n = 500)
query_var <- getTopHVGs(query_data, n = 500)
# Intersect the gene symbols to obtain common genes
common_genes <- intersect(ref_var, query_var)
ref_data_subset <- ref_data[common_genes, ]
query_data_subset <- query_data[common_genes, ]
# Subset reference and query data for a specific cell type
ref_data_subset <- ref_data_subset[, which(ref_data_subset$reclustered.broad == "CD4")]
query_data_subset <- query_data_subset[, which(query_data_subset$reclustered.broad == "CD4")]
# Run PCA on the reference and query datasets separately
ref_data_subset <- runPCA(ref_data_subset, ncomponents = 50)
query_data_subset <- runPCA(query_data_subset, ncomponents = 50)
# Call the PCA comparison function
similarity_mat <- comparePCA(reference_data = ref_data_subset,
                             query_data = query_data_subset,
                             pc_subset = c(1:5),
                             n_top_vars = 50,
                             metric = "cosine",                # or "correlation"
                             correlation_method = "spearman")  # or "pearson"
# Create the heatmap
plot(similarity_mat)
The comparePCA function provides insights into the consistency or discrepancy between the PCA results of reference and query datasets. By inspecting the resulting similarity matrix, users can discern patterns of similarity or dissimilarity in the principal components across datasets. This insight can help evaluate the quality of the reference dataset and how well the PCA captures similar structures in the query dataset.
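A minimal base R sketch of the cosine metric follows: two PCAs are run separately on simulated datasets that share one dominant gene program, and the loadings of the leading PCs are compared. The simulation, the helper `simulate_cells`, and the use of all genes (rather than the top loading variables selected by `n_top_vars`) are assumptions for illustration.

```r
set.seed(11)

n_genes <- 10
# Shared gene program (unit-length weight vector) present in both datasets
w <- seq_len(n_genes) / sqrt(sum(seq_len(n_genes)^2))

# Hypothetical helper: simulate cells driven by the shared program plus noise
simulate_cells <- function(n_cells) {
  z <- rnorm(n_cells, sd = 5)
  x <- outer(z, w) + matrix(rnorm(n_cells * n_genes), nrow = n_cells)
  colnames(x) <- paste0("gene", seq_len(n_genes))
  x
}

# Separate PCAs on the reference and query data
ref_pca <- prcomp(simulate_cells(200))
query_pca <- prcomp(simulate_cells(150))

k <- 3
ref_load <- ref_pca$rotation[, seq_len(k)]
query_load <- query_pca$rotation[, seq_len(k)]

# Loading vectors have unit length, so their cross-product is the cosine similarity
similarity <- crossprod(ref_load, query_load)
dimnames(similarity) <- list(paste0("Ref PC", 1:k), paste0("Query PC", 1:k))
round(similarity, 2)
```

The sign of a PC is arbitrary, so it is the absolute value of each entry that indicates how well a reference PC matches a query PC.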
The comparePCASubspace function first computes the cosine similarity between the loadings of the top variables for each PC in both the reference and query datasets. It then selects the top cosine similarity scores and their corresponding PC indices. Additionally, the function calculates the average percentage of variance explained by the selected top PCs. Finally, it computes a weighted cosine similarity score based on the top cosine similarities and the average percentage of variance explained.
The comparePCASubspace function provides insights into the similarity between the subspaces spanned by the top PCs of the reference and query datasets. By comparing the weighted cosine similarity score, users can assess how closely aligned the subspaces are. This insight is valuable for understanding the consistency of the principal components between datasets and can aid in assessing the quality of data integration or comparative analysis.
subspace_comparison <- comparePCASubspace(reference_data = ref_data_subset, query_data = query_data_subset,
pc_subset = c(1:5))
# Plot output for PCA subspace comparison
plot(subspace_comparison)
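One way such a weighted score could be assembled is sketched below in base R. The greedy one-to-one matching of PCs and the normalisation by the total weight are assumptions made for this illustration, as are the simulated data and helper names; the exact matching and weighting used by comparePCASubspace may differ.

```r
set.seed(21)

n_genes <- 10
w <- seq_len(n_genes) / sqrt(sum(seq_len(n_genes)^2))

# Hypothetical helper: cells driven by a shared gene program plus noise
simulate_cells <- function(n_cells) {
  z <- rnorm(n_cells, sd = 5)
  outer(z, w) + matrix(rnorm(n_cells * n_genes), nrow = n_cells)
}
ref_pca <- prcomp(simulate_cells(200))
query_pca <- prcomp(simulate_cells(150))

k <- 3
# Absolute cosine similarity between reference PCs (rows) and query PCs (columns)
cos_mat <- abs(crossprod(ref_pca$rotation[, 1:k], query_pca$rotation[, 1:k]))

# Percentage of variance explained by each PC
pct_var <- function(pca) 100 * pca$sdev^2 / sum(pca$sdev^2)
ref_var <- pct_var(ref_pca)
query_var <- pct_var(query_pca)

# Greedy matching (assumption): repeatedly take the largest remaining cosine
ref_pc <- integer(k)
query_pc <- integer(k)
cosine <- numeric(k)
remaining <- cos_mat
for (i in seq_len(k)) {
  best <- which(remaining == max(remaining), arr.ind = TRUE)[1, ]
  ref_pc[i] <- best[["row"]]
  query_pc[i] <- best[["col"]]
  cosine[i] <- cos_mat[ref_pc[i], query_pc[i]]
  remaining[ref_pc[i], ] <- -Inf
  remaining[, query_pc[i]] <- -Inf
}

# Weight each matched pair by its average percentage of variance explained
weights <- (ref_var[ref_pc] + query_var[query_pc]) / 2
weighted_score <- sum(cosine * weights) / sum(weights)

data.frame(ref_pc, query_pc, cosine = round(cosine, 2), weight = round(weights, 1))
weighted_score
```

Weighting by variance explained means that agreement on the dominant PCs contributes far more to the score than agreement on minor components.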
The compareCCA function is a versatile tool designed to compare the subspaces defined by the top principal components (PCs) in two distinct datasets using Canonical Correlation Analysis (CCA). Its primary functionality lies in assessing the similarity between the structures represented by these PCs, providing researchers with a quantitative measure of the concordance between high-dimensional datasets. By focusing on the top loading variables for each PC, the function allows for a targeted examination of the most informative features driving the observed variation, enhancing the interpretability of the comparative analysis.
cca_comparison <- compareCCA(reference_data = ref_data_subset, query_data = query_data_subset,
pc_subset = c(1:5))
# Plot output for CCA comparison
plot(cca_comparison)
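The general idea of comparing PC subspaces with canonical correlations can be tried out with `stats::cancor` from base R. The sketch below treats the loadings of the leading PCs from two simulated datasets as the two variable sets; the simulation and the helper `simulate_cells` are assumptions for illustration, and compareCCA may prepare its inputs differently.

```r
set.seed(5)

n_genes <- 10
w <- seq_len(n_genes) / sqrt(sum(seq_len(n_genes)^2))

# Hypothetical helper: cells driven by a shared gene program plus noise
simulate_cells <- function(n_cells) {
  z <- rnorm(n_cells, sd = 5)
  outer(z, w) + matrix(rnorm(n_cells * n_genes), nrow = n_cells)
}
ref_pca <- prcomp(simulate_cells(200))
query_pca <- prcomp(simulate_cells(150))

k <- 3
# Rows are genes, columns are the loadings of the top k PCs in each dataset
cca <- cancor(ref_pca$rotation[, 1:k], query_pca$rotation[, 1:k])

# Canonical correlations, sorted from largest to smallest
round(cca$cor, 3)
```

Canonical correlations near 1 indicate directions that the two sets of loadings share almost exactly, while smaller values indicate structure present in only one dataset.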
In this analysis, we have demonstrated the capabilities of the scDiagnostics package for assessing the appropriateness of cell assignments in single-cell gene expression profiles. By utilizing various diagnostic functions and visualization techniques, we have explored different aspects of the data, including total UMI counts, annotation scores, gene expression distributions, dimensional reduction plots, gene set scores, pairwise correlations, pairwise distances, and linear regression analysis.
Through the scatter plots, histograms, and dimensional reduction plots, we were able to gain insights into the relationships between gene expression patterns, cell types, and the distribution of cells in a reduced-dimensional space. The examination of gene expression distributions, gene sets, and pathways allowed us to explore the functional landscape and identify subpopulations with distinct characteristics within the dataset. Additionally, the pairwise correlation and distance analyses provided a deeper understanding of the similarities and differences between cell types, highlighting potential relationships and patterns.
R version 4.4.0 RC (2024-04-16 r86468)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS
Matrix products: default
BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: America/New_York
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] celldex_1.15.0 AUCell_1.27.0
[3] SingleR_2.7.1 scran_1.33.0
[5] scater_1.33.2 ggplot2_3.5.1
[7] scuttle_1.15.0 scRNAseq_2.19.0
[9] SingleCellExperiment_1.27.2 SummarizedExperiment_1.35.0
[11] Biobase_2.65.0 GenomicRanges_1.57.1
[13] GenomeInfoDb_1.41.1 IRanges_2.39.0
[15] S4Vectors_0.43.0 BiocGenerics_0.51.0
[17] MatrixGenerics_1.17.0 matrixStats_1.3.0
[19] scDiagnostics_0.99.6 BiocStyle_2.33.1
loaded via a namespace (and not attached):
[1] BiocIO_1.15.0 bitops_1.0-7
[3] filelock_1.0.3 tibble_3.2.1
[5] R.oo_1.26.0 graph_1.83.0
[7] XML_3.99-0.16.1 lifecycle_1.0.4
[9] httr2_1.0.1 edgeR_4.3.4
[11] isotree_0.6.1-1 lattice_0.22-6
[13] ensembldb_2.29.0 alabaster.base_1.5.3
[15] magrittr_2.0.3 limma_3.61.2
[17] sass_0.4.9 rmarkdown_2.27
[19] jquerylib_0.1.4 yaml_2.3.8
[21] metapod_1.13.0 RColorBrewer_1.1-3
[23] DBI_1.2.3 abind_1.4-5
[25] zlibbioc_1.51.1 R.utils_2.12.3
[27] AnnotationFilter_1.29.0 RCurl_1.98-1.14
[29] rappdirs_0.3.3 GenomeInfoDbData_1.2.12
[31] ggrepel_0.9.5 irlba_2.3.5.1
[33] alabaster.sce_1.5.1 annotate_1.83.0
[35] dqrng_0.4.1 DelayedMatrixStats_1.27.1
[37] codetools_0.2-20 DelayedArray_0.31.3
[39] tidyselect_1.2.1 UCSC.utils_1.1.0
[41] farver_2.1.2 ScaledMatrix_1.13.0
[43] viridis_0.6.5 BiocFileCache_2.13.0
[45] GenomicAlignments_1.41.0 jsonlite_1.8.8
[47] BiocNeighbors_1.23.0 tools_4.4.0
[49] Rcpp_1.0.12 glue_1.7.0
[51] gridExtra_2.3 SparseArray_1.5.10
[53] ranger_0.16.0 xfun_0.45
[55] dplyr_1.1.4 HDF5Array_1.33.3
[57] gypsum_1.1.6 withr_3.0.0
[59] BiocManager_1.30.23 fastmap_1.2.0
[61] rhdf5filters_1.17.0 bluster_1.15.0
[63] fansi_1.0.6 digest_0.6.35
[65] rsvd_1.0.5 Hotelling_1.0-8
[67] R6_2.5.1 colorspace_2.1-0
[69] RSQLite_2.3.7 R.methodsS3_1.8.2
[71] utf8_1.2.4 generics_0.1.3
[73] corpcor_1.6.10 data.table_1.15.4
[75] rtracklayer_1.65.0 httr_1.4.7
[77] S4Arrays_1.5.1 pkgconfig_2.0.3
[79] gtable_0.3.5 blob_1.2.4
[81] XVector_0.45.0 htmltools_0.5.8.1
[83] bookdown_0.39 GSEABase_1.67.0
[85] ProtGenerics_1.37.0 scales_1.3.0
[87] alabaster.matrix_1.5.4 png_0.1-8
[89] knitr_1.47 rjson_0.2.21
[91] curl_5.2.1 cachem_1.1.0
[93] rhdf5_2.49.0 BiocVersion_3.20.0
[95] parallel_4.4.0 vipor_0.4.7
[97] AnnotationDbi_1.67.0 restfulr_0.0.15
[99] pillar_1.9.0 grid_4.4.0
[101] alabaster.schemas_1.5.0 vctrs_0.6.5
[103] BiocSingular_1.21.1 dbplyr_2.5.0
[105] beachmat_2.21.3 xtable_1.8-4
[107] cluster_2.1.6 beeswarm_0.4.0
[109] evaluate_0.24.0 magick_2.8.3
[111] tinytex_0.51 GenomicFeatures_1.57.0
[113] cli_3.6.3 locfit_1.5-9.9
[115] compiler_4.4.0 Rsamtools_2.21.0
[117] rlang_1.1.4 crayon_1.5.3
[119] labeling_0.4.3 ggbeeswarm_0.7.2
[121] alabaster.se_1.5.2 viridisLite_0.4.2
[123] BiocParallel_1.39.0 munsell_0.5.1
[125] Biostrings_2.73.1 lazyeval_0.2.2
[127] Matrix_1.7-0 ExperimentHub_2.13.0
[129] sparseMatrixStats_1.17.2 bit64_4.0.5
[131] Rhdf5lib_1.27.0 KEGGREST_1.45.1
[133] statmod_1.5.0 alabaster.ranges_1.5.2
[135] highr_0.11 AnnotationHub_3.13.0
[137] igraph_2.0.3 memoise_2.0.1
[139] bslib_0.7.0 bit_4.0.5