coFAST: NSCLC CosMx data coembedding

Wei Liu

2025-11-12

This vignette introduces the coFAST workflow for the analysis of NSCLC CosMx spatial transcriptomics dataset. In this vignette, the workflow of coFAST consists of three steps

Load and view data

We demonstrate the use of coFAST to NSCLC data, which can be downloaded to the current working path by the following command:

set.seed(2024) # set a random seed for reproducibility.
library(coFAST) # load the package of coFAST method
library(Seurat)
data(CosMx_subset)
CosMx_subset

Preprocessing

First, we normalize the data.

CosMx_subset <- NormalizeData(CosMx_subset)

Then, we select the variable genes.

CosMx_subset <- FindVariableFeatures(CosMx_subset)

Coembedding using coFAST

We introduce how to use coFAST to perform coembedding for this CosMx data. First, we determine the dimension of coembeddings. Then, we select the variable genes.

dat_cor <- diagnostic.cor.eigs(CosMx_subset)
q_est <- attr(dat_cor, "q_est")
cat("q_est = ", q_est, '\n')

Subsequently, we calculate coembeddings by utilizing coFAST, and observe that the reductions field acquires an additional component named fast.

pos <- as.matrix(CosMx_subset@meta.data[,c("x", "y")]) # Extract the spatial coordinates
Adj_sp <- AddAdj(pos) ## calculate the adjacency matrix
CosMx_subset <- coFAST(CosMx_subset, Adj_sp = Adj_sp, q = q_est)
CosMx_subset

Downstream analysis

Clustering and assess the cluster significance

First, we show how to use the spot embedding for spatial domain segmentation. User can set the number of clusters as fixed value, i.e., K=10; or set K=NULL using the default resolution. The cluster results are in the cofast.cluster column of data.frame meta.data slot. The Idents of Seurat object is also set as cofast.cluster.

CosMx_subset <- AddCluster(CosMx_subset)
print(head(CosMx_subset))
table(Idents(CosMx_subset))

We visualize the clusters on spatial coordinates.

CosMx_subset <- Addcoord2embed(CosMx_subset, coord.name = c("x", "y"))
print(CosMx_subset)
cols_cluster <- PRECAST::chooseColors(palettes_name = 'Classic 20', n_colors=20, plot_colors = TRUE)
# DimPlot(CosMx_subset, reduction='Spatial', cols=cols_cluster, pt.size = 2)
DimPlot(CosMx_subset, reduction='Spatial', group.by = c("cofast.cluster", 'cell_type'), cols=cols_cluster, pt.size = 1.5)

Next, we assess the aggregation scores of each cluster on spatial space and embedding space. We calculated the spatial aggregation scores and found that cluster 8 (corresponding to tumor) has the highest spatial aggregation score, which is concordance with the spatial clustering map.

dat.spa.score <- AggregationScore(CosMx_subset, reduction.name = 'Spatial')
print(dat.spa.score)

Next, we calculated the embedding aggregation scores and found that cluster 8 also has the highest embedding aggregation score.

dat.embd.score <- AggregationScore(CosMx_subset, reduction.name = 'cofast')
print(dat.embd.score)

In the following, we show how to find the signature genes based on comebeddings. First, we calculate the distance matrix.

CosMx_subset <- pdistance(CosMx_subset, reduction = "cofast")

Next, we find the signature genes for each cell type

print(table(Idents(CosMx_subset)))
#Idents(CosMx_subset) <- CosMx_subset$cell_type
df_sig_list <- find.signature.genes(CosMx_subset)
str(df_sig_list)

Then, we obtain the top five signature genes and organize them into a data.frame. Next, we calculate the UMAP projections of coembeddings. The colname distance means the distance between gene (i.e., IL7R) and cells with the specific cluster (i.e., cluster 1), which is calculated based on the coembedding of genes and cells in the coembedding space. The distance is smaller, the association between gene and the cell type is stronger. The colname expr.prop represents the expression proportion of the gene (i.e., IL7R) within the cell type (i.e., tumor). The colname label means the cluster and colname gene denotes the gene name. By the data.frame object, we know IL7R is the one of the top signature gene of cluster 1.

dat <- get.top.signature.dat(df_sig_list, ntop = 2, expr.prop.cutoff = 0.1)
head(dat)

Next, we calculate the UMAP projections of coembeddings of cells and the selected signature genes.

CosMx_subset <- coembedding_umap(
  CosMx_subset, reduction = "cofast", reduction.name = "UMAP",
  gene.set = unique(dat$gene))

Furthermore, we visualize the cells and top two signature genes of tumor in the UMAP space of coembedding. We observe that the UMAP projections of the two signature genes are near to tumor, which indicates these genes are enriched in tumor.

## choose beutifual colors
cols_cluster2 <- c("black", cols_cluster)
p1 <- coembed_plot(
   CosMx_subset, reduction = "UMAP",
   gene_txtdata = subset(dat, label=='8'),
   cols=cols_cluster2, pt_text_size = 3)
p1

Then, we visualize the cells and top two signature genes of all involved cell types in the UMAP space of coembedding. We observe that the UMAP projections of the signature genes are near to the corresponding cell type, which indicates these genes are enriched in the corresponding cells.

p2 <- coembed_plot(
   CosMx_subset, reduction = "UMAP",
   gene_txtdata = dat, cols=cols_cluster2, 
   pt_text_size = 3, alpha=0.2)
p2

In addtion, we can fully take advantages of the visualization functions in Seurat package for visualization. The following is an example that visualizes the cell types on the UMAP space.

DimPlot(CosMx_subset, reduction = 'UMAP', cols=cols_cluster)

Then, there is another example that we plot the first two signature genes of Tumor 5 on UMAP space, in which we observed the high expression in tumor in constrast to other cell types.

FeaturePlot(CosMx_subset, reduction = 'UMAP', features = c("PSCA", "CEACAM6"))
Session Info
sessionInfo()
#> R version 4.4.1 (2024-06-14 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26100)
#> 
#> Matrix products: default
#> 
#> 
#> locale:
#> [1] LC_COLLATE=C                               
#> [2] LC_CTYPE=Chinese (Simplified)_China.utf8   
#> [3] LC_MONETARY=Chinese (Simplified)_China.utf8
#> [4] LC_NUMERIC=C                               
#> [5] LC_TIME=Chinese (Simplified)_China.utf8    
#> 
#> time zone: Asia/Shanghai
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] digest_0.6.37     R6_2.5.1          fastmap_1.2.0     xfun_0.47        
#>  [5] cachem_1.1.0      knitr_1.48        htmltools_0.5.8.1 rmarkdown_2.28   
#>  [9] lifecycle_1.0.4   cli_3.6.3         sass_0.4.9        jquerylib_0.1.4  
#> [13] compiler_4.4.1    rstudioapi_0.16.0 tools_4.4.1       evaluate_1.0.0   
#> [17] bslib_0.8.0       yaml_2.3.10       rlang_1.1.4       jsonlite_1.8.9