How to create supercells

0.1 Introduction
0.2 Installation
0.3 Preparing your dataset
0.4 Creating supercells
0.5 Running runSuperCellCyto in parallel
0.6 Controlling supercells granularity
- 0.6.1 Adjusting gamma value after one run of runSuperCellCyto
- 0.6.2 Specifying different gamma value for different samples
0.7 Mixing cells from different samples in a supercell
0.8 I have more cells than RAM in my computer
0.9 Session information

0.1 Introduction

This vignette describes the steps to generate supercells for cytometry data using the SuperCellCyto R package.

Briefly, supercells are “mini” clusters of cells that are similar in their marker expressions. The motivation behind supercells is that instead of analysing millions of individual cells, you can analyse thousands of supercells, making downstream analysis much faster while maintaining biological interpretability.

See the other SuperCellCyto vignettes for related workflows.

0.2 Installation

You can install the stable version of SuperCellCyto from Bioconductor using:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("SuperCellCyto")

For the latest development version, you can install it from GitHub using pak:

if (!requireNamespace("pak", quietly = TRUE))
    install.packages("pak")
pak::pkg_install("phipsonlab/SuperCellCyto")

0.3 Preparing your dataset

The function that creates supercells is called runSuperCellCyto. It operates on a data.table object, an enhanced version of R’s native data.frame.

In addition to the data being stored in a data.table object, runSuperCellCyto requires that:

- The markers used to create supercells have been appropriately transformed, typically using either an arcsinh transformation or linear binning (using FlowJo). runSuperCellCyto does not perform any data transformation or scaling.
- The object has a column denoting the unique ID of each cell. You will most likely have to create this column yourself; it can simply be a numerical value ranging from 1 to however many cells you have in your data.
- The object has a column denoting the biological sample each cell comes from. This column is critical to ensure that cells from different samples are not mixed in a supercell.

If you are not sure how to import CSV or FCS files into a data.table object, and/or how to subsequently prepare the object for SuperCellCyto, please consult the package’s vignette on data preparation. That vignette also explains why we need the cell ID and sample columns.

For this vignette, we will simulate some toy data using the simCytoData function. Specifically, we will simulate 15 markers and 3 samples, with each sample containing 10,000 cells. Hence in total, we will have a toy dataset containing 15 markers and 30,000 cells.

library(SuperCellCyto)

n_markers <- 15
n_samples <- 3
dat <- simCytoData(nmarkers = n_markers, ncells = rep(10000, n_samples))
head(dat)
#>    Marker_1  Marker_2 Marker_3 Marker_4 Marker_5 Marker_6 Marker_7 Marker_8
#>       <num>     <num>    <num>    <num>    <num>    <num>    <num>    <num>
#> 1: 14.11156  9.598622 11.86061 18.00292 18.87602 12.06980 13.33100 19.32722
#> 2: 17.87428 10.418514 12.72185 17.49065 19.15847 12.48764 11.78764 19.35367
#> 3: 16.51021  9.224557 10.96162 18.69124 17.25020 14.75426 13.66378 20.72191
#> 4: 16.48595 10.655714 11.23217 18.70994 20.31709 13.45817 11.75610 20.20985
#> 5: 16.75065  8.957354 13.46487 18.76156 19.03541 13.71786 13.42411 17.87916
#> 6: 15.49436 10.994633 12.57267 18.09175 19.74324 12.68352 13.62575 18.63249
#>    Marker_9 Marker_10 Marker_11 Marker_12 Marker_13 Marker_14 Marker_15
#>       <num>     <num>     <num>     <num>     <num>     <num>     <num>
#> 1: 17.64026  9.976136  14.32618  9.765389  13.29523 11.750356  9.927748
#> 2: 16.38115 10.339141  13.80488  9.286881  12.31639 10.623029 11.590447
#> 3: 18.02369  9.338943  14.48049  9.202693  13.70635 10.523010 10.811858
#> 4: 16.60652 10.606999  14.78955 10.594353  12.42579 12.184286 10.415856
#> 5: 17.79163 11.604362  13.41125  9.443441  14.04419 11.648467  9.879761
#> 6: 16.68128 10.546510  14.29600 11.218929  13.26562  8.945384  9.953343
#>      Sample Cell_Id
#>      <char>  <char>
#> 1: Sample_1  Cell_1
#> 2: Sample_1  Cell_2
#> 3: Sample_1  Cell_3
#> 4: Sample_1  Cell_4
#> 5: Sample_1  Cell_5
#> 6: Sample_1  Cell_6

For our toy dataset, we will transform the data using the arcsinh transformation. We will use the base R asinh function to do this:

# Specify which columns are the markers to transform
marker_cols <- paste0("Marker_", seq_len(n_markers))
# The co-factor for arc-sinh
cofactor <- 5

# Do the transformation
dat_asinh <- asinh(dat[, marker_cols, with = FALSE] / cofactor)

# Rename the new columns
marker_cols_asinh <- paste0(marker_cols, "_asinh")
names(dat_asinh) <- marker_cols_asinh

# Add them to our previously loaded data
dat <- cbind(dat, dat_asinh)

head(dat[, marker_cols_asinh, with = FALSE])
#>    Marker_1_asinh Marker_2_asinh Marker_3_asinh Marker_4_asinh Marker_5_asinh
#>             <num>          <num>          <num>          <num>          <num>
#> 1:       1.760707       1.407148       1.598662       1.992992       2.038698
#> 2:       1.986084       1.480454       1.663585       1.965206       2.053063
#> 3:       1.909866       1.372046       1.526479       2.029192       1.951904
#> 4:       1.908459       1.500792       1.548709       2.030158       2.109980
#> 5:       1.923712       1.346291       1.716609       2.032820       2.046829
#> 6:       1.849257       1.529216       1.652615       1.997735       2.082182
#>    Marker_6_asinh Marker_7_asinh Marker_8_asinh Marker_9_asinh Marker_10_asinh
#>             <num>          <num>          <num>          <num>           <num>
#> 1:       1.614794       1.707248       2.061551       1.973399        1.441499
#> 2:       1.646312       1.592978       2.062875       1.902358        1.473564
#> 3:       1.802789       1.730368       2.129148       1.994103        1.382896
#> 4:       1.716142       1.590512       2.104842       1.915434        1.496645
#> 5:       1.734078       1.713767       1.986347       1.981622        1.578569
#> 6:       1.660777       1.727751       2.026151       1.919736        1.491475
#>    Marker_11_asinh Marker_12_asinh Marker_13_asinh Marker_14_asinh
#>              <num>           <num>           <num>           <num>
#> 1:        1.774946        1.422452        1.704732        1.590062
#> 2:        1.740022        1.377970        1.633506        1.498011
#> 3:        1.785067        1.369960        1.733290        1.489460
#> 4:        1.805052        1.495566        1.641705        1.623522
#> 5:        1.712869        1.392718        1.756197        1.582054
#> 6:        1.772956        1.547631        1.702646        1.345124
#>    Marker_15_asinh
#>              <num>
#> 1:        1.437154
#> 2:        1.577468
#> 3:        1.513978
#> 4:        1.480224
#> 5:        1.432829
#> 6:        1.439455
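As a side note on the transformation above: asinh(x / cofactor) behaves roughly linearly for values well below the co-factor and roughly logarithmically for values well above it. The sketch below illustrates this with base R only. The raw values are made up (they are not taken from the toy data), and the co-factors 5 and 150 are chosen purely for illustration.

```r
# Made-up raw intensities spanning several orders of magnitude
raw <- c(0, 1, 10, 100, 1000)

# arcsinh transformation with two illustrative co-factors
asinh_cf5 <- asinh(raw / 5)
asinh_cf150 <- asinh(raw / 150)

# asinh(z) is the same as log(z + sqrt(z^2 + 1))
manual_cf5 <- log(raw / 5 + sqrt((raw / 5)^2 + 1))

data.frame(raw, asinh_cf5, asinh_cf150, manual_cf5)
```

A larger co-factor compresses the same raw values more strongly, so the appropriate value depends on the instrument and the data at hand.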

We will also create a column Cell_id_dummy which uniquely identifies each cell. It will have values such as Cell_1, Cell_2, all the way to Cell_x, where x is the number of cells in the dataset.

dat$Cell_id_dummy <- paste0("Cell_", seq_len(nrow(dat)))
head(dat$Cell_id_dummy, n = 10)
#>  [1] "Cell_1"  "Cell_2"  "Cell_3"  "Cell_4"  "Cell_5"  "Cell_6"  "Cell_7" 
#>  [8] "Cell_8"  "Cell_9"  "Cell_10"

By default, the simCytoData function generates cells for multiple samples, and the resulting data.table object already has a column called Sample that denotes the sample each cell comes from.

unique(dat$Sample)
#> [1] "Sample_1" "Sample_2" "Sample_3"

Let’s take note of the sample and cell ID columns for later.

sample_col <- "Sample"
cell_id_col <- "Cell_id_dummy"

0.4 Creating supercells

Now that we have our data, let’s create some supercells. To do this, we will use the runSuperCellCyto function and pass the markers, sample and cell ID columns as parameters.

We need to specify the markers because the function creates supercells based only on the expression of those markers. We highly recommend creating supercells using all markers in your data, be they cell type or cell state markers. However, if for any reason you want to use only a subset of the markers in your data, make sure you specify them in a vector that you later pass to the runSuperCellCyto function.

For this tutorial, we will use all the arcsinh transformed markers in the toy data.

supercells <- runSuperCellCyto(
    dt = dat,
    markers = marker_cols_asinh,
    sample_colname = sample_col,
    cell_id_colname = cell_id_col
)

Let’s dig deeper into the object it created:

class(supercells)
#> [1] "list"

It is a list containing 3 elements:

names(supercells)
#> [1] "supercell_expression_matrix" "supercell_cell_map"         
#> [3] "supercell_object"

0.4.1 Supercell object

The supercell_object contains the metadata used to create the supercells. It is a list, and each element contains the metadata used to create the supercells for one sample. This will come in handy if we need to either regenerate the supercells using different gamma values (so we get more or fewer supercells) or do some debugging later down the line. See the Controlling supercells granularity section below for more on regenerating supercells.

0.4.2 Supercell expression matrix

The supercell_expression_matrix contains the marker expression of each supercell. These are calculated by taking the average of the marker expression of all the cells contained within a supercell.

head(supercells$supercell_expression_matrix)
#>    Marker_1_asinh Marker_2_asinh Marker_3_asinh Marker_4_asinh Marker_5_asinh
#>             <num>          <num>          <num>          <num>          <num>
#> 1:       1.918714       1.363350       1.634905       1.974282       2.076365
#> 2:       1.898759       1.391264       1.647058       1.978942       2.073312
#> 3:       1.902137       1.355906       1.583481       2.012976       2.048937
#> 4:       1.911340       1.443371       1.583428       1.999973       2.074876
#> 5:       1.890141       1.426217       1.565545       2.009033       2.056306
#> 6:       1.871039       1.375714       1.596982       1.963112       2.054899
#>    Marker_6_asinh Marker_7_asinh Marker_8_asinh Marker_9_asinh Marker_10_asinh
#>             <num>          <num>          <num>          <num>           <num>
#> 1:       1.712418       1.746161       2.047954       1.899827        1.513845
#> 2:       1.696940       1.644858       2.029150       1.907937        1.574294
#> 3:       1.575609       1.642327       2.030645       1.913557        1.574551
#> 4:       1.733172       1.659265       2.046946       1.921459        1.508844
#> 5:       1.573096       1.614578       2.037746       1.931451        1.507199
#> 6:       1.754880       1.569758       2.030399       1.898520        1.480575
#>    Marker_11_asinh Marker_12_asinh Marker_13_asinh Marker_14_asinh
#>              <num>           <num>           <num>           <num>
#> 1:        1.754844        1.604455        1.781588        1.538798
#> 2:        1.777910        1.461293        1.788917        1.363175
#> 3:        1.785114        1.519700        1.695269        1.586754
#> 4:        1.826998        1.308689        1.658415        1.398335
#> 5:        1.768328        1.601401        1.797185        1.509926
#> 6:        1.777213        1.490051        1.778557        1.489969
#>    Marker_15_asinh   Sample                 SuperCellId
#>              <num>   <char>                      <char>
#> 1:        1.364427 Sample_1 SuperCell_1_Sample_Sample_1
#> 2:        1.388583 Sample_1 SuperCell_2_Sample_Sample_1
#> 3:        1.537212 Sample_1 SuperCell_3_Sample_Sample_1
#> 4:        1.390881 Sample_1 SuperCell_4_Sample_Sample_1
#> 5:        1.478060 Sample_1 SuperCell_5_Sample_Sample_1
#> 6:        1.470914 Sample_1 SuperCell_6_Sample_Sample_1

Therein, we will have the following columns:

- All the markers we previously specified in the marker_cols_asinh variable. In this example, they are the arcsinh transformed markers in our toy data.
- A column (Sample in this case) denoting which sample a supercell belongs to. Note that the column name is the same as what is stored in the sample_col variable.
- The SuperCellId column denoting the unique ID of the supercell.
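To make the averaging concrete, here is a minimal base R sketch that computes a supercell expression matrix by hand. The six cells, their marker values and their assignment to two supercells are all made up for illustration; runSuperCellCyto determines the real assignment for you.

```r
# Toy data: 6 cells, 2 markers, and a hypothetical cell-to-supercell assignment
cells <- data.frame(
    Marker_1_asinh = c(1.0, 1.2, 1.4, 2.0, 2.2, 2.4),
    Marker_2_asinh = c(0.5, 0.7, 0.9, 1.5, 1.7, 1.9),
    SuperCellId = rep(
        c("SuperCell_1_Sample_Sample_1", "SuperCell_2_Sample_Sample_1"),
        each = 3
    )
)

# Average every marker within each supercell
toy_exp_mat <- aggregate(
    cbind(Marker_1_asinh, Marker_2_asinh) ~ SuperCellId,
    data = cells,
    FUN = mean
)
toy_exp_mat
```

Each row of toy_exp_mat is one supercell, and each marker column holds the mean of that marker over the cells assigned to it.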

0.4.2.1 SuperCellId

Let’s have a look at SuperCellId:

head(unique(supercells$supercell_expression_matrix$SuperCellId))
#> [1] "SuperCell_1_Sample_Sample_1" "SuperCell_2_Sample_Sample_1"
#> [3] "SuperCell_3_Sample_Sample_1" "SuperCell_4_Sample_Sample_1"
#> [5] "SuperCell_5_Sample_Sample_1" "SuperCell_6_Sample_Sample_1"

Let’s break down one of them, SuperCell_1_Sample_Sample_1. The SuperCell_1 part is a running number (1 to however many supercells there are in a sample) used to uniquely identify each supercell within a sample. Note that the same numbers (SuperCell_1, SuperCell_2, and so on) are reused across different samples, e.g.,

supercell_ids <- unique(supercells$supercell_expression_matrix$SuperCellId)
supercell_ids[grep("SuperCell_1_", supercell_ids)]
#> [1] "SuperCell_1_Sample_Sample_1" "SuperCell_1_Sample_Sample_2"
#> [3] "SuperCell_1_Sample_Sample_3"

Although the IDs of these 3 supercells are all prefixed with SuperCell_1, they are not the same supercell! SuperCell_1_Sample_Sample_1 will only contain cells from Sample_1 while SuperCell_1_Sample_Sample_2 will only contain cells from Sample_2.

By now, you may have noticed that we appended the sample name to each supercell ID. This helps differentiate the supercells in different samples.
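If you ever need to take such an ID apart, the following base R sketch splits it into the per-sample running number and the sample name. It assumes the IDs follow the SuperCell_<number>_Sample_<sample name> pattern seen in the output above; the three IDs used here are just examples. In practice the Sample column already carries the sample name, so this is mainly for illustration.

```r
ids <- c(
    "SuperCell_1_Sample_Sample_1",
    "SuperCell_1_Sample_Sample_2",
    "SuperCell_12_Sample_Sample_3"
)

# Per-sample running number of the supercell
supercell_number <- as.integer(
    sub("^SuperCell_([0-9]+)_Sample_.*$", "\\1", ids)
)

# Sample name: everything after "SuperCell_<number>_Sample_"
sample_name <- sub("^SuperCell_[0-9]+_Sample_", "", ids)

data.frame(ids, supercell_number, sample_name)
```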

0.4.3 Supercell cell map

supercell_cell_map maps each cell in our dataset to the supercell it belongs to.

head(supercells$supercell_cell_map)
#>                      SuperCellID CellId   Sample
#>                           <char> <char>   <char>
#> 1: SuperCell_352_Sample_Sample_1 Cell_1 Sample_1
#> 2: SuperCell_201_Sample_Sample_1 Cell_2 Sample_1
#> 3: SuperCell_190_Sample_Sample_1 Cell_3 Sample_1
#> 4:  SuperCell_61_Sample_Sample_1 Cell_4 Sample_1
#> 5:  SuperCell_45_Sample_Sample_1 Cell_5 Sample_1
#> 6:  SuperCell_69_Sample_Sample_1 Cell_6 Sample_1

This map is very useful if we later need to expand the supercells back out to individual cells. It is also the reason why the dataset needs a column which uniquely identifies each cell.
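As an illustration of expanding supercells out, the sketch below transfers a supercell-level label to the individual cells by joining on the supercell ID. Both tables are small made-up stand-ins, and the cluster column is a hypothetical annotation (for example, the result of clustering the supercells). Note that, in the outputs shown above, the key is spelled SuperCellID in the cell map but SuperCellId in the expression matrix.

```r
# Toy cell map, shaped like supercell_cell_map
toy_cell_map <- data.frame(
    SuperCellID = c(
        "SuperCell_1_Sample_Sample_1", "SuperCell_1_Sample_Sample_1",
        "SuperCell_2_Sample_Sample_1", "SuperCell_2_Sample_Sample_1"
    ),
    CellId = paste0("Cell_", 1:4),
    Sample = "Sample_1"
)

# Hypothetical supercell-level annotation
toy_annotation <- data.frame(
    SuperCellId = c(
        "SuperCell_1_Sample_Sample_1", "SuperCell_2_Sample_Sample_1"
    ),
    cluster = c("A", "B")
)

# Join on the supercell ID; the key is spelled differently in the two tables
cells_annotated <- merge(
    toy_cell_map, toy_annotation,
    by.x = "SuperCellID", by.y = "SuperCellId"
)
cells_annotated[, c("CellId", "cluster")]
```

Every cell inherits the label of the supercell it belongs to.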

0.5 Running runSuperCellCyto in parallel

By default, runSuperCellCyto processes the samples one after the other. As each sample is processed independently of the others, we can process all of them in parallel.

To do this, we need to:

- Create a BiocParallelParam object from the BiocParallel package. This object can be of type MulticoreParam or SnowParam. We highly recommend consulting the BiocParallel vignette for more information.
- Set the number of tasks for the BiocParallelParam object to the number of samples in the dataset.
- Set the load_balancing parameter of the runSuperCellCyto function to TRUE. This ensures an even distribution of the supercell creation jobs. As each sample is processed by a parallel job, we don’t want a job that processes a large sample to also be assigned other smaller samples if possible. If you want to know more about how this feature works, please refer to our manuscript.

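library(BiocParallel)

supercell_par <- runSuperCellCyto(
    dt = dat,
    markers = marker_cols_asinh,
    sample_colname = sample_col,
    cell_id_colname = cell_id_col,
    # One task per sample. MulticoreParam relies on forking, which is not
    # available on Windows; SnowParam can be used there instead.
    BPPARAM = MulticoreParam(tasks = n_samples),
    load_balancing = TRUE
)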
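To give some intuition for load balancing, the sketch below orders samples from largest to smallest so that the biggest jobs can be handed out first. This only illustrates the general idea with made-up sample sizes. It is an assumption for the purpose of explanation, not a description of what runSuperCellCyto does internally; please refer to the manuscript for that.

```r
# Made-up sample labels: 500, 5000 and 50 cells in three samples
sample_labels <- rep(
    c("Sample_A", "Sample_B", "Sample_C"),
    times = c(500, 5000, 50)
)

# Number of cells per sample
n_cells_per_sample <- table(sample_labels)

# Order the samples from largest to smallest
processing_order <- names(sort(n_cells_per_sample, decreasing = TRUE))
processing_order
```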

0.6 Controlling supercells granularity

This is described in the runSuperCellCyto function’s documentation, but let’s briefly go through it here.

The runSuperCellCyto function has several parameters that can be customised to alter the composition of the supercells. The one most likely to be used is the gamma parameter, denoted as gam in the function. By default, gam is set to 20, which we found works well in most cases.

The gamma parameter controls how many supercells are generated and, indirectly, how many cells are captured within each supercell. It is defined by the formula gamma = n_cells / n_supercells, where n_cells denotes the number of cells and n_supercells denotes the number of supercells.

In general, the larger the gamma value, the fewer supercells we will get. Say, for instance, we have 10,000 cells. If gamma is set to 10, we will end up with about 1,000 supercells, whereas if gamma is set to 50, we will end up with about 200 supercells.

As you may have noticed from the sections above, runSuperCellCyto is run on each sample independently, and we can only set one value for the gamma parameter. For now, the same gamma value is used across all samples, so the number of supercells per sample depends on how many cells each sample contains. For instance, say we have 10,000 cells for sample 1 and 100,000 cells for sample 2. If gamma is set to 10, we will get 1,000 supercells (10,000/10) for sample 1 and 10,000 supercells (100,000/10) for sample 2.
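The arithmetic above can be reproduced with a few lines of base R. The sketch below rearranges the formula to n_supercells = n_cells / gamma and evaluates it for the two hypothetical samples and a few gamma values. The results are approximate counts; the exact numbers returned by runSuperCellCyto may differ slightly.

```r
# Cells per sample, as in the example above
n_cells <- c(Sample_1 = 10000, Sample_2 = 100000)

# Approximate number of supercells per sample for a few gamma values
gam_values <- c(10, 20, 50)
approx_n_supercells <- sapply(
    gam_values, function(gam) round(n_cells / gam)
)
colnames(approx_n_supercells) <- paste0("gamma_", gam_values)
approx_n_supercells
```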

Do note: whatever gamma value you choose, you should not expect each supercell to contain exactly the same number of cells. This behaviour is intentional to ensure rare cell types are not intermixed with non-rare cell types in a supercell.
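One way to see how uneven the supercell sizes are is to count the cells per supercell in the cell map. The sketch below does this on a small made-up cell map with a deliberately uneven assignment; with real output you would tabulate the SuperCellID column of supercell_cell_map in the same way.

```r
# Toy cell map with a hypothetical, deliberately uneven assignment
toy_cell_map <- data.frame(
    SuperCellID = rep(
        c(
            "SuperCell_1_Sample_Sample_1",
            "SuperCell_2_Sample_Sample_1",
            "SuperCell_3_Sample_Sample_1"
        ),
        times = c(2, 5, 3)
    ),
    CellId = paste0("Cell_", 1:10)
)

# Number of cells captured in each supercell
supercell_sizes <- table(toy_cell_map$SuperCellID)
supercell_sizes

# Distribution of the supercell sizes
summary(as.integer(supercell_sizes))
```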

0.6.1 Adjusting gamma value after one run of runSuperCellCyto

If you have run runSuperCellCyto once and have not discarded the SuperCell object it generated (no, seriously, please don’t!), you can use the object to quickly regenerate supercells using different gamma values.

As an example, using the SuperCell object we have generated for our toy dataset, we will regenerate the supercells using gamma of 10 and 50. The function to do this is recomputeSupercells. We will store the output in a list, one element per gamma value.

addt_gamma_vals <- c(10, 50)
supercells_addt_gamma <- lapply(addt_gamma_vals, function(gam) {
    recomputeSupercells(
        dt = dat,
        sc_objects = supercells$supercell_object,
        markers = marker_cols_asinh,
        sample_colname = sample_col,
        cell_id_colname = cell_id_col,
        gam = gam
    )
})

We should end up with a list containing 2 elements. The 1st element contains supercells generated using gamma = 10, and the 2nd contains supercells generated using gamma = 50.

supercells_addt_gamma[[1]]
#> $supercell_expression_matrix
#>       Marker_1_asinh Marker_2_asinh Marker_3_asinh Marker_4_asinh
#>                <num>          <num>          <num>          <num>
#>    1:      1.9169187      1.1999514       1.588854       1.990529
#>    2:      1.9148354      1.4137154       1.675118       2.013423
#>    3:      1.8840093      1.3982606       1.668060       1.994482
#>    4:      1.9023423      1.5111980       1.562015       1.985963
#>    5:      1.9309447      1.4087217       1.564687       2.008816
#>   ---                                                            
#> 2996:      0.9107372      1.2128674       1.436424       2.023500
#> 2997:      0.8926815      0.8030285       1.352205       2.007139
#> 2998:      1.0940560      0.8054350       1.444942       2.054105
#> 2999:      1.1582536      0.7730241       1.345353       2.004461
#> 3000:      0.9291563      0.7303235       1.391177       2.049737
#>       Marker_5_asinh Marker_6_asinh Marker_7_asinh Marker_8_asinh
#>                <num>          <num>          <num>          <num>
#>    1:       2.077942       1.728616       1.662543       2.062330
#>    2:       2.070545       1.649655       1.668810       2.045155
#>    3:       2.077142       1.722355       1.726207       2.022998
#>    4:       2.044133       1.745337       1.656945       2.033343
#>    5:       2.074758       1.689522       1.780723       2.053251
#>   ---                                                            
#> 2996:       1.411086       1.982632       1.955734       1.428565
#> 2997:       1.507856       2.007144       2.001514       1.373042
#> 2998:       1.391570       2.003222       1.936712       1.268264
#> 2999:       1.582658       2.036092       2.026537       1.282472
#> 3000:       1.522914       2.014550       2.002665       1.483833
#>       Marker_9_asinh Marker_10_asinh Marker_11_asinh Marker_12_asinh
#>                <num>           <num>           <num>           <num>
#>    1:       1.936505        1.429029        1.824300        1.582314
#>    2:       1.944328        1.525346        1.834899        1.422539
#>    3:       1.910562        1.554527        1.725051        1.498212
#>    4:       1.918626        1.574290        1.740156        1.555368
#>    5:       1.947010        1.592065        1.779867        1.497639
#>   ---                                                               
#> 2996:       1.993855        1.806149        1.856148        1.942023
#> 2997:       2.052679        1.842637        1.822284        1.993148
#> 2998:       2.017017        1.783485        1.744512        1.908353
#> 2999:       2.046247        1.891602        1.874307        1.992694
#> 3000:       2.029480        1.816034        1.871715        1.998296
#>       Marker_13_asinh Marker_14_asinh Marker_15_asinh   Sample
#>                 <num>           <num>           <num>   <char>
#>    1:        1.680428        1.517534        1.510864 Sample_1
#>    2:        1.831872        1.560578        1.300743 Sample_1
#>    3:        1.739176        1.517764        1.364693 Sample_1
#>    4:        1.680179        1.408415        1.381029 Sample_1
#>    5:        1.677598        1.614226        1.417906 Sample_1
#>   ---                                                         
#> 2996:        1.751926        1.963990        1.869169 Sample_3
#> 2997:        1.625121        1.975017        1.821959 Sample_3
#> 2998:        1.856183        1.925581        1.905517 Sample_3
#> 2999:        1.840811        1.994031        1.855652 Sample_3
#> 3000:        1.838138        1.956188        1.932302 Sample_3
#>                          SuperCellId
#>                               <char>
#>    1:    SuperCell_1_Sample_Sample_1
#>    2:    SuperCell_2_Sample_Sample_1
#>    3:    SuperCell_3_Sample_Sample_1
#>    4:    SuperCell_4_Sample_Sample_1
#>    5:    SuperCell_5_Sample_Sample_1
#>   ---                               
#> 2996:  SuperCell_996_Sample_Sample_3
#> 2997:  SuperCell_997_Sample_Sample_3
#> 2998:  SuperCell_998_Sample_Sample_3
#> 2999:  SuperCell_999_Sample_Sample_3
#> 3000: SuperCell_1000_Sample_Sample_3
#> 
#> $supercell_cell_map
#>                          SuperCellID     CellId   Sample
#>                               <char>     <char>   <char>
#>     1: SuperCell_322_Sample_Sample_1     Cell_1 Sample_1
#>     2: SuperCell_477_Sample_Sample_1     Cell_2 Sample_1
#>     3: SuperCell_595_Sample_Sample_1     Cell_3 Sample_1
#>     4: SuperCell_256_Sample_Sample_1     Cell_4 Sample_1
#>     5: SuperCell_559_Sample_Sample_1     Cell_5 Sample_1
#>    ---                                                  
#> 29996: SuperCell_391_Sample_Sample_3 Cell_29996 Sample_3
#> 29997:   SuperCell_5_Sample_Sample_3 Cell_29997 Sample_3
#> 29998: SuperCell_508_Sample_Sample_3 Cell_29998 Sample_3
#> 29999: SuperCell_177_Sample_Sample_3 Cell_29999 Sample_3
#> 30000: SuperCell_579_Sample_Sample_3 Cell_30000 Sample_3

The output generated by recomputeSupercells is a list with two elements:

- supercell_expression_matrix: A data.table object that contains the marker expression for each supercell.
- supercell_cell_map: A data.table that maps each cell to its corresponding supercell.

As mentioned before, gamma dictates the granularity of supercells. Compared to the previous run where gamma was set to 20, we should get more supercells for gamma = 10, and fewer for gamma = 50. Let’s see if that’s the case.

n_supercells_gamma20 <- nrow(supercells$supercell_expression_matrix)
n_supercells_gamma10 <- nrow(
    supercells_addt_gamma[[1]]$supercell_expression_matrix
)
n_supercells_gamma50 <- nrow(
    supercells_addt_gamma[[2]]$supercell_expression_matrix
)

n_supercells_gamma10 > n_supercells_gamma20
#> [1] TRUE

n_supercells_gamma50 < n_supercells_gamma20
#> [1] TRUE

0.6.2 Specifying different gamma value for different samples

In the future, we may add the ability to specify a different gam value for each sample. For now, to do this, we need to break our data down into multiple data.table objects, each containing data from one sample, and run the runSuperCellCyto function on each of them with a different gam value. Something like the following:

n_markers <- 10
dat <- simCytoData(nmarkers = n_markers)
markers_col <- paste0("Marker_", seq_len(n_markers))
sample_col <- "Sample"
cell_id_col <- "Cell_Id"

samples <- unique(dat[[sample_col]])
# One gamma value per sample, in the order of `samples`;
# values beyond the number of samples are simply not used
gam_values <- c(10, 20, 10)

supercells_diff_gam <- lapply(seq_along(samples), function(i) {
    sample <- samples[i]
    gam <- gam_values[i]
    # Keep only the cells that belong to the current sample
    dat_samp <- dat[dat[[sample_col]] == sample, ]
    supercell_samp <- runSuperCellCyto(
        dt = dat_samp,
        markers = markers_col,
        sample_colname = sample_col,
        cell_id_colname = cell_id_col,
        gam = gam
    )
    return(supercell_samp)
})

Subsequently, to extract and combine the supercell_expression_matrix and supercell_cell_map, we will need to use rbind:

supercell_expression_matrix <- do.call(
    "rbind", lapply(
        supercells_diff_gam, function(x) x[["supercell_expression_matrix"]]
    )
)

supercell_cell_map <- do.call(
    "rbind", lapply(
        supercells_diff_gam, function(x) x[["supercell_cell_map"]]
    )
)

rbind(
    head(supercell_expression_matrix, n = 3),
    tail(supercell_expression_matrix, n = 3)
)
#>    Marker_1 Marker_2 Marker_3 Marker_4  Marker_5  Marker_6  Marker_7  Marker_8
#>       <num>    <num>    <num>    <num>     <num>     <num>     <num>     <num>
#> 1: 8.080554 11.66585 14.16427 18.16930 10.773532  8.160025  7.721911 10.249991
#> 2: 7.454159 11.70921 14.33130 18.17277  9.085562  6.573177  9.079602 10.790958
#> 3: 6.970104 12.64828 13.76328 16.98876  9.508398  6.549137 10.262713  9.304498
#> 4: 9.719064 14.79318 11.23069 18.14624 14.942458  9.481382  9.388381  6.187012
#> 5: 9.312357 15.36611 12.70266 18.58997 17.255206 11.842509 10.301423  7.461776
#> 6: 8.311587 14.89450 13.70381 18.28349 15.754223 10.500806  8.461305  7.136647
#>    Marker_9 Marker_10   Sample                   SuperCellId
#>       <num>     <num>   <char>                        <char>
#> 1: 16.87994  15.44629 Sample_1   SuperCell_1_Sample_Sample_1
#> 2: 17.36751  15.80407 Sample_1   SuperCell_2_Sample_Sample_1
#> 3: 16.64069  16.29036 Sample_1   SuperCell_3_Sample_Sample_1
#> 4: 14.43216  13.38659 Sample_2 SuperCell_498_Sample_Sample_2
#> 5: 16.88279  12.49497 Sample_2 SuperCell_499_Sample_Sample_2
#> 6: 14.05176  12.70478 Sample_2 SuperCell_500_Sample_Sample_2

rbind(head(supercell_cell_map, n = 3), tail(supercell_cell_map, n = 3))
#>                      SuperCellID     CellId   Sample
#>                           <char>     <char>   <char>
#> 1:  SuperCell_51_Sample_Sample_1     Cell_1 Sample_1
#> 2: SuperCell_280_Sample_Sample_1     Cell_2 Sample_1
#> 3: SuperCell_846_Sample_Sample_1     Cell_3 Sample_1
#> 4:  SuperCell_82_Sample_Sample_2 Cell_19998 Sample_2
#> 5: SuperCell_237_Sample_Sample_2 Cell_19999 Sample_2
#> 6: SuperCell_215_Sample_Sample_2 Cell_20000 Sample_2

0.7 Mixing cells from different samples in a supercell

If for whatever reason you don’t mind (or, more to the point, want) each supercell to contain cells from different biological samples, you still need to have the sample column in your data.table. However, you need to set every value in that column to the same single value. That way, SuperCellCyto will treat all cells as coming from one sample.

Just note, the parallel processing feature in SuperCellCyto won’t work for this as you will essentially only have 1 sample and nothing for SuperCellCyto to parallelise.
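A minimal sketch of this setup is shown below on a made-up table. Rather than overwriting the existing Sample column, it adds a separate column holding one constant value, so the sample of origin is not lost. The column name Sample_pooled and the value all_samples are hypothetical choices; any single value will do.

```r
# Toy data with two biological samples
toy_dat <- data.frame(
    Marker_1_asinh = c(1.1, 1.3, 2.1, 2.3),
    Sample = c("Sample_1", "Sample_1", "Sample_2", "Sample_2"),
    Cell_Id = paste0("Cell_", 1:4)
)

# Hypothetical extra column holding the same value for every cell
toy_dat$Sample_pooled <- "all_samples"

unique(toy_dat$Sample_pooled)
```

You would then pass sample_colname = "Sample_pooled" to runSuperCellCyto so that all cells are treated as one sample.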

0.8 I have more cells than RAM in my computer

Is your dataset so huge that you are constantly running out of RAM when generating supercells? This happens, and we have a solution for it.

Since supercells are generated for each sample independently of the others, you can easily break up the process. For example:

1. Load a subset of the samples (say 1-10).
2. Generate supercells for those samples.
3. Save the output using the qs package.
4. Extract the supercell_expression_matrix and supercell_cell_map, and export them as CSV files using data.table’s fwrite function.
5. Load another set of samples (say 11-20) and repeat steps 2-4.

Once you have processed all the samples, you can then load all supercell_expression_matrix and supercell_cell_map csv files and analyse them.

If you want to regenerate the supercells using different gamma values, load the relevant output saved using the qs package together with the relevant data (remember to note which output belongs to which set of samples!), and run the recomputeSupercells function.
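The batching steps above can be sketched as follows. The first part, which splits sample names into batches, runs with base R alone. The process_batch function is a hypothetical helper, not part of SuperCellCyto: it assumes the SuperCellCyto, qs and data.table packages are installed, that dat_batch is a data.table holding only the cells of one batch (loading it is up to you), and that the columns are named Sample and Cell_Id. The sample names and file names are made up.

```r
# 25 made-up sample names, processed 10 at a time
sample_ids <- paste0("Sample_", 1:25)
batch_size <- 10
batches <- split(sample_ids, ceiling(seq_along(sample_ids) / batch_size))
lengths(batches)

# Steps 2-4 for one batch (hypothetical helper)
process_batch <- function(batch_id, dat_batch, markers) {
    sc <- SuperCellCyto::runSuperCellCyto(
        dt = dat_batch,
        markers = markers,
        sample_colname = "Sample",
        cell_id_colname = "Cell_Id"
    )
    # Keep the full output so recomputeSupercells can be run later
    qs::qsave(sc, paste0("supercells_batch_", batch_id, ".qs"))
    # Export the two tables needed for downstream analysis
    data.table::fwrite(
        sc$supercell_expression_matrix,
        paste0("supercell_expression_matrix_batch_", batch_id, ".csv")
    )
    data.table::fwrite(
        sc$supercell_cell_map,
        paste0("supercell_cell_map_batch_", batch_id, ".csv")
    )
    invisible(sc)
}
```

Including the batch identifier in every file name makes it easier to match each saved output to its set of samples later on.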

0.9 Session information

sessionInfo()
#> R Under development (unstable) (2025-10-20 r88955)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] parallel  stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#> [1] BiocParallel_1.45.0 SuperCellCyto_1.1.0 BiocStyle_2.39.0   
#> 
#> loaded via a namespace (and not attached):
#>  [1] cli_3.6.5           knitr_1.50          rlang_1.1.6        
#>  [4] xfun_0.54           jsonlite_2.0.0      data.table_1.17.8  
#>  [7] plyr_1.8.9          htmltools_0.5.8.1   sass_0.4.10        
#> [10] rmarkdown_2.30      grid_4.6.0          evaluate_1.0.5     
#> [13] jquerylib_0.1.4     fastmap_1.2.0       yaml_2.3.10        
#> [16] lifecycle_1.0.4     bookdown_0.45       BiocManager_1.30.26
#> [19] compiler_4.6.0      igraph_2.2.1        codetools_0.2-20   
#> [22] Rcpp_1.1.0          pkgconfig_2.0.3     lattice_0.22-7     
#> [25] digest_0.6.37       SuperCell_1.0.1     R6_2.6.1           
#> [28] RANN_2.6.2          magrittr_2.0.4      bslib_0.9.0        
#> [31] Matrix_1.7-4        tools_4.6.0         cachem_1.1.0

30 October 2025
