Using fgsea package

fgsea is an R-package for fast preranked gene set enrichment analysis (GSEA). This package allows to quickly and accurately calculate arbitrarily low GSEA P-values for a collection of gene sets. P-value estimation is based on an adaptive multi-level split Monte-Carlo scheme. See the preprint for algorithmic details.

Loading necessary libraries

library(fgsea)
library(data.table)
library(ggplot2)

Quick run

Loading example pathways and gene-level statistics and setting random seed:

data(examplePathways)
data(exampleRanks)
set.seed(42)

Running fgsea:

fgseaRes <- fgsea(pathways = examplePathways, 
                  stats    = exampleRanks,
                  minSize  = 15,
                  maxSize  = 500)
## Warning in fgseaMultilevel(...): For some pathways, in reality P-values are less
## than 1e-10. You can set the `eps` argument to zero for better estimation.

The resulting table contains enrichment scores and p-values:

head(fgseaRes[order(pval), ])
##                        pathway  pval         padj log2err        ES      NES
## 1: 5990979_Cell_Cycle,_Mitotic 1e-10 3.906667e-09      NA 0.5594755 2.769070
## 2:          5990980_Cell_Cycle 1e-10 3.906667e-09      NA 0.5388497 2.705894
## 3:     5990981_DNA_Replication 1e-10 3.906667e-09      NA 0.6440006 2.639381
## 4:    5990987_Synthesis_of_DNA 1e-10 3.906667e-09      NA 0.6478555 2.642865
## 5:             5990988_S_Phase 1e-10 3.906667e-09      NA 0.6013092 2.529646
## 6:     5990990_G1_S_Transition 1e-10 3.906667e-09      NA 0.6232905 2.573846
##    size                              leadingEdge
## 1:  317 66336,66977,12442,107995,66442,12571,...
## 2:  369 66336,66977,12442,107995,66442,19361,...
## 3:   82  57441,17219,69270,12575,69263,17215,...
## 4:   78  17219,69270,12575,69263,17215,68240,...
## 5:   98  67849,17219,69270,12575,69263,71988,...
## 6:   84  20135,13555,17219,12575,12448,17215,...

As you can see from the warning, fgsea has a default lower bound eps=1e-10 for estimating P-values. If you need to estimate P-value more accurately, you can set the eps argument to zero in the fgsea function.

fgseaRes <- fgsea(pathways = examplePathways, 
                  stats    = exampleRanks,
                  eps      = 0.0,
                  minSize  = 15,
                  maxSize  = 500)

head(fgseaRes[order(pval), ])
##                                            pathway         pval         padj
## 1:                              5990980_Cell_Cycle 2.535645e-26 1.485888e-23
## 2:                     5990979_Cell_Cycle,_Mitotic 9.351994e-26 2.740134e-23
## 3:                    5991851_Mitotic_Prometaphase 3.633805e-19 7.098033e-17
## 4: 5992217_Resolution_of_Sister_Chromatid_Cohesion 2.077985e-17 3.044248e-15
## 5:                                 5991454_M_Phase 2.251818e-14 2.639131e-12
## 6:          5991502_Mitotic_Metaphase_and_Anaphase 3.196758e-14 3.122167e-12
##      log2err        ES      NES size                              leadingEdge
## 1: 1.3344975 0.5388497 2.664606  369 66336,66977,12442,107995,66442,19361,...
## 2: 1.3188888 0.5594755 2.740246  317 66336,66977,12442,107995,66442,12571,...
## 3: 1.1330899 0.7253270 2.926512   82 66336,66977,12442,107995,66442,52276,...
## 4: 1.0768682 0.7347987 2.920436   74 66336,66977,12442,107995,66442,52276,...
## 5: 0.9759947 0.5576247 2.547515  173 66336,66977,12442,107995,66442,52276,...
## 6: 0.9653278 0.6052907 2.639370  123 66336,66977,107995,66442,52276,67629,...

One can make an enrichment plot for a pathway:

plotEnrichment(examplePathways[["5991130_Programmed_Cell_Death"]],
               exampleRanks) + labs(title="Programmed Cell Death")

Or make a table plot for a bunch of selected pathways:

topPathwaysUp <- fgseaRes[ES > 0][head(order(pval), n=10), pathway]
topPathwaysDown <- fgseaRes[ES < 0][head(order(pval), n=10), pathway]
topPathways <- c(topPathwaysUp, rev(topPathwaysDown))
plotGseaTable(examplePathways[topPathways], exampleRanks, fgseaRes, 
              gseaParam=0.5)

From the plot above one can see that there are very similar pathways in the table (for example 5991502_Mitotic_Metaphase_and_Anaphase and 5991600_Mitotic_Anaphase). To select only independent pathways one can use collapsePathways function:

collapsedPathways <- collapsePathways(fgseaRes[order(pval)][padj < 0.01], 
                                      examplePathways, exampleRanks)
mainPathways <- fgseaRes[pathway %in% collapsedPathways$mainPathways][
                         order(-NES), pathway]
plotGseaTable(examplePathways[mainPathways], exampleRanks, fgseaRes, 
              gseaParam = 0.5)

To save the results in a text format data:table::fwrite function can be used:

fwrite(fgseaRes, file="fgseaRes.txt", sep="\t", sep2=c("", " ", ""))

To make leading edge more human-readable it can be converted using mapIdsList (similar to AnnotationDbi::mapIds) function and a corresponding database (here org.Mm.eg.db for mouse):

library(org.Mm.eg.db)
fgseaResMain <- fgseaRes[match(mainPathways, pathway)]
fgseaResMain[, leadingEdge := mapIdsList(
                                     x=org.Mm.eg.db, 
                                     keys=leadingEdge,
                                     keytype="ENTREZID", 
                                     column="SYMBOL")]
fwrite(fgseaResMain, file="fgseaResMain.txt", sep="\t", sep2=c("", " ", ""))

Performance considerations

Also, fgsea is parallelized using BiocParallel package. By default the first registered backend returned by bpparam() is used. To tweak the parallelization one can either specify BPPARAM parameter used for bplapply of set nproc parameter, which is a shorthand for setting BPPARAM=MulticoreParam(workers = nproc).

Using Reactome pathways

For convenience there is reactomePathways function that obtains pathways from Reactome for given set of genes. Package reactome.db is required to be installed.

pathways <- reactomePathways(names(exampleRanks))
fgseaRes <- fgsea(pathways, exampleRanks, maxSize=500)
head(fgseaRes)
##                                                            pathway        pval
## 1:                      5-Phosphoribose 1-diphosphate biosynthesis 0.845857418
## 2: A tetrasaccharide linker sequence is required for GAG synthesis 0.662139219
## 3:                           ABC transporters in lipid homeostasis 0.175355450
## 4:                          ABC-family proteins mediated transport 0.402821317
## 5:                                    ABO blood group biosynthesis 0.973630832
## 6:                       ADP signalling through P2Y purinoceptor 1 0.009449158
##          padj    log2err         ES        NES size
## 1: 0.93691770 0.05163560  0.4267378  0.6868624    2
## 2: 0.85660244 0.05712585  0.3218180  0.8365521   11
## 3: 0.50980612 0.16197895 -0.4385385 -1.2903382   12
## 4: 0.71409233 0.07767986  0.2614189  1.0381043   66
## 5: 0.98680608 0.04754342  0.5120427  0.6839611    1
## 6: 0.08440896 0.38073040  0.6228710  1.7635617   16
##                                 leadingEdge
## 1:                             328099,19139
## 2: 14733,20971,20970,12032,29873,218271,...
## 3: 19299,27403,11307,11806,217265,27409,...
## 4: 17463,26440,26444,19179,228769,56325,...
## 5:                                    14344
## 6:  14696,14702,14700,14682,14676,66066,...

Starting from files

One can also start from .rnk and .gmt files as in original GSEA:

rnk.file <- system.file("extdata", "naive.vs.th1.rnk", package="fgsea")
gmt.file <- system.file("extdata", "mouse.reactome.gmt", package="fgsea")

Loading ranks:

ranks <- read.table(rnk.file,
                    header=TRUE, colClasses = c("character", "numeric"))
ranks <- setNames(ranks$t, ranks$ID)
str(ranks)
##  Named num [1:12000] -63.3 -49.7 -43.6 -41.5 -33.3 ...
##  - attr(*, "names")= chr [1:12000] "170942" "109711" "18124" "12775" ...

Loading pathways:

pathways <- gmtPathways(gmt.file)
str(head(pathways))
## List of 6
##  $ 1221633_Meiotic_Synapsis                                                : chr [1:64] "12189" "13006" "15077" "15078" ...
##  $ 1368092_Rora_activates_gene_expression                                  : chr [1:9] "11865" "12753" "12894" "18143" ...
##  $ 1368110_Bmal1:Clock,Npas2_activates_circadian_gene_expression           : chr [1:16] "11865" "11998" "12753" "12952" ...
##  $ 1445146_Translocation_of_Glut4_to_the_Plasma_Membrane                   : chr [1:55] "11461" "11465" "11651" "11652" ...
##  $ 186574_Endocrine-committed_Ngn3+_progenitor_cells                       : chr [1:4] "18012" "18088" "18506" "53626"
##  $ 186589_Late_stage_branching_morphogenesis_pancreatic_bud_precursor_cells: chr [1:4] "11925" "15205" "21410" "246086"

And runnig fgsea:

fgseaRes <- fgsea(pathways, ranks, minSize=15, maxSize=500)
head(fgseaRes)
##                                                                                    pathway
## 1:                                                                1221633_Meiotic_Synapsis
## 2:                                   1445146_Translocation_of_Glut4_to_the_Plasma_Membrane
## 3: 442533_Transcriptional_Regulation_of_Adipocyte_Differentiation_in_3T3-L1_Pre-adipocytes
## 4:                                                                  508751_Circadian_Clock
## 5:                                               5334727_Mus_musculus_biological_processes
## 6:                                        573389_NoRC_negatively_regulates_rRNA_expression
##         pval      padj    log2err         ES        NES size
## 1: 0.5152057 0.6830556 0.07182763  0.2885754  0.9561315   27
## 2: 0.7045455 0.8257273 0.05559471  0.2387284  0.8481883   39
## 3: 0.1075515 0.2626049 0.20658792 -0.3640706 -1.3578111   31
## 4: 0.7982301 0.8773456 0.05039643  0.2516324  0.7454045   17
## 5: 0.3582317 0.5489677 0.08243441  0.2469065  1.0599563  106
## 6: 0.3628319 0.5536965 0.08998608  0.3607407  1.0686133   17
##                                 leadingEdge
## 1:                  15270,12189,71846,19357
## 2:  17918,19341,20336,22628,22627,20619,...
## 3: 76199,19014,26896,229003,17977,17978,...
## 4:                        20893,59027,19883
## 5:  60406,19361,15270,20893,12189,68240,...
## 6:                 60406,20018,245688,20017