clusterProfiler supports enrichment analysis of Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) with either the hypergeometric test or Gene Set Enrichment Analysis (GSEA). clusterProfiler adjusts the estimated significance level to account for multiple hypothesis testing and also calculates q-values for FDR control. It supports several visualization methods, including barplot, cnetplot, enrichMap and gseaplot. clusterProfiler also supports comparing functional profiles among gene clusters. It supports comparing biological themes of GO, KEGG, Disease Ontology (via DOSE) and Reactome pathways (via ReactomePA).
If you use clusterProfiler in published research, please cite G. Yu (2012). In addition, please cite G. Yu (2010) when using GOSemSim for GO semantic similarity analysis, G. Yu (2015) when using DOSE for Disease Ontology analysis, G. Yu (2016) when using ReactomePA for Reactome pathway analysis, and G. Yu (2015) when applying enrichment analysis to NGS data by using ChIPseeker.
G Yu, LG Wang, Y Han, QY He.
clusterProfiler: an R package for comparing biological themes among gene clusters.
OMICS: A Journal of Integrative Biology 2012, 16(5):284-287.
URL: http://dx.doi.org/10.1089/omi.2011.0118
G Yu, F Li, Y Qin, X Bo, Y Wu, S Wang.
GOSemSim: an R package for measuring semantic similarity among GO terms and gene products.
Bioinformatics 2010, 26(7):976-978.
URL: http://dx.doi.org/10.1093/bioinformatics/btq064
G Yu, LG Wang, GR Yan, QY He.
DOSE: an R/Bioconductor package for Disease Ontology Semantic and Enrichment analysis.
Bioinformatics 2015, 31(4):608-609.
URL: http://dx.doi.org/10.1093/bioinformatics/btu684
G Yu, QY He.
ReactomePA: an R/Bioconductor package for reactome pathway analysis and visualization.
Molecular BioSystems 2016, 12(2):477-479.
URL: http://dx.doi.org/10.1039/C5MB00663E
G Yu, LG Wang, QY He.
ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparison and visualization.
Bioinformatics 2015, 31(14):2382-2383.
In recent years, high-throughput experimental techniques such as microarray, RNA-Seq and mass spectrometry have made it possible to detect cellular molecules at the systems level. These kinds of analyses generate huge quantities of data, which need to be given a biological interpretation. A commonly used approach is clustering in the gene dimension, grouping genes based on their similarities [1].
To search for shared functions among genes, a common way is to incorporate biological knowledge, such as Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG), to identify the predominant biological themes of a collection of genes.
After clustering analysis, researchers not only want to determine whether a particular gene cluster has a common theme, but also to compare the biological themes among gene clusters. Manually choosing interesting clusters and then running enrichment analysis on each selected cluster is slow and tedious. To bridge this gap, we designed clusterProfiler [2] for comparing and visualizing functional profiles among gene clusters.
Many new R users find translating IDs a tedious task, and I have received a lot of feedback from clusterProfiler users who don’t know how to convert gene symbols, UniProt IDs or other ID types to the Entrez gene IDs that clusterProfiler uses for most species.
To remove this obstacle, we provide the bitr function for translating among different gene ID types.
library(clusterProfiler)
x <- c("GPX3", "GLRX", "LBP", "CRYAB", "DEFB1", "HCLS1", "SOD2", "HSPA2",
       "ORM1", "IGFBP1", "PTHLH", "GPC3", "IGFBP3","TOB1", "MITF", "NDRG1",
       "NR1H4", "FGFR3", "PVR", "IL6", "PTPRM", "ERBB2", "NID2", "LAMB1",
       "COMP", "PLS3", "MCAM", "SPP1", "LAMC1", "COL4A2", "COL4A1", "MYOC",
       "ANXA4", "TFPI2", "CST6", "SLPI", "TIMP2", "CPM", "GGT1", "NNMT",
       "MAL", "EEF1A2", "HGD", "TCN2", "CDA", "PCCA", "CRYM", "PDXK",
       "STC1", "WARS", "HMOX1", "FXYD2", "RBP4", "SLC6A12", "KDELR3", "ITM2B")
eg <- bitr(x, fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Hs.eg.db")
head(eg)
## SYMBOL ENTREZID
## 1 GPX3 2878
## 2 GLRX 2745
## 3 LBP 3929
## 4 CRYAB 1410
## 5 DEFB1 1672
## 6 HCLS1 3059
Users should provide an annotation package; both fromType and toType accept any ID type that it supports.
Use keytypes to list all supported types.
library(org.Hs.eg.db)
keytypes(org.Hs.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT"
## [5] "ENSEMBLTRANS" "ENTREZID" "ENZYME" "EVIDENCE"
## [9] "EVIDENCEALL" "GENENAME" "GO" "GOALL"
## [13] "IPI" "MAP" "OMIM" "ONTOLOGY"
## [17] "ONTOLOGYALL" "PATH" "PFAM" "PMID"
## [21] "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
## [25] "UNIGENE" "UNIPROT"
We can also translate one ID type into several other types at once.
ids <- bitr(x, fromType="SYMBOL", toType=c("UNIPROT", "ENSEMBL"), OrgDb="org.Hs.eg.db")
head(ids)
## SYMBOL UNIPROT ENSEMBL
## 1 GPX3 P22352 ENSG00000211445
## 2 GLRX A0A024RAM2 ENSG00000173221
## 3 GLRX P35754 ENSG00000173221
## 4 LBP P18428 ENSG00000129988
## 5 LBP Q8TCF0 ENSG00000129988
## 6 CRYAB P02511 ENSG00000109846
GO analyses (groupGO(), enrichGO() and gseGO()) support organisms that have an OrgDb object available.
Bioconductor already provides OrgDb packages for about 20 species, see http://bioconductor.org/packages/release/BiocViews.html#___OrgDb.
For other species, we can query AnnotationHub for an OrgDb object.
For example:
library(AnnotationHub)
hub <- AnnotationHub()
query(hub, "Cricetulus")
## AnnotationHub with 5 records
## # snapshotDate(): 2016-08-15
## # $dataprovider: UCSC, Inparanoid8, NCBI, ftp://ftp.ncbi.nlm.nih.gov/ge...
## # $species: Cricetulus griseus
## # $rdataclass: OrgDb, ChainFile, Inparanoid8Db, TwoBitFile
## # additional mcols(): taxonomyid, genome, description, tags,
## # sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH10393"]]'
##
## title
## AH10393 | hom.Cricetulus_griseus.inp8.sqlite
## AH12820 | org.Cricetulus_griseus.eg.sqlite
## AH13980 | criGri1.2bit
## AH14346 | criGri1ToHg19.over.chain.gz
## AH48061 | org.Cricetulus_griseus.eg.sqlite
Cgriseus <- hub[["AH48061"]]
Cgriseus
## OrgDb object:
## | DBSCHEMAVERSION: 2.1
## | DBSCHEMA: NOSCHEMA_DB
## | ORGANISM: Cricetulus griseus
## | SPECIES: Cricetulus griseus
## | CENTRALID: GID
## | Taxonomy ID: 10029
## | Db type: OrgDb
## | Supporting package: AnnotationDbi
If an organism is not supported by AnnotationHub, users can use AnnotationForge to build an OrgDb.
If users have GO annotation data (in data.frame format, with gene ID in the first column and GO ID in the second), they can use the enricher() and gseGO() functions to perform an over-representation test and gene set enrichment analysis.
If a gene is annotated to a GO term (direct annotation), it should also be annotated to that term’s ancestor GO nodes (indirect annotation). If users only have direct annotation, they can pass it to the buildGOmap function, which will infer the indirect annotation and generate a data.frame that is suitable for both enricher() and gseGO().
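The idea of inferring indirect annotation can be sketched in base R. The toy ontology, the gene IDs and the helpers ancestors and propagate_annotation below are hypothetical and only illustrate the principle; this is not how buildGOmap is implemented.

```r
# Toy ontology: each term maps to its direct parent(s); "root" has none.
# Term names and gene IDs are made up for illustration.
parents <- list(
  root  = character(0),
  termA = "root",
  termB = "termA",
  termC = c("termA", "root")
)

# Collect all ancestors of a term by walking up the parent links.
ancestors <- function(term, parents) {
  found <- character(0)
  todo  <- parents[[term]]
  while (length(todo) > 0) {
    current <- todo[1]
    todo    <- todo[-1]
    if (!current %in% found) {
      found <- c(found, current)
      todo  <- c(todo, parents[[current]])
    }
  }
  found
}

# Direct annotation: first column gene ID, second column term ID.
direct <- data.frame(gene = c("g1", "g2"),
                     term = c("termB", "termC"),
                     stringsAsFactors = FALSE)

# Add one row per (gene, ancestor) pair, then drop duplicated rows.
propagate_annotation <- function(direct, parents) {
  indirect <- do.call(rbind, lapply(seq_len(nrow(direct)), function(i) {
    anc <- ancestors(direct$term[i], parents)
    data.frame(gene = rep(direct$gene[i], length(anc)), term = anc,
               stringsAsFactors = FALSE)
  }))
  unique(rbind(direct, indirect))
}

full <- propagate_annotation(direct, parents)
```

After propagation, each gene is annotated to its original term and to every ancestor of that term, which is the property the enrichment test relies on.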
In clusterProfiler, groupGO is designed for gene classification based on GO distribution at a specific level. Here we use the dataset geneList provided by DOSE. Please refer to the vignette of DOSE for more details.
data(geneList, package = "DOSE")
gene <- names(geneList)[abs(geneList) > 2]
gene.df <- bitr(gene, fromType = "ENTREZID",
                toType = c("ENSEMBL", "SYMBOL"),
                OrgDb = org.Hs.eg.db)
head(gene.df)
## ENTREZID ENSEMBL SYMBOL
## 1 4312 ENSG00000196611 MMP1
## 2 8318 ENSG00000093009 CDC45
## 3 10874 ENSG00000109255 NMU
## 4 55143 ENSG00000134690 CDCA8
## 5 55388 ENSG00000065328 MCM10
## 6 991 ENSG00000117399 CDC20
ggo <- groupGO(gene = gene,
               OrgDb = org.Hs.eg.db,
               ont = "BP",
               level = 3,
               readable = TRUE)
head(summary(ggo))
## ID Description Count
## GO:0003006 GO:0003006 developmental process involved in reproduction 13
## GO:0019953 GO:0019953 sexual reproduction 11
## GO:0019954 GO:0019954 asexual reproduction 0
## GO:0022414 GO:0022414 reproductive process 25
## GO:0032504 GO:0032504 multicellular organism reproduction 10
## GO:0032505 GO:0032505 reproduction of a single-celled organism 0
## GeneRatio
## GO:0003006 13/207
## GO:0019953 11/207
## GO:0019954 0/207
## GO:0022414 25/207
## GO:0032504 10/207
## GO:0032505 0/207
## geneID
## GO:0003006 E2F8/ASPM/KIF18A/TRIP13/AURKA/CCNB1/DACH1/BMP4/NDP/FOXA1/STC2/GATA3/PGR
## GO:0019953 ASPM/CDK1/TRIP13/KIFC1/AURKA/CCNB1/PTTG1/GAMT/BMP4/DNALI1/PGR
## GO:0019954
## GO:0022414 CDC20/TOP2A/E2F8/ASPM/NEK2/CDK1/KIF18A/TRIP13/KIFC1/IDO1/AURKA/CCNB1/PTTG1/COL16A1/DACH1/CORIN/GAMT/BMP4/MAPT/NDP/FOXA1/STC2/GATA3/DNALI1/PGR
## GO:0032504 ASPM/TRIP13/KIFC1/AURKA/CCNB1/PTTG1/GAMT/BMP4/STC2/PGR
## GO:0032505
The input parameter gene is a vector of gene IDs (it can be any ID type that is supported by the corresponding OrgDb).
If readable is set to TRUE, the input gene IDs will be converted to gene symbols.
The over-representation test [3] is implemented in clusterProfiler. For calculation details and an explanation of the parameters, please refer to the vignette of DOSE.
ego <- enrichGO(gene = gene,
                universe = names(geneList),
                OrgDb = org.Hs.eg.db,
                ont = "CC",
                pAdjustMethod = "BH",
                pvalueCutoff = 0.01,
                qvalueCutoff = 0.05,
                readable = TRUE)
head(summary(ego))
## ID Description GeneRatio
## GO:0005819 GO:0005819 spindle 24/198
## GO:0005876 GO:0005876 spindle microtubule 11/198
## GO:0000793 GO:0000793 condensed chromosome 17/198
## GO:0000779 GO:0000779 condensed chromosome, centromeric region 13/198
## GO:0005875 GO:0005875 microtubule associated complex 14/198
## GO:0000776 GO:0000776 kinetochore 13/198
## BgRatio pvalue p.adjust qvalue
## GO:0005819 228/11661 7.281150e-13 1.747476e-10 1.540538e-10
## GO:0005876 46/11661 2.034306e-10 2.441167e-08 2.152081e-08
## GO:0000793 153/11661 8.307384e-10 5.383064e-08 4.745596e-08
## GO:0000779 81/11661 8.971773e-10 5.383064e-08 4.745596e-08
## GO:0005875 108/11661 3.603535e-09 1.729697e-07 1.524864e-07
## GO:0000776 97/11661 8.783371e-09 3.513349e-07 3.097294e-07
## geneID
## GO:0005819 CDCA8/CDC20/KIF23/CENPE/ASPM/DLGAP5/SKA1/NUSAP1/TPX2/NEK2/CDK1/MAD2L1/KIF18A/BIRC5/KIF11/TTK/AURKB/PRC1/KIFC1/KIF18B/KIF20A/AURKA/CCNB1/KIF4A
## GO:0005876 SKA1/NUSAP1/CDK1/KIF18A/BIRC5/KIF11/AURKB/PRC1/KIF18B/AURKA/KIF4A
## GO:0000793 CENPE/NDC80/TOP2A/NCAPH/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/BIRC5/NCAPG/AURKB/CHEK1/AURKA/CCNB1
## GO:0000779 CENPE/NDC80/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/BIRC5/AURKB/AURKA/CCNB1
## GO:0005875 CDCA8/KIF23/CENPE/KIF18A/BIRC5/KIF11/AURKB/KIFC1/KIF18B/KIF20A/AURKA/KIF4A/MAPT/DNALI1
## GO:0000776 CENPE/NDC80/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/KIF18A/BIRC5/AURKB/CCNB1
## Count
## GO:0005819 24
## GO:0005876 11
## GO:0000793 17
## GO:0000779 13
## GO:0005875 14
## GO:0000776 13
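The GeneRatio and BgRatio columns hold the counts that an over-representation p-value is computed from. The following base R sketch assumes the usual one-sided hypergeometric test (see the DOSE vignette for the actual calculation) and takes its counts from the spindle row of the output above. The Benjamini-Hochberg step is applied only to the six p-values printed above, so its result will differ from the p.adjust column, which was computed over all tested terms.

```r
# Counts from the "spindle" row above: GeneRatio 24/198, BgRatio 228/11661.
k <- 24      # input genes annotated to the term
n <- 198     # input genes with any annotation
M <- 228     # background genes annotated to the term
N <- 11661   # background genes with any annotation

# P(X >= k) when drawing n genes without replacement from N genes,
# M of which are annotated to the term.
p_spindle <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)

# Benjamini-Hochberg adjustment of the six p-values shown in the table.
pvals <- c(7.281150e-13, 2.034306e-10, 8.307384e-10,
           8.971773e-10, 3.603535e-09, 8.783371e-09)
padj <- p.adjust(pvals, method = "BH")
```

Subtracting 1 from k matters: phyper with lower.tail = FALSE returns P(X > q), so q = k - 1 gives the probability of observing at least k annotated genes.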
As mentioned before, any gene ID type supported by the OrgDb can be used directly in GO analyses. Users need to set the keytype parameter to specify the input gene ID type.
ego2 <- enrichGO(gene = gene.df$ENSEMBL,
                 OrgDb = org.Hs.eg.db,
                 keytype = 'ENSEMBL',
                 ont = "CC",
                 pAdjustMethod = "BH",
                 pvalueCutoff = 0.01,
                 qvalueCutoff = 0.05)
head(summary(ego2))
## ID Description GeneRatio
## GO:0005819 GO:0005819 spindle 28/230
## GO:0005875 GO:0005875 microtubule associated complex 19/230
## GO:0005876 GO:0005876 spindle microtubule 12/230
## GO:0042613 GO:0042613 MHC class II protein complex 14/230
## GO:0005874 GO:0005874 microtubule 26/230
## GO:0030669 GO:0030669 clathrin-coated endocytic vesicle membrane 14/230
## BgRatio pvalue p.adjust qvalue
## GO:0005819 301/19684 2.099990e-17 4.724977e-15 3.448405e-15
## GO:0005875 156/19684 2.718025e-14 3.057778e-12 2.231642e-12
## GO:0005876 56/19684 1.724237e-12 1.293178e-10 9.437930e-11
## GO:0042613 95/19684 5.264985e-12 2.961554e-10 2.161415e-10
## GO:0005874 440/19684 1.276404e-11 5.743819e-10 4.191981e-10
## GO:0030669 104/19684 1.866844e-11 6.000568e-10 4.379362e-10
## geneID
## GO:0005819 ENSG00000134690/ENSG00000117399/ENSG00000137807/ENSG00000138778/ENSG00000066279/ENSG00000126787/ENSG00000154839/ENSG00000262634/ENSG00000137804/ENSG00000088325/ENSG00000117650/ENSG00000170312/ENSG00000164109/ENSG00000121621/ENSG00000089685/ENSG00000138160/ENSG00000112742/ENSG00000178999/ENSG00000198901/ENSG00000237649/ENSG00000233450/ENSG00000056678/ENSG00000204197/ENSG00000186185/ENSG00000112984/ENSG00000087586/ENSG00000134057/ENSG00000090889
## GO:0005875 ENSG00000134690/ENSG00000137807/ENSG00000138778/ENSG00000121621/ENSG00000089685/ENSG00000138160/ENSG00000178999/ENSG00000237649/ENSG00000233450/ENSG00000056678/ENSG00000204197/ENSG00000186185/ENSG00000112984/ENSG00000087586/ENSG00000090889/ENSG00000186868/ENSG00000277956/ENSG00000276155/ENSG00000163879
## GO:0005876 ENSG00000154839/ENSG00000262634/ENSG00000137804/ENSG00000170312/ENSG00000121621/ENSG00000089685/ENSG00000138160/ENSG00000178999/ENSG00000198901/ENSG00000186185/ENSG00000087586/ENSG00000090889
## GO:0042613 ENSG00000196735/ENSG00000232062/ENSG00000225890/ENSG00000257473/ENSG00000206301/ENSG00000223793/ENSG00000228284/ENSG00000231526/ENSG00000206305/ENSG00000236418/ENSG00000225103/ENSG00000233192/ENSG00000237541/ENSG00000231823
## GO:0005874 ENSG00000137807/ENSG00000138778/ENSG00000066279/ENSG00000154839/ENSG00000262634/ENSG00000137804/ENSG00000088325/ENSG00000117650/ENSG00000170312/ENSG00000121621/ENSG00000089685/ENSG00000138160/ENSG00000178999/ENSG00000198901/ENSG00000237649/ENSG00000233450/ENSG00000056678/ENSG00000204197/ENSG00000186185/ENSG00000112984/ENSG00000087586/ENSG00000090889/ENSG00000127603/ENSG00000186868/ENSG00000277956/ENSG00000276155
## GO:0030669 ENSG00000196735/ENSG00000232062/ENSG00000225890/ENSG00000257473/ENSG00000206301/ENSG00000223793/ENSG00000228284/ENSG00000231526/ENSG00000206305/ENSG00000236418/ENSG00000225103/ENSG00000233192/ENSG00000237541/ENSG00000231823
## Count
## GO:0005819 28
## GO:0005875 19
## GO:0005876 12
## GO:0042613 14
## GO:0005874 26
## GO:0030669 14
Gene IDs can be mapped to gene symbols by using the parameter readable=TRUE or the setReadable function.
ego2 <- setReadable(ego2, OrgDb = org.Hs.eg.db)
head(summary(ego2))
## ID Description GeneRatio
## GO:0005819 GO:0005819 spindle 28/230
## GO:0005875 GO:0005875 microtubule associated complex 19/230
## GO:0005876 GO:0005876 spindle microtubule 12/230
## GO:0042613 GO:0042613 MHC class II protein complex 14/230
## GO:0005874 GO:0005874 microtubule 26/230
## GO:0030669 GO:0030669 clathrin-coated endocytic vesicle membrane 14/230
## BgRatio pvalue p.adjust qvalue
## GO:0005819 301/19684 2.099990e-17 4.724977e-15 3.448405e-15
## GO:0005875 156/19684 2.718025e-14 3.057778e-12 2.231642e-12
## GO:0005876 56/19684 1.724237e-12 1.293178e-10 9.437930e-11
## GO:0042613 95/19684 5.264985e-12 2.961554e-10 2.161415e-10
## GO:0005874 440/19684 1.276404e-11 5.743819e-10 4.191981e-10
## GO:0030669 104/19684 1.866844e-11 6.000568e-10 4.379362e-10
## geneID
## GO:0005819 CDCA8/CDC20/KIF23/CENPE/ASPM/DLGAP5/SKA1/SKA1/NUSAP1/TPX2/NEK2/CDK1/MAD2L1/KIF18A/BIRC5/KIF11/TTK/AURKB/PRC1/KIFC1/KIFC1/KIFC1/KIFC1/KIF18B/KIF20A/AURKA/CCNB1/KIF4A
## GO:0005875 CDCA8/KIF23/CENPE/KIF18A/BIRC5/KIF11/AURKB/KIFC1/KIFC1/KIFC1/KIFC1/KIF18B/KIF20A/AURKA/KIF4A/MAPT/MAPT/MAPT/DNALI1
## GO:0005876 SKA1/SKA1/NUSAP1/CDK1/KIF18A/BIRC5/KIF11/AURKB/PRC1/KIF18B/AURKA/KIF4A
## GO:0042613 HLA-DQA1/HLA-DQA1/HLA-DQA1/HLA-DQA1/HLA-DQA1/HLA-DQA1/HLA-DQA1/HLA-DQA1/HLA-DQA1/HLA-DQA1/HLA-DQA1/HLA-DQA1/HLA-DQA1/HLA-DQA1
## GO:0005874 KIF23/CENPE/ASPM/SKA1/SKA1/NUSAP1/TPX2/NEK2/CDK1/KIF18A/BIRC5/KIF11/AURKB/PRC1/KIFC1/KIFC1/KIFC1/KIFC1/KIF18B/KIF20A/AURKA/KIF4A/MACF1/MAPT/MAPT/MAPT
## GO:0030669 HLA-DQA1/HLA-DQA1/HLA-DQA1/HLA-DQA1/HLA-DQA1/HLA-DQA1/HLA-DQA1/HLA-DQA1/HLA-DQA1/HLA-DQA1/HLA-DQA1/HLA-DQA1/HLA-DQA1/HLA-DQA1
## Count
## GO:0005819 28
## GO:0005875 19
## GO:0005876 12
## GO:0042613 14
## GO:0005874 26
## GO:0030669 14
enrichGO tests the whole GO corpus, and the enriched result may contain very general terms. With the dropGO function, users can remove specific GO terms or a GO level from results obtained from both enrichGO and compareCluster.
enrichGO doesn’t have a parameter to restrict the test to a specific GO level. Instead, we provide a function, gofilter, to restrict the result to a specific GO level. It works with results obtained from both enrichGO and compareCluster.
In response to issue #28, I implemented a simplify method to remove redundant GO terms obtained from enrichGO. An example can be found in the blog post. It internally calls GOSemSim to calculate similarities among GO terms and removes highly similar terms by keeping one representative term. The simplify method works with the output of both enrichGO and compareCluster.
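The principle of keeping one representative term among highly similar terms can be sketched in base R. Everything below is an assumption made for illustration: the term IDs, the similarity matrix, the 0.7 cutoff and the helper drop_redundant are hypothetical, and the greedy rule is one plausible strategy rather than the package's actual implementation.

```r
# Hypothetical enrichment result: term IDs with adjusted p-values.
res <- data.frame(ID = c("T1", "T2", "T3", "T4"),
                  p.adjust = c(0.001, 0.002, 0.010, 0.020),
                  stringsAsFactors = FALSE)

# Hypothetical pairwise semantic similarity between the terms (symmetric).
sim <- matrix(c(1.0, 0.9, 0.2, 0.1,
                0.9, 1.0, 0.3, 0.2,
                0.2, 0.3, 1.0, 0.8,
                0.1, 0.2, 0.8, 1.0),
              nrow = 4, dimnames = list(res$ID, res$ID))

# Greedy filter: visit terms from most to least significant and keep a term
# only if it is not too similar to any term that was already kept.
drop_redundant <- function(res, sim, cutoff = 0.7) {
  ord  <- res$ID[order(res$p.adjust)]
  kept <- character(0)
  for (id in ord) {
    if (length(kept) == 0 || all(sim[id, kept] <= cutoff)) {
      kept <- c(kept, id)
    }
  }
  res[res$ID %in% kept, , drop = FALSE]
}

slim <- drop_redundant(res, sim)
```

In this toy input T2 is dropped because it is highly similar to the more significant T1, and T4 is dropped in favour of T3.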
A common approach to analyzing gene expression profiles is to identify differentially expressed genes that are deemed interesting. The enrichment analyses demonstrated previously were based on these differentially expressed genes. This approach will find genes where the difference is large, but it will not detect a situation where the difference is small but evidenced in a coordinated way in a set of related genes. Gene Set Enrichment Analysis (GSEA) [4] directly addresses this limitation. All genes can be used in GSEA; GSEA aggregates the per-gene statistics across genes within a gene set, making it possible to detect situations where all genes in a predefined set change in a small but coordinated way. This matters because many relevant phenotypic differences are likely to be manifested by small but consistent changes in a set of genes.
For algorithm details, please refer to the vignette of DOSE.
ego3 <- gseGO(geneList = geneList,
              OrgDb = org.Hs.eg.db,
              ont = "CC",
              nPerm = 1000,
              minGSSize = 120,
              pvalueCutoff = 0.01,
              verbose = FALSE)
GSEA uses a permutation test; users can set nPerm to the number of permutations. Gene sets smaller than minGSSize will be omitted.
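To make the role of nPerm concrete, here is a deliberately simplified base R sketch of a running enrichment score with a gene-label permutation test. It uses an unweighted running sum on made-up data, whereas GSEA as published uses a weighted statistic, so treat it as an illustration of the idea and not as the algorithm used by gseGO. The helper running_es and the toy gene names are hypothetical.

```r
set.seed(1)

# Toy ranked gene list: a named numeric vector sorted in decreasing order,
# the same shape as geneList. Gene names and values are made up.
stat <- sort(setNames(rnorm(200), paste0("g", 1:200)), decreasing = TRUE)

# Toy gene set made of the 15 top-ranked genes, so it is clearly enriched.
gene_set <- names(stat)[1:15]

# Unweighted running sum: step up at genes in the set, step down otherwise.
# The enrichment score is the running value with the largest absolute size.
running_es <- function(ranked_names, gene_set) {
  hit     <- ranked_names %in% gene_set
  n_hit   <- sum(hit)
  n_miss  <- length(hit) - n_hit
  steps   <- ifelse(hit, 1 / n_hit, -1 / n_miss)
  running <- cumsum(steps)
  running[which.max(abs(running))]
}

es_obs <- running_es(names(stat), gene_set)

# Permutation test: shuffle the gene labels nPerm times and count how often
# a shuffled score is at least as extreme as the observed one.
nPerm   <- 1000
es_null <- replicate(nPerm, running_es(sample(names(stat)), gene_set))
p_value <- (sum(abs(es_null) >= abs(es_obs)) + 1) / (nPerm + 1)
```

With this estimator the smallest attainable p-value is 1 / (nPerm + 1), which is why a larger nPerm gives finer resolution at the cost of run time.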
GO semantic similarity can be calculated by GOSemSim [1]. We can use it to cluster genes/proteins based on their functional similarity, and also to measure the similarities among GO terms in order to reduce the redundancy of GO enrichment results.
The annotation package KEGG.db has not been updated since 2012. It is now quite old, so in clusterProfiler enrichKEGG supports downloading the latest online version of KEGG data for enrichment analysis. Using KEGG.db is still supported by explicitly setting the use_internal_data parameter to TRUE, but it is not recommended.
With this new feature, the organism is no longer restricted to those supported in the previous release; it can be any species that has annotation data available in the KEGG database. Users should pass the abbreviation of the scientific name to the organism parameter. The full list of KEGG supported organisms can be accessed via http://www.genome.jp/kegg/catalog/org_list.html.
kk <- enrichKEGG(gene = gene,
                 organism = 'hsa',
                 pvalueCutoff = 0.05)
head(summary(kk))
## ID Description GeneRatio
## hsa04110 hsa04110 Cell cycle 11/84
## hsa04114 hsa04114 Oocyte meiosis 10/84
## hsa03320 hsa03320 PPAR signaling pathway 7/84
## hsa04914 hsa04914 Progesterone-mediated oocyte maturation 6/84
## hsa04115 hsa04115 p53 signaling pathway 5/84
## hsa04062 hsa04062 Chemokine signaling pathway 8/84
## BgRatio pvalue p.adjust qvalue
## hsa04110 124/7086 1.902212e-07 3.195716e-05 3.143656e-05
## hsa04114 123/7086 1.606048e-06 1.349080e-04 1.327103e-04
## hsa03320 72/7086 2.020780e-05 1.131637e-03 1.113201e-03
## hsa04914 98/7086 1.022386e-03 4.294020e-02 4.224067e-02
## hsa04115 69/7086 1.287547e-03 4.326158e-02 4.255682e-02
## hsa04062 187/7086 1.591581e-03 4.456426e-02 4.383827e-02
## geneID Count
## hsa04110 8318/991/9133/890/983/4085/7272/1111/891/4174/9232 11
## hsa04114 991/9133/983/4085/51806/6790/891/9232/3708/5241 10
## hsa03320 4312/9415/9370/5105/2167/3158/5346 7
## hsa04914 9133/890/983/4085/891/5241 6
## hsa04115 9133/6241/983/1111/891 5
## hsa04062 3627/10563/6373/4283/6362/6355/9547/1524 8
kk2 <- gseKEGG(geneList = geneList,
               organism = 'hsa',
               nPerm = 1000,
               minGSSize = 120,
               pvalueCutoff = 0.05,
               verbose = FALSE)
head(summary(kk2))
## ID Description setSize enrichmentScore NES
## hsa04510 hsa04510 Focal adhesion 192 -0.4132272 -1.705104
## hsa05162 hsa05162 Measles 124 0.3894118 1.658474
## hsa05164 hsa05164 Influenza A 158 0.3618215 1.582028
## hsa03013 hsa03013 RNA transport 131 0.4116488 1.751106
## hsa05152 hsa05152 Tuberculosis 161 0.3695479 1.622820
## hsa05203 hsa05203 Viral carcinogenesis 168 0.3506073 1.531070
## pvalue p.adjust qvalues
## hsa04510 0.001388889 0.01843137 0.01155831
## hsa05162 0.003003003 0.01843137 0.01155831
## hsa05164 0.003125000 0.01843137 0.01155831
## hsa03013 0.003174603 0.01843137 0.01155831
## hsa05152 0.003289474 0.01843137 0.01155831
## hsa05203 0.003311258 0.01843137 0.01155831
A KEGG Module is a collection of manually defined functional units. In some situations, KEGG Modules have a more straightforward interpretation.
mkk <- enrichMKEGG(gene = gene,
                   organism = 'hsa')
mkk2 <- gseMKEGG(geneList = geneList,
                 species = 'hsa')
DOSE [5] supports Disease Ontology (DO) Semantic and Enrichment analysis; please refer to the package vignettes. The enrichDO function is very useful for identifying disease associations of interesting genes, and the gseAnalyzer function is designed for gene set enrichment analysis of DO.
ReactomePA [6] uses Reactome as a source of pathway data. The function calls of enrichPathway and gsePathway in ReactomePA are consistent with enrichKEGG and gseKEGG.
clusterProfiler provides enrichment and GSEA analysis with GO, KEGG, DO and Reactome pathways supported internally, but some users may prefer GO and KEGG analysis with DAVID [7] while still being attracted by the visualization methods provided by clusterProfiler. To bridge the gap between DAVID and clusterProfiler, we implemented enrichDAVID. This function queries the enrichment analysis result from the DAVID web server via RDAVIDWebService [8] and stores the result as an enrichResult instance, so that all the visualization functions in clusterProfiler can be used to visualize DAVID results. enrichDAVID is fully compatible with the compareCluster function, so comparing enrichment results from different gene clusters is now available with DAVID.
david <- enrichDAVID(gene = gene,
                     idType = "ENTREZ_GENE_ID",
                     listType = "Gene",
                     annotation = "KEGG_PATHWAY",
                     david.user = "clusterProfiler@hku.hk")
DAVID Web Service has usage limitations; for details, please refer to http://david.abcc.ncifcrf.gov/content.jsp?file=WS.html.
As usage is limited per user, please register and use your own user account to run enrichDAVID.
clusterProfiler supports both the hypergeometric test and gene set enrichment analysis for many ontologies/pathways, but this is still not enough, because users may want to analyze their data with unsupported organisms, slim versions of GO, novel functional annotation (e.g. GO via BlastGO or KEGG via KAAS), unsupported ontologies/pathways or customized annotations.
clusterProfiler provides the enricher function for the hypergeometric test and the GSEA function for gene set enrichment analysis; both are designed to accept user-defined annotation. They accept two additional parameters, TERM2GENE and TERM2NAME. As the parameter names indicate, TERM2GENE is a data.frame with term ID in the first column and the corresponding mapped gene in the second, and TERM2NAME is a data.frame with term ID in the first column and the corresponding term name in the second. TERM2NAME is optional.
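As a minimal sketch of the two tables described above, the following base R code builds a TERM2GENE and a TERM2NAME data.frame by hand. The term IDs, term names, gene IDs and column names are made up for illustration; only the column order follows the description in the text.

```r
# Hypothetical custom annotation; term IDs, names and gene IDs are made up.
term2gene <- data.frame(
  term = c("SET1", "SET1", "SET1", "SET2", "SET2"),
  gene = c("1001", "1002", "1003", "1002", "1004"),
  stringsAsFactors = FALSE
)
term2name <- data.frame(
  term = c("SET1", "SET2"),
  name = c("first toy gene set", "second toy gene set"),
  stringsAsFactors = FALSE
)

# Sanity checks that are cheap to run before an enrichment call:
# two columns, no duplicated term-gene pairs, and a name for every term.
stopifnot(ncol(term2gene) == 2, !anyDuplicated(term2gene))
stopifnot(all(term2gene$term %in% term2name$term))

# Gene set sizes, useful for spotting very small or very large sets.
set_sizes <- table(term2gene$term)

# With clusterProfiler loaded, the tables would be passed along these lines:
# res <- enricher(gene, TERM2GENE = term2gene, TERM2NAME = term2name)
```

A gene may appear under several terms (here "1002"), which is expected; what should not occur is the same term-gene pair listed twice.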
An example of using enricher and GSEA to analyze DisGeNet annotation is presented in the post “use clusterProfiler as an universal enrichment analysis tool”.
Users can use the enricher and GSEA functions to analyze gene set collections downloaded from the Molecular Signatures Database (MSigDb). clusterProfiler provides a function, read.gmt, to parse a gmt file into a TERM2GENE data.frame that is ready for both the enricher and GSEA functions.
gmtfile <- system.file("extdata", "c5.cc.v5.0.entrez.gmt", package="clusterProfiler")
c5 <- read.gmt(gmtfile)
egmt <- enricher(gene, TERM2GENE=c5)
head(summary(egmt))
## ID Description
## SPINDLE SPINDLE SPINDLE
## MICROTUBULE_CYTOSKELETON MICROTUBULE_CYTOSKELETON MICROTUBULE_CYTOSKELETON
## CYTOSKELETAL_PART CYTOSKELETAL_PART CYTOSKELETAL_PART
## SPINDLE_MICROTUBULE SPINDLE_MICROTUBULE SPINDLE_MICROTUBULE
## MICROTUBULE MICROTUBULE MICROTUBULE
## CYTOSKELETON CYTOSKELETON CYTOSKELETON
## GeneRatio BgRatio pvalue p.adjust
## SPINDLE 11/82 39/5270 7.667674e-12 5.060665e-10
## MICROTUBULE_CYTOSKELETON 16/82 152/5270 8.449298e-10 2.788268e-08
## CYTOSKELETAL_PART 15/82 235/5270 2.414879e-06 5.083064e-05
## SPINDLE_MICROTUBULE 5/82 16/5270 3.080645e-06 5.083064e-05
## MICROTUBULE 6/82 32/5270 7.740446e-06 1.021739e-04
## CYTOSKELETON 16/82 367/5270 1.308357e-04 1.439193e-03
## qvalue
## SPINDLE 4.035618e-10
## MICROTUBULE_CYTOSKELETON 2.223499e-08
## CYTOSKELETAL_PART 4.053480e-05
## SPINDLE_MICROTUBULE 4.053480e-05
## MICROTUBULE 8.147838e-05
## CYTOSKELETON 1.147682e-03
## geneID
## SPINDLE 991/9493/9787/22974/983/332/3832/7272/9055/6790/24137
## MICROTUBULE_CYTOSKELETON 991/9493/9133/7153/9787/22974/4751/983/332/3832/7272/9055/6790/24137/4137/7802
## CYTOSKELETAL_PART 991/9493/7153/9787/22974/4751/983/332/3832/7272/9055/6790/24137/4137/7802
## SPINDLE_MICROTUBULE 983/332/3832/9055/24137
## MICROTUBULE 983/332/3832/9055/24137/4137
## CYTOSKELETON 991/9493/9133/7153/9787/22974/4751/983/332/3832/7272/9055/6790/24137/4137/7802
## Count
## SPINDLE 11
## MICROTUBULE_CYTOSKELETON 16
## CYTOSKELETAL_PART 15
## SPINDLE_MICROTUBULE 5
## MICROTUBULE 6
## CYTOSKELETON 16
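For readers unfamiliar with the gmt format, it is a tab-delimited text file with one gene set per line: the set name, a description, then the member genes. The base R sketch below shows how such a file maps onto a two-column term-to-gene table. The helper parse_gmt, the column names and the file contents are hypothetical, and this is not the implementation of read.gmt.

```r
# Write a tiny gmt file to a temporary location; set names and genes are made up.
gmt_path <- tempfile(fileext = ".gmt")
writeLines(c("SET_A\tdescription A\t991\t983\t332",
             "SET_B\tdescription B\t983\t4137"),
           gmt_path)

# Parse each line: field 1 is the set name, field 2 a description,
# and the remaining fields are the member genes.
parse_gmt <- function(path) {
  fields <- strsplit(readLines(path), "\t", fixed = TRUE)
  do.call(rbind, lapply(fields, function(f) {
    genes <- f[-(1:2)]
    data.frame(term = rep(f[1], length(genes)), gene = genes,
               stringsAsFactors = FALSE)
  }))
}

term2gene <- parse_gmt(gmt_path)
```

The result has one row per term-gene pair, which is the long format that the TERM2GENE parameter expects.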
Functional analysis using NGS data (e.g., RNA-Seq and ChIP-Seq) can be performed by linking coding and non-coding regions to coding genes via the ChIPseeker [9] package, which can annotate genomic regions to their nearest genes, host genes and flanking genes, respectively. In addition, it provides a function, seq2gene, that simultaneously considers host genes, promoter regions and flanking genes from intergenic regions that may be under cis-regulatory control. This function maps genomic regions to genes in a many-to-many manner and facilitates functional analysis. For more details, please refer to ChIPseeker.
The function calls of groupGO, enrichGO, enrichKEGG, enrichDO and enrichPathway are consistent, and all of their output can be visualized by bar plot, enrichment map and category-gene-network plot. It is very common to visualize enrichment results as a bar or pie chart. We believe the pie chart is misleading and only provide the bar chart.
barplot(ggo, drop=TRUE, showCategory=12)
barplot(ego, showCategory=8)
The enrichment map can be visualized by enrichMap, which supports results obtained from both the hypergeometric test and gene set enrichment analysis.
enrichMap(ego)
To account for the biological complexity in which a gene may belong to multiple annotation categories, and to provide information on numeric changes if available, we developed the cnetplot function to extract these complex associations.
cnetplot(ego, categorySize="pvalue", foldChange=geneList)
cnetplot(kk, categorySize="geneNum", foldChange=geneList)
## plotGOgraph
plotGOgraph, which is based on topGO, accepts the output of enrichGO and visualizes the induced graph of the enriched GO terms.
plotGOgraph(ego)
## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 31
## Number of Edges = 52
##
## $complete.dag
## [1] "A graph with 31 nodes."
The running score of gene set enrichment analysis and its association with phenotype can be visualized by gseaplot.
gseaplot(kk2, geneSetID = "hsa04145")
clusterProfiler users can also use the pathview function from the pathview [10] package to visualize KEGG pathways.
The following example illustrates how to visualize the “hsa04110” pathway, which was enriched in our previous analysis.
library("pathview")
hsa04110 <- pathview(gene.data = geneList,
                     pathway.id = "hsa04110",
                     species = "hsa",
                     limit = list(gene=max(abs(geneList)), cpd=1))
For further information, please refer to the vignette of pathview [10].
clusterProfiler was developed for biological theme comparison [2], and it provides a function, compareCluster, to automatically calculate enriched functional categories for each gene cluster.
data(gcSample)
lapply(gcSample, head)
## $X1
## [1] "4597" "7111" "5266" "2175" "755" "23046"
##
## $X2
## [1] "23450" "5160" "7126" "26118" "8452" "3675"
##
## $X3
## [1] "894" "7057" "22906" "3339" "10449" "6566"
##
## $X4
## [1] "5573" "7453" "5245" "23450" "6500" "4926"
##
## $X5
## [1] "5982" "7318" "6352" "2101" "8882" "7803"
##
## $X6
## [1] "5337" "9295" "4035" "811" "23365" "4629"
##
## $X7
## [1] "2621" "2665" "5690" "3608" "3550" "533"
##
## $X8
## [1] "2665" "4735" "1327" "3192" "5573" "9528"
The input for the geneCluster parameter should be a named list of gene IDs. To speed up the compilation of this document, we set use_internal_data = TRUE.
ck <- compareCluster(geneCluster = gcSample, fun = "enrichKEGG")
head(summary(ck))
## Cluster ID Description GeneRatio BgRatio
## 1 X2 hsa04110 Cell cycle 18/348 124/7086
## 2 X2 hsa05340 Primary immunodeficiency 8/348 37/7086
## 3 X2 hsa05200 Pathways in cancer 35/348 397/7086
## 4 X2 hsa04064 NF-kappa B signaling pathway 13/348 93/7086
## 5 X3 hsa04512 ECM-receptor interaction 9/166 82/7086
## 6 X4 hsa04110 Cell cycle 20/374 124/7086
## pvalue p.adjust qvalue
## 1 3.076916e-05 0.007969212 0.007546541
## 2 3.441453e-04 0.036613844 0.034671919
## 3 4.827988e-04 0.036613844 0.034671919
## 4 5.654648e-04 0.036613844 0.034671919
## 5 1.157498e-04 0.026043698 0.024246530
## 6 6.416823e-06 0.001423960 0.001174818
## geneID
## 1 991/1869/890/1871/701/990/10926/9088/8317/9700/9134/1029/2810/699/11200/23594/8555/4173
## 2 100/6891/3932/973/916/925/958/64421
## 3 3675/1956/1869/324/3480/1871/113/1902/2261/1909/637/355/5888/9134/5915/3908/2246/5154/7704/4437/1029/185/7187/3551/3479/332/5733/330/6654/1288/5914/405/54583/2122/6772
## 4 4067/3383/7128/3932/5971/4050/6850/7187/3551/10892/5588/330/958
## 5 7057/3339/1299/3695/1101/3679/3910/3696/3693
## 6 6500/9184/4172/994/4175/4171/1387/10274/8697/902/4616/5591/4176/8881/7043/983/1022/1028/891/4173
## Count
## 1 18
## 2 8
## 3 35
## 4 13
## 5 9
## 6 20
compareCluster also supports passing a formula (the code to support formulas was contributed by Giovanni Dall’Olio) of type \(Entrez \sim group\) or \(Entrez \sim group + othergroup\).
## formula interface
mydf <- data.frame(Entrez=c('1', '100', '1000', '100101467',
                            '100127206', '100128071'),
                   group = c('A', 'A', 'A', 'B', 'B', 'B'),
                   othergroup = c('good', 'good', 'bad', 'bad',
                                  'good', 'bad'))
xx.formula <- compareCluster(Entrez~group, data=mydf, fun='groupGO', OrgDb='org.Hs.eg.db')
head(summary(xx.formula))
## Cluster group ID Description Count GeneRatio geneID
## 1 A A GO:0016020 membrane 2 2/3 100/1000
## 2 A A GO:0005576 extracellular region 3 3/3 1/100/1000
## 3 A A GO:0005623 cell 2 2/3 100/1000
## 4 A A GO:0009295 nucleoid 0 0/3
## 5 A A GO:0019012 virion 0 0/3
## 6 A A GO:0030054 cell junction 2 2/3 100/1000
## formula interface with more than one grouping variable
xx.formula.twogroups <- compareCluster(Entrez~group+othergroup,
                                       data=mydf, fun='groupGO', OrgDb='org.Hs.eg.db')
head(summary(xx.formula.twogroups))
## Cluster group othergroup ID Description Count GeneRatio
## 1 A.bad A bad GO:0016020 membrane 1 1/1
## 2 A.bad A bad GO:0005576 extracellular region 1 1/1
## 3 A.bad A bad GO:0005623 cell 1 1/1
## 4 A.bad A bad GO:0009295 nucleoid 0 0/1
## 5 A.bad A bad GO:0019012 virion 0 0/1
## 6 A.bad A bad GO:0030054 cell junction 1 1/1
## geneID
## 1 1000
## 2 1000
## 3 1000
## 4
## 5
## 6 1000
We can visualize the result using the plot method.
plot(ck)
By default, only the top 5 (most significant) categories of each cluster are plotted. Users can change the parameter showCategory to specify how many categories of each cluster are plotted; if showCategory is set to NULL, the whole result will be plotted.
The plot function accepts a parameter by for setting the scale of dot sizes. The default value of by is “geneRatio”, which corresponds to the “GeneRatio” column of the output. If it is set to count, the comparison will be based on gene counts, while if it is set to rowPercentage, the dot sizes will be normalized by count/(sum of each row).
To provide the full information, we also show the number of identified genes in each category (numbers in parentheses) when by is set to rowPercentage, and the number of genes in each cluster, in the cluster label (numbers in parentheses), when by is set to geneRatio, as shown in Figure 3. If the dot sizes are based on count, the row numbers will not be shown.
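The three scalings can be illustrated with a small base R calculation. The count matrix and cluster sizes below are hypothetical, and the code reflects one reading of the description above (rows are categories, columns are gene clusters); it does not reproduce the internals of the plot method.

```r
# Hypothetical gene counts: rows are categories, columns are gene clusters.
counts <- matrix(c(18, 20, 5,
                    8,  0, 2),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Cell cycle", "Primary immunodeficiency"),
                                 c("X2", "X4", "X5")))

# Hypothetical number of annotated genes per cluster (GeneRatio denominators).
cluster_size <- c(X2 = 348, X4 = 374, X5 = 100)

# by = "count": raw counts are used as they are.
by_count <- counts

# by = "geneRatio": each count is divided by the size of its cluster.
by_gene_ratio <- sweep(counts, 2, cluster_size, "/")

# by = "rowPercentage": each count is divided by the sum of its row.
by_row_percentage <- counts / rowSums(counts)
```

With rowPercentage every row sums to 1, so dot sizes show how a category's genes are distributed across clusters, whereas geneRatio shows how much of each cluster falls into a category.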
The p-values indicate which categories are more likely to have biological meaning. The dots in the plot are color-coded by their corresponding p-values. The color gradient from red to blue corresponds to increasing p-values: red indicates low p-values (high enrichment) and blue indicates high p-values (low enrichment). Results are filtered on p-values and adjusted p-values by the threshold given in the parameter pvalueCutoff, and the FDR can be estimated by qvalue.
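As a rough sketch of the filtering and color coding just described, the following base R code adjusts some made-up p-values, applies a cutoff, and maps the surviving values onto a red-to-blue gradient. The Benjamini-Hochberg adjustment, the cutoff of 0.05, and the linear color scale are assumptions chosen for illustration; the package may handle each of these differently.

```r
# Invented raw p-values for five categories.
pvalue <- c(T1 = 0.0001, T2 = 0.004, T3 = 0.02, T4 = 0.2, T5 = 0.6)

# Assumption: Benjamini-Hochberg adjustment for multiple testing.
padj <- p.adjust(pvalue, method = "BH")

# Keep categories whose raw and adjusted p-values both pass the cutoff.
pvalueCutoff <- 0.05
keep <- pvalue <= pvalueCutoff & padj <= pvalueCutoff
kept <- padj[keep]

# Map the smallest retained value to red and the largest to blue.
ramp <- colorRamp(c("red", "blue"))
scaled <- (kept - min(kept)) / (max(kept) - min(kept))
cols <- rgb(ramp(scaled), maxColorValue = 255)
names(cols) <- names(kept)
print(cols)
```

Here T4 and T5 are dropped by the cutoff, and among the remaining categories the most significant one is drawn in pure red and the least significant one in pure blue.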
Users can refer to the example in [2], where we analyzed the publicly available expression dataset of breast tumour tissues from 200 patients (GSE11121, Gene Expression Omnibus) [11]. We identified 8 gene clusters from differentially expressed genes and used compareCluster to compare these gene clusters by their enriched biological processes.
Another example is shown in [12]: we calculated functional similarities among viral miRNAs using the method described in [13], and compared significant KEGG pathways regulated by different viruses using compareCluster.
The comparison function was designed as a framework for comparing gene clusters with any kind of ontology association: not only groupGO, enrichGO, enrichKEGG and enricher provided in this package, but also other biological and biomedical ontologies. For instance, enrichDO from DOSE [5] and enrichPathway from ReactomePA work fine with compareCluster for comparing biological themes from the disease and Reactome pathway perspectives. More details can be found in the vignettes of DOSE [5] and ReactomePA.
More documentation can be found at http://www.bioconductor.org/packages/DOSE, http://www.bioconductor.org/packages/ReactomePA and http://guangchuangyu.github.io/tags/clusterprofiler.
If you have any bug reports or feature requests, let me know.
Here is the output of sessionInfo() on the system on which this document was compiled:
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid parallel stats4 stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] Rgraphviz_2.16.0 graph_1.50.0 SparseM_1.7
## [4] clusterProfiler_3.0.5 org.Hs.eg.db_3.3.0 AnnotationHub_2.4.2
## [7] GO.db_3.3.0 AnnotationDbi_1.34.4 IRanges_2.6.1
## [10] S4Vectors_0.10.3 Biobase_2.32.0 BiocGenerics_0.18.0
## [13] DOSE_2.10.7 BiocStyle_2.0.3
##
## loaded via a namespace (and not attached):
## [1] qvalue_2.4.2 reshape2_1.4.1
## [3] splines_3.3.1 lattice_0.20-33
## [5] colorspace_1.2-6 htmltools_0.3.5
## [7] yaml_2.1.13 interactiveDisplayBase_1.10.3
## [9] XML_3.98-1.4 DBI_0.5
## [11] topGO_2.24.0 matrixStats_0.50.2
## [13] plyr_1.8.4 stringr_1.0.0
## [15] munsell_0.4.3 GOSemSim_1.30.3
## [17] gtable_0.2.0 evaluate_0.9
## [19] labeling_0.3 knitr_1.14
## [21] httpuv_1.3.3 BiocInstaller_1.22.3
## [23] curl_1.2 GSEABase_1.34.0
## [25] Rcpp_0.12.6 xtable_1.8-2
## [27] scales_0.4.0 formatR_1.4
## [29] DO.db_2.9 annotate_1.50.0
## [31] mime_0.5 ggplot2_2.1.0
## [33] digest_0.6.10 stringi_1.1.1
## [35] shiny_0.13.2 tools_3.3.1
## [37] magrittr_1.5 RSQLite_1.0.0
## [39] tibble_1.1 tidyr_0.6.0
## [41] assertthat_0.1 rmarkdown_1.0
## [43] httr_1.2.1 R6_2.1.2
## [45] igraph_1.0.1
1. Yu, G. et al. GOSemSim: An R package for measuring semantic similarity among GO terms and gene products. Bioinformatics 26, 976–978 (2010).
2. Yu, G., Wang, L.-G., Han, Y. & He, Q.-Y. clusterProfiler: An R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology 16, 284–287 (2012).
3. Boyle, E. I. et al. GO::TermFinder–open source software for accessing gene ontology information and finding significantly enriched gene ontology terms associated with a list of genes. Bioinformatics (Oxford, England) 20, 3710–3715 (2004).
4. Subramanian, A. et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America 102, 15545–15550 (2005).
5. Yu, G., Wang, L.-G., Yan, G.-R. & He, Q.-Y. DOSE: An R/Bioconductor package for Disease Ontology semantic and enrichment analysis. Bioinformatics 31, 608–609 (2015).
6. Yu, G. & He, Q.-Y. ReactomePA: An R/Bioconductor package for Reactome pathway analysis and visualization. Mol. BioSyst. 12, 477–479 (2016).
7. Huang, D. et al. The DAVID gene functional classification tool: A novel biological module-centric algorithm to functionally analyze large gene lists. Genome Biology 8, R183 (2007).
8. Fresno, C. & Fernández, E. A. RDAVIDWebService: A versatile R interface to DAVID. Bioinformatics 29, 2810–2811 (2013).
9. Yu, G., Wang, L.-G. & He, Q.-Y. ChIPseeker: An R/Bioconductor package for ChIP peak annotation, comparison and visualization. Bioinformatics 31, 2382–2383 (2015).
10. Luo, W. & Brouwer, C. Pathview: An R/Bioconductor package for pathway-based data integration and visualization. Bioinformatics 29, 1830–1831 (2013).
11. Schmidt, M. et al. The humoral immune system has a key prognostic impact in node-negative breast cancer. Cancer Research 68, 5405–5413 (2008).
12. Yu, G. & He, Q. Functional similarity analysis of human virus-encoded miRNAs. Journal of Clinical Bioinformatics 1, 15 (2011).
13. Yu, G. et al. A new method for measuring functional similarity of microRNAs. Journal of Integrated OMICS 1, 49–54 (2011).