Proper way(s) to perform enrichment analysis in R
1
0
Entering edit mode
26 days ago
cwwong13 • 0

I am not sure what is the proper way to carry out over-representation analysis (and also gene set enrichment analysis) for RNAseq data. Ideally, the analysis can be performed in R, otherwise, if the software/ platform can export the output file (also include all the non-statistical-significant term) will also be sufficient.

I come up with this question is mainly because I recently discovered a website: https://maayanlab.cloud/Enrichr/

I found that this website host a list of pathway/ ontology databases (quite comprehensively). However, the p-value calculated there cannot be reproduced with two other methods I previously used: using the R package clusterProfiler and the web of g:Profiler. Even more surprising to me, the results from clusterProfiler and g:Profiler are also not the same, i.e. I got three distinct results from three analysis platforms while I am using the same gene list, and same gmt file.

For reproducibility purposes (only the R section), here are the codes I used for clusterProfiler (I just recently upgraded to the latest version (v4.0.5):

interests <- c("TNF", "PTGS2", "ZFP36", "GHRL", "CDH6", "TNFRSF1A", "IL6", "CXCL8", "CD27", "PTHLH", "IL1B", "CCL2", "PPARA", "CCL3", "AGT", "CCL5", "IL10", "ICAM1", "EDN1", "CAT", "CXCL10", "NOS2", "COL3A1",  "IL1RN", "TIMP1", "FABP4", "MMP3", "RELA", "COL1A1", "VCAM1", "CALCA", "CASP3", "TGFB1", "SERPINE1", "IL18", "IGF1", "VEGFA", "SLC22A1", "SOD2", "LPL", "NOS3", "PTGS1", "IFNG", "FN1", "MMP8")
wp <- read.gmt(url("https://maayanlab.cloud/Enrichr/geneSetLibrary?mode=text&libraryName=GO_Cellular_Component_2021"))
ewp <- enricher(interests, TERM2GENE=wp[,c("term", "gene")], TERM2NAME=wp[,c("term", "term")])
head(ewp)

Then, the most significant term is platelet alpha granule lumen (GO:0031093) with pvalue = 1.658067e-8, p.adjust = 8.456139e-7, and qvalue = 6.283200e-7

I then use the same gene list and query against the g:Profiler by manually uploading the same gmt file. The platelet alpha granule lumen (GO:0031093) has p.adjust = 1.642×10-6.

Finally, the original, enrichr returns the pavlue = 9.476e-9 and p.adjust = 5.780e-7.

Does anyone know what is the underlying cause of the differences? I tried to look up the underlying formula for the clusterProfiler, and find that it uses

pvalues <- apply(args.df, 1, function(n) phyper(n[1], n[2], n[3], n[4], lower.tail = FALSE))

More questions are thus raised:

  1. Should we use a two-tail test for the null hypothesis? I read the article Isabelle Rivals, et al mentioned we probably should not use a one-tail test.
  2. Assuming the one-tail test is justified, how can enrichr get a much lower p-value (I assume maybe the g:Profiler used the two-tail test, but not sure where to check their internal algorithm)?
  3. enrichr stated that they use fisher's exact test, but the results are obviously off. I am not sure why.
  4. Generally, do you have advice on where to perform the GSEA instead of the over-representation analysis I did here.
GSEA analysis enrichment R RNAseq • 350 views
ADD COMMENT
1
Entering edit mode

4.Generally, do you have advice on where to perform the GSEA instead of the over-representation analysis I did here.

Check this post: Gene Set Enrichment Analysis

ADD REPLY
3
Entering edit mode
26 days ago

If you look only for enrichment a one-tail should be fine (and actually correct), you would need two-tail if you also for test for "depletion". The tests may differ because they use either Fisher's exact or the Chi-squared test (but they should become similar if N is large). Also how they define the "background" is very important and very much affects the p-value.

To perform GSEA, I use the fGSEA (fast GSEA) R package. It is much-much faster than the original GSEA implementation of the Broad. It's a great package and the results are exactly the same as the original.

ADD COMMENT
0
Entering edit mode

One follow-up question is: do you think the gene set size also matters? I find that many software has the default set size of 10 - 500 genes per set. I am not sure whether that also needs to be optimized.

ADD REPLY
0
Entering edit mode

I find 10-500 reasonable. There is no "optimum" value. Sets that are too small may become unreliable, in most cases the p-value will reflect that but it is mostly unnecessary computation. Also sets that are greater than 500 become very generic and do not say much.

ADD REPLY

Login before adding your answer.

Traffic: 2070 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6