Question

Thousands of significant differentially expressed genes but no significant enriched pathways by GSEA?

0

Entering edit mode

4.4 years ago

88adavis • 0

Hello everyone,

I've recently performed a GSEA analysis (using fGSEA in R) on an RNA-seq data set, consisting of 4 pairwise comparisons: 1.) drug 1 vs. untreated, 2.) drug 2 vs. untreated, and 3.) drug 1 + 2 vs. untreated).

Using edgeR, I found that there were ~2000 differentially expressed genes (DEGs) for drug 1 vs. untreated, ~7000 DEGs for drug 2 vs. untreated, and >9000 genes for drug 1+2 vs. untreated. After ranking the full list of genes in each of the 3 signatures (by the following formula: -log10(adjusted p-value)/sign(logFC)), I performed GSEA using fGSEA (v. 1.10.1) in R, against the Human GO and Reactome databases (~3700 pathways). After applying a Benjamini-Hochberg p-value adjustment, I obtained the following number of ‘significant’ gene sets/pathways (adj. p < 0.05): 1.) 70 for drug 1 vs. no tx, 2.) 0 for drug 2 vs. no tx and 3.) 106 for drug 1 + 2.

I find it strange that there aren’t any ‘significant’ gene sets obtained for the ‘drug 2 vs. no tx’ signature, given the large number of DEGs found. In fact, this signature shared more than 5000 significant DEGs (with same sign(logFC) direction) with the combo drug 1 + 2 treatment.

I have 2 questions: 1.) Is it possible to have no significant pathways despite having such a large fraction of the transcriptome being differentially expressed? If so, what conditions would account for that? 2.) I ranked my genes using -log10(adjusted p-value)/sign(logFC) (instead of raw p-value, as noted here), as I only had access to logFC and B-H adjusted p-values for each of the gene signatures. Is it appropriate to rank on adj. p-value? Could this account for the few enriched pathways observed for ‘drug 2 vs. no tx’?

Thanks, Andrew

RNA-Seq gsea statistics adjusted p-value • 4.1k views

ADD COMMENT • link updated 4.4 years ago by alserg ▴ 920 • written 4.4 years ago by 88adavis • 0

score 2 · Answer 1 · 2019-11-19

1.) Is it possible to have no significant pathways despite having such a large fraction of the transcriptome being differentially expressed? If so, what conditions would account for that?

Yes, absolutely. Let's say, for easy math, that your species has 1000 genes and contains a pathway consisting of 10 of those genes. You do an experiment and find 100 DEGs. If you were to randomly choose 100 genes, you would expect 1 of these on average to be part of that pathway. So if your DEGs include, say, 7 of those 10 genes, that's clearly significant. Even 3 or 4 might be good enough for significance. But if you have 500 DEGs, now a random sampling of 500 genes is going to have 5 of the 10 genes from that pathway in it. It's much harder to get a significant enrichment for that pathway now. If your DEG set contains 6 or 7 of those genes, that might not be significant.

You could try a more stringent significance cut-off for your DEGs, or look for something like a batch effect or other technical bias that could be causing so many DEGs. Do you expect the drugs you're testing to have such a huge effect on gene expression?

score 1 · Answer 2 · 2019-11-20

1

Entering edit mode

4.4 years ago

Kristoffer Vitting-Seerup ★ 4.0k

Did you try a functional-class scorring gene-set analysis? Aka one that does not rely on the cutoff on you previous DE analysis. Since you are using edgeR you can do that directly with the camera() function (technically it is from limma - but edgeR loads limma so it is redily available and accepts edgeR objects).

ADD COMMENT • link 4.4 years ago by Kristoffer Vitting-Seerup ★ 4.0k

score 1 · Answer 3 · 2019-11-28

1

Entering edit mode

4.4 years ago

alserg ▴ 920

What value of nperm did you use for fgsea? Low values, like 1000, or sometimes even 10000 will limit the minial possible nominal P-value, which after BH-correction can lead to non-significant results.

ADD COMMENT • link 4.4 years ago by alserg ▴ 920

0

Entering edit mode

I've upgraded to using fgseaMultilevel (fgsea v. 1.14) on my DEG lists (each of which contain 2000-3000 significant DEGs (adjP<0.05 each)) and am getting few, if any significantly enriched gene sets. Originally I was testing against ~5500 gene sets from REACTOME, KEGG, and GO, so I thought I may be heavily penalized by testing against so many gene sets, so I reduced the number by decreasing "maxsize" to 100, and only running against REACTOME, but the results have not really improved.

I also tried fgseaSimple with nperm as high as 100,000 and that actually made things "worse".

ADD REPLY • link 3.5 years ago by 88adavis • 0

1

Entering edit mode

The number of DEGs doesn't correlate that well with the number of pathways. From what we saw, sometimes when you have some disruption of gene expression machinery, like CTCF or chromatin, there are many deferentially expressed genes, but it's not specific to any particular pathway.

ADD REPLY • link 3.5 years ago by alserg ▴ 920

0

Entering edit mode

That makes sense and is aligned with some of our ideas. My treatments include various retinoids, so I would have expected (at least) some significant enrichment for retinoic acid-specific transcriptional programs. Would a loosening of our p-value (adjP) cut-offs be appropriate (e.g. adjP<0.2)?

ADD REPLY • link 3.5 years ago by 88adavis • 0

0

Entering edit mode

No, I don't recommend to loos P-value threshold, unless you do some kind of experimental validation follow-up. GSEA is known to be too sensitive, so if anything it make sense to use more strict threshold.

You can try to check out some other pathway databases. Just try to upload your regulated genes (one direction at a time) into Enrichr (https://maayanlab.cloud/Enrichr/). I had an experience with Vitamin D receptor, which as far as I understand should be similar, and we didn't have much hits in Reactome/KEGG/GO, but there was a nice pathway in WikiPathway and some chip-seq based pathways from ChEA.

ADD REPLY • link 3.5 years ago by alserg ▴ 920