Too many enriched GO terms using Goseq
11 months ago
tianshenbio ▴ 90

I have a genome of 22129 genes and a list of 2905 DE genes. I used goseq to perform GO enrichment analysis but got more than 500 significantly enriched GO terms (p<0.05). How can I get a manageable number of enriched GO terms? Is it because the number of DE genes is too large?

Here's how I perform the enrichment analysis (genes.bias.data holds the gene lengths used for the bias correction):

pwf <- nullp(gene.data, bias.data = genes.bias.data, plot.fit = FALSE)
GO.wall <- goseq(pwf, gene2cat = gene2go_data, method = "Wallenius", use_genes_without_cat = FALSE)
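
A note on the p<0.05 cutoff: the over_represented_pvalue column that goseq returns is not corrected for multiple testing, so with thousands of categories tested a raw cutoff can let many terms through. Below is a minimal sketch of a Benjamini-Hochberg adjustment using base R's p.adjust(). The GO.wall table here is a small mock with made-up values standing in for the real result, and the padj column name is my own choice.

```r
# Mock stand-in for a goseq result; IDs and p-values are made up for illustration.
GO.wall <- data.frame(
  category = c("GO:0000001", "GO:0000002", "GO:0000003",
               "GO:0000004", "GO:0000005", "GO:0000006"),
  over_represented_pvalue = c(0.0001, 0.004, 0.03, 0.04, 0.2, 0.6),
  stringsAsFactors = FALSE
)

# Benjamini-Hochberg adjustment across all tested categories
# ("padj" is a column name chosen for this sketch)
GO.wall$padj <- p.adjust(GO.wall$over_represented_pvalue, method = "BH")

# Categories passing the raw cutoff versus the adjusted cutoff
n_raw <- sum(GO.wall$over_represented_pvalue < 0.05)
enriched <- GO.wall[GO.wall$padj < 0.05, ]
```

On real data, comparing n_raw with nrow(enriched) shows how much of the long list survives the correction.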

goseq RNA-Seq enrichment GO • 433 views

It is indeed probably due to your list of DEGs being large. You may also notice that the top of your enrichment list is populated by broad GO terms covering hundreds of genes, because large categories have more statistical power to be detected as enriched. One thing you could do is summarize your list with a tool such as REViGO. You input your list of enriched GO terms (with their p-values) and it collapses semantically redundant categories, giving you a smaller and more manageable view of the affected pathways.
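
As a rough sketch of preparing that input: the goseq result table has a numInCat column (genes per category), so you can see which terms are very broad, and REViGO accepts a plain list of GO IDs with a p-value next to each. The table below is a mock with made-up numbers, and the size cutoff max_size is a hypothetical, optional step that was not part of the suggestion above.

```r
# Mock stand-in for a goseq result table; p-values and sizes are made up.
GO.wall <- data.frame(
  category = c("GO:0008150", "GO:0006412", "GO:0006096", "GO:0009987"),
  over_represented_pvalue = c(1e-10, 2e-5, 3e-4, 1e-8),
  numInCat = c(15000, 320, 45, 12000),
  stringsAsFactors = FALSE
)

# Hypothetical cutoff for "very broad" terms; adjust or skip as needed
max_size <- 500
specific <- GO.wall[GO.wall$numInCat <= max_size, ]

# Two-column table (GO ID, p-value), tab-separated, no header
revigo_input <- specific[, c("category", "over_represented_pvalue")]
out_file <- tempfile(fileext = ".txt")
write.table(revigo_input, out_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
```

The contents of out_file can then be pasted into the REViGO input box.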

(Additionally, if you have many DEGs, you could filter them by fold change and retain only the largest changes to get a smaller list.)

Thank you for your suggestion. I may consider using padj<0.001 and log2FC>1 to filter my DE genes.

Keep in mind that your log2FC values will be positive or negative (up-regulation or down-regulation, depending on how you specified the contrast). So filter on the absolute value of log2FC: >1 means more than a 2x increase (FC > 2) and <-1 means more than a 2x decrease (FC < 0.5).
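
A minimal sketch of that absolute-value filter, assuming a results table with DESeq2-style column names (log2FoldChange, padj). The table, the gene names, and the 0.05 cutoff are all made up for illustration.

```r
# Mock DE results; column names follow DESeq2 conventions (an assumption).
res <- data.frame(
  gene = paste0("gene", 1:6),
  log2FoldChange = c(2.3, -1.8, 0.4, -0.2, 1.1, -3.0),
  padj = c(0.001, 0.01, 0.03, 0.2, 0.04, NA),
  stringsAsFactors = FALSE
)

# Filter on |log2FC| so strong down-regulated genes are kept too;
# genes with NA padj are dropped explicitly.
keep <- !is.na(res$padj) & res$padj < 0.05 & abs(res$log2FoldChange) > 1
de_genes <- res$gene[keep]

# For comparison: a one-sided filter silently loses the down-regulated genes
one_sided <- res$gene[!is.na(res$padj) & res$padj < 0.05 & res$log2FoldChange > 1]
```

Here gene2 passes the absolute-value filter but would be missed by log2FC>1 alone.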

On a side note, if you haven't tried it yet, you could separate up-regulated and down-regulated genes. This will also result in smaller lists of DEGs, which will "clean" your GO results. Nonetheless, all of this depends on your biological question at hand.
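
Splitting by direction could look something like the sketch below. nullp() takes a named 0/1 vector over all assayed genes, so each direction gets its own vector. The results table, its DESeq2-style column names, and the thresholds are assumptions for illustration.

```r
# Mock DE results; column names follow DESeq2 conventions (an assumption).
res <- data.frame(
  gene = paste0("gene", 1:6),
  log2FoldChange = c(2.3, -1.8, 0.4, -0.2, 1.1, -3.0),
  padj = c(0.001, 0.01, 0.03, 0.2, 0.04, 0.002),
  stringsAsFactors = FALSE
)

sig  <- !is.na(res$padj) & res$padj < 0.05
up   <- sig & res$log2FoldChange > 1
down <- sig & res$log2FoldChange < -1

# Named binary vectors (1 = DE, 0 = not DE) covering every gene in the table
gene.data.up   <- setNames(as.integer(up), res$gene)
gene.data.down <- setNames(as.integer(down), res$gene)
```

Each vector can then be passed through nullp() and goseq() separately, in the same way as gene.data in the original question.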

Personally, I believe that fold-change filtering is the more biologically sound choice, which is probably what you're aiming for when doing subsequent pathway enrichment analyses. Once you have the "safety" of your multiple-testing correction (your p-adj) you're OK to go; pushing to a lower p-adj cutoff will probably bias the list toward low-variance genes rather than genes with strong changes.

Thank you for your answer! Actually, I have played around with all these factors; it's just not that easy to decide what the best criteria would be for this kind of analysis...