Question

Trimming redundant gene sets after GSEA analysis

7.5 years ago

fizer ▴ 30

Hi,

I am performing GSEA with p-values and fold changes of genes (Homo sapiens) obtained from RNA-seq data. Is it a good idea to reduce redundant gene sets afterwards? When I plot the GSEA results as a network plot, so many significant terms are drawn that the plot becomes very hard to read. I know that people reduce redundant terms before or after GO over-representation analysis (hypergeometric test), but I am not sure whether it should be done after a GSEA-type analysis. I want to keep the significant term at the more specific level and remove the more general term if the specific term contains >=50% of the general term's genes. Is there any method available? Suggestions please.

GSEA Term Reduction GO analysis • 4.4k views

updated 7.5 years ago by alserg ▴ 920 • written 7.5 years ago by fizer ▴ 30

I am not sure about GSEA results...

But for GO enrichment analysis with goseq, I usually remove terms that are too specific or too general before plotting. I have written an R package called gogadget (gogadget: an R package for go analysis visualization and interpretation), which has a filter function.

But there are more tools available such as REVIGO or GO trimming.

Good luck!

7.5 years ago by Benn 8.3k

score 0 · Answer 1 · 2016-10-14

7.5 years ago

alserg ▴ 920

One of the things I was playing with to reduce redundant gene sets was Bayesian-network-like filtering. There is an example of it here:

The idea is the following. Let's consider two enriched overlapping pathways p1 and p2. First, let's make a hypothesis that p1 is truly enriched and p2 is just piggybacked onto it because of the overlap. You can test this hypothesis by looking at the genes unique to p2, that is setdiff(p2, p1). If these genes are also enriched, then the hypothesis is false and you had better keep p2. You can also check the other way, whether p1 has some unique enrichment compared to p2. By repeating this operation you can come up with a list of uniquely enriched pathways.

This not only removes redundant pathways, it also leaves pathways at the most enriched level, which I found useful. However, there are several arbitrary p-value thresholds involved, so one needs to be a little careful with interpretation. Otherwise it worked pretty well for me.
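
As a rough illustration of the pairwise check described above (this is not the linked script), the sketch below uses a Wilcoxon rank-sum test from base R as a stand-in for a proper GSEA test. The helper names `isEnriched` and `hasUniqueSignal`, the 0.05 cutoff, the `minSize` guard and the toy data are all assumptions made for the example.

```r
# Hypothetical helper: are `genes` shifted relative to the rest of the ranked list?
# A Wilcoxon rank-sum test stands in for a real GSEA test (an assumption).
isEnriched <- function(genes, stats, alpha = 0.05, minSize = 5) {
  genes <- intersect(genes, names(stats))
  if (length(genes) < minSize) return(FALSE)  # too few genes to judge
  inSet <- names(stats) %in% genes
  wilcox.test(stats[inSet], stats[!inSet])$p.value < alpha
}

# Hypothetical helper: does `candidate` keep its signal once the genes it
# shares with `other` are removed, i.e. is setdiff(candidate, other) enriched?
hasUniqueSignal <- function(candidate, other, stats, ...) {
  isEnriched(setdiff(candidate, other), stats, ...)
}

# Toy ranked list: g1..g30 carry the signal, g31..g200 are background.
stats <- setNames(c(seq(3, 4, length.out = 30),
                    seq(-1, 1, length.out = 170)),
                  paste0("g", 1:200))

p1 <- paste0("g", 1:30)               # made up of signal genes only
p2 <- paste0("g", c(11:30, 106:125))  # overlaps p1; its unique genes are background

isEnriched(p2, stats)           # TRUE: p2 looks enriched on its own
hasUniqueSignal(p1, p2, stats)  # TRUE: keep p1
hasUniqueSignal(p2, p1, stats)  # FALSE: p2 is only piggybacking on p1
```

In practice the same comparison would be repeated over all overlapping pairs of significant pathways, and the result depends on the chosen cutoffs, as noted above.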

7.5 years ago by alserg ▴ 920

Hi,

I have a similar issue to the one in the original question: RNA-seq data where a sample has a lot of very significant changes (around 10,000 genes), which I analysed with GSEA only to get hundreds of significantly enriched gene sets. However, I have the feeling most of them are redundant, as their names are very similar and they often share a large number of genes. So here I am, looking for a way to trim redundant gene sets.

I tried your script but unfortunately I can't seem to make it work. I'm still quite a beginner with R, so please bear with me. When I try to run the script as it is (i.e. using the example files in the fgsea package) I get the following error:

> elimRes <- eliminatePathways(universe=names(exampleRanks),
+                              pathways=examplePathways,
+                              pathway2name=pathway2name,
+                              isNonRandomPval=fgseaIsNonRandomPval(examplePathways, exampleRanks, nperm=5000))
Testing pathway '5990980_Cell_Cycle'
Testing pathway '5990979_Cell_Cycle,_Mitotic'
Error in colnamesInt(x, neworder, check_dups = TRUE) : 
  argument specifying columns specify non existing column(s): cols[1]='pathway'

I have searched everywhere and tried a few things, but I don't understand where the issue is. Would you be able to help?

I have also tried to run the script with my own .rnk and .gmt files. In that case I get an error already at the GSEA step:

> fgseaRes <- fgsea(pathways = examplePathways, 
+                   stats = exampleRanks,
+                   minSize=15,
+                   maxSize=500,
+                   nperm=10000)
Error in `[.data.frame`(x, order(x, na.last = na.last, decreasing = decreasing)) : 
  undefined columns selected
In addition: Warning message:
In fgsea(pathways = examplePathways, stats = exampleRanks, minSize = 15,  :
  There are ties in the preranked stats (50% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.

Any suggestions for a way around this?

Thanks

4.0 years ago by MarcoL • 0