Question: Trimming redundant gene sets after GSEA analysis
fizer wrote 4.4 years ago:


I am performing GSEA with p-values and fold changes of genes (Homo sapiens) obtained from RNA-seq data. Is it a good idea to reduce redundant gene sets afterwards? When I plot the GSEA results as a network plot, too many significant terms are plotted and the plot becomes very hard to read. I know that people reduce redundant terms before or after GO over-representation analysis (hypergeometric test), but I am not sure whether it should be done after a GSEA-type analysis. I want to keep the significant term at the more specific level and remove the more general term when the specific term contains >=50% of the general term's genes. Is there any method available? Suggestions please.
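
The overlap rule described above could be sketched in base R roughly as follows. This is only an illustration under assumptions: the helper name `drop_general_sets`, the `min_frac` parameter and the toy gene sets are all made up here, and "more specific" is taken to mean "strictly smaller set", which may not match the GO hierarchy.

```r
# Hypothetical helper: drop a significant set when a smaller significant set
# covers at least `min_frac` of its genes.
# `pathways` is a named list of character vectors holding only significant sets.
drop_general_sets <- function(pathways, min_frac = 0.5) {
  sizes <- lengths(pathways)
  keep <- rep(TRUE, length(pathways))
  names(keep) <- names(pathways)
  for (general in names(pathways)) {
    for (specific in names(pathways)) {
      # only strictly smaller sets are treated as "more specific" (assumption)
      if (sizes[[specific]] >= sizes[[general]]) next
      shared <- length(intersect(pathways[[general]], pathways[[specific]]))
      if (shared / sizes[[general]] >= min_frac) {
        keep[[general]] <- FALSE
        break
      }
    }
  }
  pathways[keep]
}

# Toy gene sets, invented for illustration only
toy_sets <- list(
  cell_cycle = c("A", "B", "C", "D"),
  mitosis    = c("A", "B", "C"),
  apoptosis  = c("X", "Y", "Z")
)
kept <- drop_general_sets(toy_sets)
print(names(kept))
```

In the toy data, `cell_cycle` is dropped because the smaller `mitosis` set covers 3 of its 4 genes. The 50% cutoff is arbitrary and would need tuning for real results.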

modified 4.4 years ago by alserg • written 4.4 years ago by fizer

I am not sure about GSEA results...

But for GO enrichment analysis with goseq, I usually remove the too-specific and too-general terms for plots. I have written an R package called gogadget (gogadget: an R package for GO analysis visualization and interpretation), with a filter function.

But there are more tools available such as REVIGO or GO trimming.

Good luck!

modified 4.4 years ago • written 4.4 years ago by Benn
alserg wrote 4.4 years ago:

One of the things I was playing with to reduce redundant gene sets was Bayesian-network-like filtering. There is an example of it here:

The idea is the following. Let's consider two enriched overlapping pathways p1 and p2. First, let's make a hypothesis that p1 is truly enriched and p2 is just piggybacking on it because of the overlap. You can test this hypothesis by looking at the genes unique to p2, that is setdiff(p2, p1). If these genes are also enriched, then the hypothesis is false and you had better keep p2. You can also check the other way, whether p1 has some unique enrichment compared to p2. By repeating this operation you can come up with a list of uniquely enriched pathways.
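
A minimal base R sketch of that pairwise check is given below. It is not the script referred to above: the enrichment test is replaced by a one-sided Wilcoxon rank-sum test as a simple stand-in, and the names `enrich_pval`, `has_unique_signal`, the `alpha` cutoff and the toy ranks are all assumptions made for illustration.

```r
# Hypothetical stand-in for the enrichment test: one-sided Wilcoxon rank-sum
# test of the genes in `genes` against all other genes in `stats`.
enrich_pval <- function(genes, stats) {
  in_set <- names(stats) %in% genes
  if (sum(in_set) < 3) return(1)  # too few genes to test; treat as not enriched
  wilcox.test(stats[in_set], stats[!in_set], alternative = "greater")$p.value
}

# Keep p2 only if the genes unique to it, setdiff(p2, p1), are still enriched.
has_unique_signal <- function(p2, p1, stats, alpha = 0.05) {
  enrich_pval(setdiff(p2, p1), stats) < alpha
}

# Toy data, invented for illustration: 40 genes ranked from 40 (top) down to 1
stats <- setNames(40:1, paste0("g", 1:40))

p1 <- paste0("g", 1:8)             # sits entirely at the top of the ranking
p2 <- paste0("g", c(5:8, 21:26))   # overlaps p1, unique genes are mid-ranked

keep_p2 <- has_unique_signal(p2, p1, stats)  # tests g21..g26 only
keep_p1 <- has_unique_signal(p1, p2, stats)  # tests g1..g4 only
print(c(p1 = keep_p1, p2 = keep_p2))
```

Here p2 looks enriched only through the genes it shares with p1, so it is dropped, while p1 keeps its own signal. With real data the check would be repeated over all overlapping pairs, and the result depends on the chosen test and cutoff.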

This not only removes redundant pathways, but also leaves pathways at the most enriched level, which I found useful. However, there are several arbitrary thresholds for p-values involved, so one needs to be a little careful with interpretation. Otherwise it worked pretty well for me.

written 4.4 years ago by alserg


I have a similar issue as in the original question: RNAseq data where a sample has a lot of very significant changes (around 10,000 genes) that I analysed with GSEA only to get hundreds of gene sets significantly enriched. However, I have the feeling most of them are just redundant as names are very similar and they often share a large number of genes. So here I am, looking for a way to trim redundant gene sets.

I tried your script but unfortunately I can't seem to make it work. I'm still quite a beginner with R so please bear with me. When I try to run the script as it is (i.e. using the example files in the fgsea package) I get the following error:

> elimRes <- eliminatePathways(universe=names(exampleRanks),
+                              pathways=examplePathways,
+                              pathway2name=pathway2name,
+                              isNonRandomPval=fgseaIsNonRandomPval(examplePathways, exampleRanks, nperm=5000))
Testing pathway '5990980_Cell_Cycle'
Testing pathway '5990979_Cell_Cycle,_Mitotic'
Error in colnamesInt(x, neworder, check_dups = TRUE) : 
  argument specifying columns specify non existing column(s): cols[1]='pathway'

I have searched and looked everywhere and tried a few things but don't understand where the issue is. Would you be able to help?

I have also tried to run the script with my own .rnk and .gmt files. In that case I already get an error at the GSEA step:

> fgseaRes <- fgsea(pathways = examplePathways, 
+                   stats = exampleRanks,
+                   minSize=15,
+                   maxSize=500,
+                   nperm=10000)
Error in `[.data.frame`(x, order(x, na.last = na.last, decreasing = decreasing)) : 
  undefined columns selected
In addition: Warning message:
In fgsea(pathways = examplePathways, stats = exampleRanks, minSize = 15,  :
  There are ties in the preranked stats (50% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.

Any suggestions for a way around this?
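
One possible cause of the "undefined columns selected" error, and this is only a guess since the input files are not shown, is that `stats` was passed as a data.frame read from the .rnk file. The fgsea documentation asks for a named numeric vector, with gene identifiers as names. A sketch of the conversion in base R, with made-up column names and values:

```r
# Assumption: the .rnk file was read into a two-column data.frame (gene, score).
# The column names and values below are hypothetical.
rnk <- data.frame(gene  = c("TP53", "MYC", "EGFR", "MYC"),
                  score = c(2.1, -0.4, 1.3, -0.9),
                  stringsAsFactors = FALSE)

# Keep one score per gene (here simply the first occurrence)
rnk <- rnk[!duplicated(rnk$gene), ]

# Build the named numeric vector: values are scores, names are gene IDs
ranks <- setNames(rnk$score, rnk$gene)

str(ranks)
```

The resulting `ranks` vector is what would be passed as `stats`. This does not address the warning about ties, which comes from many genes sharing the same score.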


modified 10 months ago • written 10 months ago by MarcoL