Trimming redundant gene sets after gsea analysis

Entering edit mode

8.7 years ago

fizer ▴ 30

Hi,

I am performing gsea with p-values and foldchanges of genes (homosapiens) obtained from rna-seq data. Is it a good idea to do reduction of redundant gene sets afterwards? Because when I plot the results from gsea analysis as a network plot, too many terms that are significant are plotted and plot becomes very hard to read. I know that people do redundant term reduction before or after GO over-representation analysis (hypergeometric test) but I am not sure if it should be done after GSEA type analysis. I want to keep significant term of specific level and remove general term if the term contains >=50% genes as compared to general terms levels. Is there any method available? Suggestions please.

GSEA Term Reduction GO analysis • 5.4k views

ADD COMMENT • link updated 8.7 years ago by alserg ▴ 1000 • written 8.7 years ago by fizer ▴ 30

Entering edit mode

I am not sure for GSEA results...

But for GO enrichment analysis with goseq, I usually remove the too specific and too general terms for plots. I have written a R package called gogadget (gogadget: an R package for go analysis visualization and interpretation ), with a filter function.

But there are more tools available such as REVIGO or GO trimming.

Good luck!

ADD REPLY • link 8.7 years ago by Benn 8.4k

Entering edit mode

8.7 years ago

alserg ▴ 1000

One of the things I was playing with to reduce redundant gene sets was bayesian-network-like filtering. There is example of it here:

	# eliminating indirect hits based on bayesian networks ideas

	library(fgsea)
	library(data.table)

	eliminatePathways <- function(universe, pathways, pathway2name, isNonRandomPval, pvalThreshold=1e-3) {
	messagef <- function(...) message(sprintf(...))
	pathways <- lapply(pathways, intersect, universe)
	mainPathways <- c()

	res <- list()

	pp <- names(pathway2name)[3] # for debug

	for (pp in names(pathway2name)) {
	messagef("Testing pathway '%s'", pathway2name[pp])

	if (length(mainPathways) == 0) {
	# nothing to compare with
	mainPathways <- pp
	next
	}

	interesting <- TRUE
	pp1 <- mainPathways[1] # for debug
	# We want our new pathway to give us additional information compared to all others that we have already
	for (pp1 in mainPathways) {
	notExplainsPval <- min(isNonRandomPval(universe = pathways[[pp1]],
	p = pathways[[pp]]),
	isNonRandomPval(universe = setdiff(universe, pathways[[pp1]]),
	p= pathways[[pp]]))
	res <- c(res, list(data.frame(
	checking=pp,
	reference=pp1,
	pval=notExplainsPval)))

	if (notExplainsPval > pvalThreshold) {

	messagef("Not adding pathway '%s', because we already have '%s'",
	pathway2name[pp],
	pathway2name[pp1])

	eqPval <- min(
	isNonRandomPval(universe = setdiff(universe, pathways[[pp]]),
	pathways[[pp1]]),
	isNonRandomPval(universe = pathways[[pp]],
	pathways[[pp1]]))

	res <- c(res, list(data.frame(
	checking=pp1,
	reference=pp,
	pval=eqPval)))

	if (eqPval <= pvalThreshold) {
	messagef("And they are not equivalent: %s", eqPval)
	} else {
	messagef("They are equivalent: %s", eqPval)
	}
	interesting = FALSE
	break
	}
	}

	if (interesting) {
	messagef("Adding pathway '%s', because we can",
	pathway2name[pp])
	mainPathways <- c(mainPathways, pp)

	# OK it gives use new info, may be now we don't need some of already added pathways?

	# Now roles have changed, we test every interesting pathways agains out newly added one
	for (pp1 in setdiff(mainPathways, pp)) {
	notExplainsPval <- min(
	isNonRandomPval(universe = setdiff(universe, pathways[[pp]]),
	pathways[[pp1]]),
	isNonRandomPval(universe = pathways[[pp]],
	pathways[[pp1]]))
	res <- c(res, list(data.frame(
	checking=pp1,
	reference=pp,
	pval=notExplainsPval)))
	if (notExplainsPval > pvalThreshold) {
	messagef("Removing pathway %s, because we now have %s",
	pathway2name[pp1],
	pathway2name[pp])
	mainPathways <- setdiff(mainPathways, pp1)
	}
	}
	}
	}

	list(mainPathways=mainPathways,
	table=rbindlist(res))
	}

	fgseaIsNonRandomPval <- function(pathways, ranks, nperm) {
	isNonRandomPval <- function(universe, p) {
	pathway <- intersect(p, universe)
	if (length(pathway) == 0 \|\| length(pathway) == length(universe)) {
	return(1)
	}
	fgsea(pathways = list(pathway),
	stats=ranks[names(ranks) %in% universe],
	nperm = nperm)$pval
	}
	}


	data("exampleRanks")
	data("examplePathways")

	fgseaRes <- fgsea(pathways = examplePathways,
	stats = exampleRanks,
	minSize=15,
	maxSize=500,
	nperm=10000)

	# checking pathways in the order of decreasing significance
	pathway2name <- fgseaRes[padj < 1e-2][order(pval), setNames(pathway, pathway)]

	# this process is slow, takes ~5 minutes
	date()
	elimRes <- eliminatePathways(universe=names(exampleRanks),
	pathways=examplePathways,
	pathway2name=pathway2name,
	isNonRandomPval=fgseaIsNonRandomPval(examplePathways, exampleRanks, nperm=5000))
	date()

	elimRes$mainPathways

	# should display something like:
	# 5990980_Cell_Cycle
	# 5991023_Metabolism_of_carbohydrates
	# 5991130_Programmed_Cell_Death
	# 5991851_Mitotic_Prometaphase
	# 5991078_Metabolism_of_nucleotides
	# 5991662_Cholesterol_biosynthesis
	# 5992313_Chromatin_modifying_enzymes

	# elimeRes$table contains results of testing pathways against other pathways (~log)

view raw gsea_elim.R hosted with ❤ by GitHub

The idea is the following. Let's consider two enriched overlapping pathways p1 and p2. First, let's make a hypothesis that p1 is truely enriched and p2 is just piggybacked to it because of the overlap. You can test this hypothesis by looking at genes unique to p2, that is setdiff(p2, p1). If for these genes you also have enrichment, that the hypothesis is false and you better keep p2. You can also check the other way, whether p1 have some unique enrichment compared to p2. By repeating this operation you can come up with a list of uniquely enriched pathways.

This not only removes redundant pathways, but also it will leave pathways at the most enriched level, which I found useful. However, there several arbitrary thresholds for p-values involved, so one need to be a little careful with interpretation. Otherwise it worked pretty well for me.

ADD COMMENT • link 8.7 years ago by alserg ▴ 1000

Entering edit mode

Hi,

I have a similar issue as in the original question: RNAseq data where a sample has a lot of very significant changes (around 10,000 genes) that I analysed with GSEA only to get hundreds of gene sets significantly enriched. However, I have the feeling most of them are just redundant as names are very similar and they often share a large number of genes. So here I am, looking for a way to trim redundant gene sets.

I tried your script but unfortunately I can't seem to make it work. I'm still quite a beginner with R so please bear with me. When I try to run the script as it is (i.e. using the example files in the fgsea package) I get the following error:

> elimRes <- eliminatePathways(universe=names(exampleRanks),
+                              pathways=examplePathways,
+                              pathway2name=pathway2name,
+                              isNonRandomPval=fgseaIsNonRandomPval(examplePathways, exampleRanks, nperm=5000))
Testing pathway '5990980_Cell_Cycle'
Testing pathway '5990979_Cell_Cycle,_Mitotic'
Error in colnamesInt(x, neworder, check_dups = TRUE) : 
  argument specifying columns specify non existing column(s): cols[1]='pathway'

I have searched and looked everywhere and tried a few things but don't understand where the issue is. Would you be able to help?

Also I have tried to run the script with my own .rnk and .gmt files. In that case I get an error already at the GSEA results:

> fgseaRes <- fgsea(pathways = examplePathways, 
+                   stats = exampleRanks,
+                   minSize=15,
+                   maxSize=500,
+                   nperm=10000)
Error in `[.data.frame`(x, order(x, na.last = na.last, decreasing = decreasing)) : 
  undefined columns selected
In addition: Warning message:
In fgsea(pathways = examplePathways, stats = exampleRanks, minSize = 15,  :
  There are ties in the preranked stats (50% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.

Any suggestions for a way around this?

Thanks

ADD REPLY • link 5.2 years ago by MarcoL • 0