Question: Combining GSEA (Gene Set Enrichment Analysis) and DEG (Differentially Expressed Genes) analysis to confirm each other's results: is it a good idea?
chokevin8 wrote, 9 weeks ago:

Hi, I'm trying to start a project in R where I input cancer patient data to find DEGs and ultimately search for possible pharmaceutical targets. My question is whether I can input the same data into GSEA and a DEG analysis so that each confirms the other's conclusions. Right now, I'm only using DEG analysis (the voom + limma packages in R) to filter/select significant genes.

I know that these two analyses are completely different- GSEA takes in a priori gene sets and gives information relevant to significant gene SETS for each phenotype. DEG will look into individual GENES (not gene sets) and gives us a list of differentially expressed genes for each phenotype.

However, I was wondering if these can work together in harmony so that we can first use GSEA to filter significant gene sets and then use DEG to test individual genes significantly enriched in those gene sets of GSEA. I thought this would help because just performing DEG inherently lacks biological significance. But while GSEA has biological significance, it doesn't have the ability to detect at the level of individual genes. So why not make them work together to complement each other's strengths/weaknesses?

For example, I would run GSEA for two different cancer types (phenotypes) A and B, and find that gene set X is overexpressed. Then I would look into which individual genes contribute the most to the enrichment score for gene set X, and run a DEG analysis on those genes. If I find genes that are significantly overexpressed in specific types of cancer, those genes could themselves be plausible targets.

I do recognize the difficulty of running these together: there are many different packages that use different methods (e.g., different normalization methods). But putting these problems aside, I'm just asking: if I could get this right, would it be a good idea?

Thank you for your input :)

sequencing rna-seq R • 339 views
miky.zo (University of Insubria, Busto Arsizio, Italy) wrote, 9 weeks ago:

Hi @chokevin8, I think there is a misunderstanding about the methods. You use the GSEA method to analyze your DEG list (or at least, that is what I understand GSEA to be for).

So, you can first produce your DEG list, with gene name/symbol/ID, p-value and logFC. Then you use this list to run GSEA. Also, it's better to run GSEA on ALL your genes, not only the over-expressed/under-expressed ones.

What GSEA does is rank your genes based on a value that you provide. Usually this value is the logFC of each gene, but sometimes I have even seen calculations like pvalue * logFC; in this way you also take the significance of the gene into consideration, even if GSEA itself doesn't care about it! ;)
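To make the ranking step concrete, here is a minimal Python sketch (the thread itself works in R, so this is only an illustration, not anyone's actual pipeline). It assumes a hypothetical table of genes with logFC and p-value, and ranks all genes either by logFC or by a signed significance score, sign(logFC) * -log10(p). That score is one common way of combining direction and significance and is swapped in here for the "pvalue * logFC" idea mentioned above; the gene names and numbers are made up.

```python
import math

# Hypothetical DE results: gene -> (logFC, p-value). All values are made up.
de_results = {
    "GENE_A": (2.1, 0.0004),
    "GENE_B": (-1.8, 0.0100),
    "GENE_C": (0.3, 0.6000),
    "GENE_D": (-0.2, 0.8000),
    "GENE_E": (2.5, 0.0300),
}


def rank_by_logfc(results):
    """Order ALL genes from most up-regulated to most down-regulated."""
    return sorted(results, key=lambda g: results[g][0], reverse=True)


def signed_significance(logfc, pvalue):
    """sign(logFC) * -log10(p); assumes pvalue > 0 (log10(0) is undefined)."""
    sign = (logfc > 0) - (logfc < 0)
    return sign * -math.log10(pvalue)


def rank_by_signed_significance(results):
    """Order ALL genes by the signed significance score, highest first."""
    return sorted(
        results,
        key=lambda g: signed_significance(*results[g]),
        reverse=True,
    )


by_logfc = rank_by_logfc(de_results)
by_signed = rank_by_signed_significance(de_results)
print(by_logfc)
print(by_signed)
```

In this toy table the two rankings disagree about the top gene: GENE_E has the larger fold change, but GENE_A has the much smaller p-value. Either way, every gene stays in the list; nothing is filtered out before ranking.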

Now, imagine that you have a DEG list with logFC. You load it into the GSEA program (from the Broad Institute) or into an online tool like Enrichr. GSEA basically takes the ranked genes (from + to -, according to logFC) and compares them with the specified gene sets.

Then, and this is VERY important, the result is not that a specific pathway is up- or down-regulated, but that the pathway is affected in some way by the condition that you're studying. In fact, among the enriched genes you will have both up- and down-regulated ones. The result of GSEA is a broad picture of what's going on in your cell line / model.

I hope that this helps you!


"the result is not that a specific pathway is up- or down-regulated"

You should clarify this part. The result is actually an enrichment score with a specific direction (up or down). Not all genes are in the same direction, but there should be an enrichment at one end of the spectrum (see also: "leading-edge genes").

Reply written 9 weeks ago by igor (modified 9 weeks ago)
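The point about a signed enrichment score and leading-edge genes can be illustrated with a minimal Python sketch of the running-sum statistic. This is a simplified illustration under stated assumptions, not the GSEA software itself: it computes only the raw enrichment score for one gene set, with no permutation test or normalization, and the gene names, scores and gene sets are made up.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Return (ES, leading_edge) for one gene set over a ranked gene list.

    ranked_genes: genes ordered from most up- to most down-regulated.
    scores: the ranking metric of each gene, in the same order.
    """
    in_set = [g in gene_set for g in ranked_genes]
    n_hits = sum(in_set)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    # Hits are weighted by |score| ** weight; misses all count equally.
    hit_norm = sum(abs(s) ** weight for s, hit in zip(scores, in_set) if hit)
    if hit_norm == 0:
        raise ValueError("all gene-set members have a zero score")

    running, best, best_idx = 0.0, 0.0, -1
    for i, (s, hit) in enumerate(zip(scores, in_set)):
        if hit:
            running += abs(s) ** weight / hit_norm
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(best):  # track the maximum deviation from zero
            best, best_idx = running, i

    if best >= 0:
        # Positive ES: set members at or before the peak.
        window = zip(ranked_genes[: best_idx + 1], in_set[: best_idx + 1])
    else:
        # Negative ES: set members at or after the trough.
        window = zip(ranked_genes[best_idx:], in_set[best_idx:])
    return best, [g for g, hit in window if hit]


# Toy ranked list: g1 is the most up-regulated, g10 the most down-regulated.
genes = [f"g{i}" for i in range(1, 11)]
metric = [5, 4, 3, 2, 1, -1, -2, -3, -4, -5]

es_up, lead_up = enrichment_score(genes, metric, {"g1", "g2", "g4", "g9"})
es_down, lead_down = enrichment_score(genes, metric, {"g3", "g8", "g9", "g10"})
print(es_up, lead_up)
print(es_down, lead_down)
```

Both toy sets contain one gene that moves against the rest (g9 in the first, g3 in the second), yet each set still gets a score with a clear sign, because most of its members sit at one end of the ranking. The leading-edge genes are the members that drive that peak.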

Hey, thanks for your kind input :) I do understand your method and why you would suggest one like that. However, don't you think both would work? GSEA first then DEG/DEG first then GSEA? But the reason why I thought GSEA first then DEG would work better is because if you do DEG first for tens of thousands of individual genes, then it is simply too inefficient- GSEA would help reduce dimensionality for the subsequent DEG test. Though, it would be interesting to see the differences of results of DEG first then GSEA vs GSEA first then DEG...

Also, when you say the result of a GSEA is not that a specific pathway is up- or down-regulated, you're basically saying that the reason why we do GSEA is just to see which pathway (gene sets) is affected by the phenotype, right? So basically up-regulation and down-regulation isn't significant in GSEA...

Reply written 9 weeks ago by chokevin8

There isn't any problem with the dimensionality of the DEG test, as more genes lead to better estimation of the per-gene parameters. How would you do GSEA before having an ordered list?

Yes, usually GSEA methods do not evaluate whether a pathway is up- or down-regulated (how would it know?)

Reply written 9 weeks ago by Lluís R. (modified 8 weeks ago)

Hi! I don't understand how you would do a GSEA without having a list of Differentially Expressed Genes... What would you insert as input in the analysis? The GSEA starts with a list of DEGs, so in any case you need to do the DEG analysis before running GSEA. So, you do the DEG analysis and find genes up-regulated and down-regulated in tumor vs. normal. Now you have a list of DEGs that you can analyse in different ways.

One, for example, is to decide on cutoffs for p-value and logFC to define what is really DE between the two conditions, say p-value < 0.05 and |logFC| > 1.5. In this way, you find genes that you can further analyze with an enrichment analysis such as GO or KEGG pathways, on only the upregulated genes for example (or only the downregulated ones), so that you find pathways that move in the same direction as your genes (+ or -).
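The cutoff-based approach can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the table of genes, the pathway membership and the helper names are hypothetical, and the enrichment step is shown as a plain hypergeometric over-representation test, which is one standard way to test a thresholded gene list against a pathway (GO/KEGG tools add annotation handling and multiple-testing correction on top).

```python
from math import comb

# Hypothetical DE table: gene -> (logFC, p-value). All values are made up.
de_table = {
    "G1": (2.0, 0.001), "G2": (1.8, 0.010), "G3": (1.6, 0.040),
    "G4": (1.7, 0.200), "G5": (0.5, 0.010), "G6": (-2.2, 0.001),
    "G7": (-1.9, 0.020), "G8": (-0.3, 0.500), "G9": (0.1, 0.900),
    "G10": (-1.6, 0.300),
}


def select_degs(table, p_cutoff=0.05, lfc_cutoff=1.5, direction="up"):
    """Keep genes with p < p_cutoff and |logFC| > lfc_cutoff in one direction."""
    selected = []
    for gene, (logfc, pvalue) in table.items():
        if pvalue >= p_cutoff or abs(logfc) <= lfc_cutoff:
            continue
        if (direction == "up") == (logfc > 0):
            selected.append(gene)
    return selected


def hypergeom_upper_tail(k, n_selected, n_pathway, n_background):
    """P(overlap >= k) when n_selected genes are drawn at random from the background."""
    total = comb(n_background, n_selected)
    upper = min(n_selected, n_pathway)
    favourable = sum(
        comb(n_pathway, i) * comb(n_background - n_pathway, n_selected - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


up_genes = select_degs(de_table, direction="up")
down_genes = select_degs(de_table, direction="down")

# Hypothetical pathway membership, restricted to the tested background.
pathway = {"G1", "G2", "G3", "G9"}
overlap = len(pathway & set(up_genes))
p_enrich = hypergeom_upper_tail(overlap, len(up_genes), len(pathway), len(de_table))
print(up_genes, down_genes, overlap, p_enrich)
```

Note the design difference from GSEA: here G4 and G5 are discarded before the pathway test because they miss one of the two cutoffs, whereas a ranked-list method would still use them.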

With GSEA you use all the genes, as I said before, and you obtain a list of pathways/biological processes in which your genes are involved, based on the ranking provided. In these lists you can find both up- and downregulated genes, because as you know a pathway is composed of many components. So, GSEA gives a general picture of what's going on.

For your aim, both methods can be good. You can find gene X, a very important target to block, at the top of your DEG analysis; but you can also find that "epithelial to mesenchymal transition" is enriched in your GSEA analysis (based on the same DEG list), so you can pick one of the many genes involved as a target to block the entire pathway, instead of only a few genes. ;)

Reply written 8 weeks ago by miky.zo

"I don't understand how you would do a GSEA without having a list of Differentially Expressed Genes... What would you insert as input in the analysis? The GSEA starts with a list of DEGs"

It is not recommended to use only differentially expressed genes for GSEA. See previous discussions:

Reply written 8 weeks ago by igor (modified 8 weeks ago)

You're right, sorry, I wrote something that can create confusion: by "DEG list" I mean the full list that you obtain after you compare your two conditions in the RNA-seq/microarray analysis, so technically it is just all the genes that come out of the analysis, with their p-values and logFC. :)

Reply written 8 weeks ago by miky.zo

Thank you everyone for the input. After reading igor's comment, I'm guessing I should use a DEG package in R (DESeq2, limma, etc.) and then run GSEA Preranked on the results, is that right?

Reply written 8 weeks ago by chokevin8

That would make sense.

Reply written 8 weeks ago by igor

How about doing pathway analysis with the R package "ROntoTools"? Would that be a better idea, since it would provide actual biological significance and be more accurate than gene set analysis (DEG, GSEA)?

Reply written 8 weeks ago by chokevin8