Question: GSEA PreRanked lists from DESeq2 results table
7 weeks ago, Assa Yeroslaviz wrote:

I have several tables of results from different DESeq2 runs. The next step would be to do GO enrichment or GSEA enrichment analysis.

For that I would like to create a ranked list of genes for GSEAPreRanked, but I'm not sure which value to use for the ranking. Do I use the log2FC values, the p-values, or even the adjusted p-values?

I have searched different forums and the opinions vary.

When I use the command sign(resultsObject$log2FoldChange) * -log10(resultsObject$padj), I get Inf wherever padj = 0.

For the GO enrichment I can use the goseq package. For the GSEA I want to use fgsea, which needs a ranked gene list.

Is it better to rank the list by significance (adjusted p-values) or by effect size (fold change)?

I would appreciate your opinions and/or recommendations.

Thanks, Assa

Tags: preranked, gsea, deseq2, fgsea
modified 7 weeks ago by jomo018 • written 7 weeks ago by Assa Yeroslaviz

I know it's very common, but I am personally a little worried about using p-values as the ranking. You can have very strong changes with high p-values and very subtle changes with low p-values.

There is a nice example here where they use the test statistic as the ranking, which is a nice strategy:

written 7 weeks ago by igor

Thanks for the link. It is a very good example.

written 7 weeks ago by Assa Yeroslaviz

I'd recommend against using adjusted p-values; use the unadjusted p-values instead. The default FDR adjustment squashes genes to the same adjusted p-value even when their input p-values differ. The distribution of logFC also differs between genes with different average expression levels, which is why I tend to rank on the signed p-values rather than the FCs.
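To illustrate the squashing effect, here is a minimal base-R sketch. The p-values and gene names are made up, and it assumes the Benjamini-Hochberg method, which is the default adjustment in DESeq2.

```r
# Toy nominal p-values (made up for illustration), all distinct.
p <- c(g1 = 0.001, g2 = 0.02, g3 = 0.021, g4 = 0.022, g5 = 0.5)

# Benjamini-Hochberg adjustment (the DESeq2 default for padj).
padj <- p.adjust(p, method = "BH")

# Count distinct values before and after adjustment.
n_distinct_p <- length(unique(p))
n_distinct_padj <- length(unique(padj))

# g2, g3 and g4 end up with the same adjusted value,
# so they would be tied in any padj-based ranking.
print(padj)
```

With these inputs, five distinct p-values collapse to three distinct adjusted values, so the order among g2, g3 and g4 is lost.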

written 7 weeks ago by russhh

Good point about the same adjusted p-values. On a related note, there will also be a lot of adjusted p-values that are 1. Other than that, the adjusted and unadjusted p-values will correlate, so the rank order will be the same.

written 7 weeks ago by igor

7 weeks ago, alserg wrote:

Definitely don't use adjusted p-values. The signed log of the nominal p-value or the test statistic (the stat column) should be fine. I personally use the latter, but I don't have any arguments for this. In my experience the results are very similar.
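A minimal sketch of ranking by the stat column might look like the following. The data frame is a made-up stand-in for as.data.frame(results(dds)), and `pathways` in the commented call is a hypothetical named list of gene sets.

```r
# Toy stand-in for a DESeq2 results table; all values are made up.
res <- data.frame(
  gene = c("A", "B", "C", "D", "E"),
  log2FoldChange = c(2.1, -1.4, 0.3, -0.1, NA),
  stat = c(8.5, -6.2, 1.1, -0.4, NA),
  stringsAsFactors = FALSE
)

# Drop genes without a statistic (DESeq2 reports NA, e.g. for all-zero counts).
res <- res[!is.na(res$stat), ]

# Named numeric vector, sorted from most up- to most down-regulated.
ranks <- sort(setNames(res$stat, res$gene), decreasing = TRUE)
print(ranks)

# With fgsea installed and gene sets in `pathways` (hypothetical here),
# the ranking would be passed as:
# fgseaRes <- fgsea::fgsea(pathways = pathways, stats = ranks)
```

The gene identifiers used as names have to match the identifiers in the gene sets.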

written 7 weeks ago by alserg

This is exactly what I mean. Some people use these values, others use different ones, sometimes without any particular reason, especially if the results are similar.

The advantage of using the FC values is that I don't have any 0 in the table.

What do you do with them if you convert to the signed log (nominal) p-value?

modified 7 weeks ago • written 7 weeks ago by Assa Yeroslaviz

Usually, there is no P-value of exactly one. But as I said, I prefer using the statistic, which is very straightforward.

written 7 weeks ago by alserg

Would you recommend removing genes with exactly 0 in the stat column? In my case I am using the F-stat column from edgeR::glmQLFTest, which is zero for a small subset of genes (about 200 out of the 17,000 that survive the filterByExpr filter), so there would be ties. Or doesn't it matter? I would appreciate your comment. If you need more details, please tell me.

written 7 weeks ago by ATpoint

I'm not that familiar with the edgeR pipeline. Isn't the F-statistic always positive rather than signed? If so, this is shady territory. The method will still work: it will tell you whether or not a gene set looks uniformly distributed along the ranking, but you should be careful with the interpretation.

In any case, don't remove genes based on the statistic, even if it's zero. Only remove them based on something uncorrelated with it, like average expression (that's what filterByExpr does).

written 7 weeks ago by alserg

Thanks for the reply! Yes, the F-stat is positive, so I would multiply it by -1 for genes with negative FCs.
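A minimal sketch of that sign flip, using a made-up table with logFC and F columns as a stand-in for the edgeR::glmQLFTest output:

```r
# Toy stand-in for the glmQLFTest result table; all values are made up.
tab <- data.frame(
  logFC = c(1.8, -2.3, 0.0, -0.2),
  F = c(40.2, 55.0, 0.0, 0.7),
  row.names = c("A", "B", "C", "D")
)

# F is non-negative, so take the direction of change from logFC.
# Note that sign(0) is 0, so a gene with logFC exactly 0 scores 0.
signed_F <- sign(tab$logFC) * tab$F

# Named vector sorted from most up- to most down-regulated.
ranks <- sort(setNames(signed_F, rownames(tab)), decreasing = TRUE)
print(ranks)
```

Genes with F = 0 stay in the ranking as ties around zero rather than being removed.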

written 7 weeks ago by ATpoint

Thanks, Alexey, for the answers and the help. I'm now following the fgsea package instructions and using the stat column for my ranking.

I still have one question, though. If I do decide to use the pvalue column, I still have some very significant genes with a p-value of 0, which the signed-log conversion turns into Inf. How should one handle this kind of data? Would replacing Inf with a value of 312 (equivalent to a p-value of 10^-312) be a possible solution?

modified 5 weeks ago • written 5 weeks ago by Assa Yeroslaviz

Yes, changing Inf to a big number should work fine.
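One way to do the replacement is sketched below on a made-up table. Capping at one unit beyond the largest finite score is an assumption of this sketch; a fixed large constant such as 312 would serve the same purpose.

```r
# Toy stand-in for a DESeq2 results table; all values are made up.
res <- data.frame(
  gene = c("A", "B", "C", "D"),
  log2FoldChange = c(3.0, -2.5, 0.4, -0.2),
  pvalue = c(0, 0, 1e-20, 0.6),
  stringsAsFactors = FALSE
)

# Signed -log10(p): +Inf or -Inf wherever pvalue is exactly 0.
score <- sign(res$log2FoldChange) * -log10(res$pvalue)

# Replace infinite scores with a finite cap, keeping the sign.
cap <- max(abs(score[is.finite(score)])) + 1
is_inf <- is.infinite(score)
score[is_inf] <- sign(score[is_inf]) * cap

ranks <- sort(setNames(score, res$gene), decreasing = TRUE)
print(ranks)
```

All genes that had a p-value of 0 and the same direction of change end up tied at the cap, which is one reason the stat column can be the simpler choice.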

written 5 weeks ago by alserg

7 weeks ago, jomo018 wrote:

I tend to view p-values and adjusted p-values as confidence measures, not enrichment measures, and therefore as less fit for GSEA.

Results for genes with high (bad) p-values are simply not reliable and should not be used for further analysis. Once you have set a p-value threshold, I would argue that FC (or log2FC) is the proper measure for GSEA.

written 7 weeks ago by jomo018

If you are compiling a ranked list of genes, it should include genes that are not significantly changing. One of the benefits of running a ranked list is that it aggregates signal from many genes that are not necessarily significant on their own.

written 7 weeks ago by igor

Yes, I realize that. However, if the p-value is not significant, I cannot be sure that my estimate for that gene is correct. So I prefer discarding that gene altogether rather than placing it incorrectly in the ranked list.

written 7 weeks ago by jomo018