What is the best way to rank genes for GSEA?
5.1 years ago
Gabriel ▴ 170

I am doing pathway and gene ontology analysis using Gene Set Enrichment Analysis (GSEA). These tools require a ranked gene list; however, various papers provide different recommendations on how to produce it.

Is there a current consensus on the ideal way to do this? I've been using log2 fold change, and I am unsure whether to use fold change or p-values instead, or some other method.

One post, "Problem with creating GSEA rank file", recommended signed p-values, but I haven't found any literature review or clarification on the issue. clusterProfiler mentions fold change for ranked gene lists, so I am unsure whether I would be getting "bad results" by sorting on p-values, or whether the different packages are optimized for one ranking or the other.

According to Yu, the author of clusterProfiler:

geneList contains three features:
1. numeric vector: fold change or other type of numerical variable
2. named vector: every number has a name, the corresponding gene ID
3. sorted vector: number should be sorted in decreasing order

https://github.com/GuangchuangYu/DOSE/wiki/how-to-prepare-your-own-geneList
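To make the three requirements concrete, here is a minimal sketch in plain Python (not the R code clusterProfiler actually expects). The gene IDs and values are made up for illustration, and `make_ranked_list` is a hypothetical helper, not part of any package.

```python
# Hypothetical log2 fold changes keyed by gene ID (made-up values).
log2_fc = {"GENE_A": 1.8, "GENE_B": -2.3, "GENE_C": 0.4, "GENE_D": -0.1}


def make_ranked_list(scores):
    """Return (gene ID, score) pairs sorted by score in decreasing order.

    This mirrors the three features quoted above: numeric values,
    each value named by a gene ID, sorted from highest to lowest.
    """
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranked = make_ranked_list(log2_fc)
for gene, score in ranked:
    print(f"{gene}\t{score}")
```

In R the equivalent would presumably be a named numeric vector passed through `sort(..., decreasing = TRUE)`, but that is my reading of the wiki text rather than something stated in it.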

"Other type of numerical variable" is unclear. Perhaps there are other methods similar to GSEA that have a more concrete way of doing things?

EDIT: with clusterProfiler's gseGO() function, I get different results when ranking by Log2FoldChange versus FoldChange.
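One likely reason for the difference: the log2 transform keeps the order of the genes, but the weighted GSEA statistic also uses the magnitude of each score (|score|^p), and raw fold changes squeeze all down-regulated genes into the interval (0, 1). The sketch below is a simplified, stand-alone version of the weighted running-sum enrichment score, not clusterProfiler's implementation; the gene names, values, and the function `enrichment_score` are assumptions for illustration only.

```python
def enrichment_score(ranked, gene_set, p=1.0):
    """Simplified weighted running-sum enrichment score.

    ranked   -- list of (gene, score) pairs, sorted by score, decreasing
    gene_set -- set of gene IDs to test
    p        -- weight exponent applied to |score| for genes in the set
    Assumes at least one gene inside and one gene outside the set,
    and a non-zero total weight for the genes in the set.
    """
    hit_weights = [abs(s) ** p if g in gene_set else 0.0 for g, s in ranked]
    total_hit = sum(hit_weights)
    n_miss = sum(1 for g, _ in ranked if g not in gene_set)

    running = 0.0
    best = 0.0
    for (gene, _), weight in zip(ranked, hit_weights):
        if gene in gene_set:
            running += weight / total_hit   # step up, weighted by |score|^p
        else:
            running -= 1.0 / n_miss         # step down, uniform penalty
        if abs(running) > abs(best):        # keep the largest deviation from 0
            best = running
    return best


# Made-up log2 fold changes and the matching raw fold changes.
log2_fc = {"G1": 3.0, "G2": 2.0, "G3": 1.0, "G4": -1.0, "G5": -2.0, "G6": -3.0}
fold_change = {gene: 2.0 ** value for gene, value in log2_fc.items()}


def by_score(scores):
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranked_log = by_score(log2_fc)
ranked_fc = by_score(fold_change)

# Hypothetical gene set with one strongly up- and one strongly down-regulated gene.
gene_set = {"G1", "G6"}

es_log = enrichment_score(ranked_log, gene_set)
es_fc = enrichment_score(ranked_fc, gene_set)
print("same order:", [g for g, _ in ranked_log] == [g for g, _ in ranked_fc])
print("ES with log2FC:", es_log)
print("ES with FC:", es_fc)
```

The gene order is identical in both rankings, yet the scores differ, because with raw fold change the down-regulated gene G6 contributes almost no weight. Whether this fully explains the gseGO() difference is an assumption on my part.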

GSEA GO Gene ranking RNA-Seq • 19k views
5.1 years ago
Pietro ▴ 230

Hi Gabriel

For GSEA, some people rank by the sign of the fold change multiplied by -log10(p-value). I found it here: http://crazyhottommy.blogspot.com/2016/08/gene-set-enrichment-analysis-gsea.html
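As a rough sketch of that metric, assuming each gene has a log2 fold change and a p-value from a differential expression test: the values below are invented, and `signed_log_p` and the `min_p` floor are hypothetical choices of mine, not something taken from the linked post.

```python
import math


def signed_log_p(log2_fc, p_value, min_p=1e-300):
    """Return sign(log2FC) * -log10(p-value).

    p-values are floored at min_p (an arbitrary choice) so that a
    reported p-value of exactly 0 does not raise a math domain error.
    """
    sign = (log2_fc > 0) - (log2_fc < 0)   # +1, 0, or -1
    return sign * -math.log10(max(p_value, min_p))


# Made-up differential expression results: gene -> (log2FC, p-value).
de_results = {
    "GENE_A": (2.0, 0.001),
    "GENE_B": (-1.5, 0.01),
    "GENE_C": (0.3, 0.5),
    "GENE_D": (-4.0, 0.0001),
}

scores = {gene: signed_log_p(lfc, p) for gene, (lfc, p) in de_results.items()}
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for gene, score in ranked:
    print(f"{gene}\t{score:.3f}")
```

Note that the size of the fold change plays no role here beyond its sign: GENE_D ends up at the bottom of the list because of its small p-value, not because of its large fold change.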


Just because you see something in published papers doesn't mean it's good or recommended. A lot of authors miss things or do not have a deep knowledge of what they are doing, and such technical details are often not checked by peer reviewers, even in high-impact papers.

Ranking metrics based only on logFC or only on p-value (the latter includes the approach above, since it uses the logFC only to get the direction) each have their downsides: genes ranked by logFC are biased by the larger variance of genes with low counts, and genes ranked by p-value are biased toward genes with higher abundance and longer transcripts. See https://support.bioconductor.org/p/85681/
