Hey y'all
I'm currently trying to run GSEA using a pre-ranked gene list but I'm not sure if my input file is correctly formatted, because my results seem to be mostly insignificant.
My input looks something like this (roughly 16,000 genes), where the ranking statistic is the negative log of the p-value obtained through an association test:
GENE neg_log_Chi_permutation
ARHGAP4 0.928986
C16orf3 1.496821
HOPX 0.975562
FAM3D 1.132781
HTR2C 1.276158
UGCG 0.064802
VPS13D 0.123508
VWF 0
My results show over 300 gene sets as enriched, but many of them have an FDR q-value close to 1 despite a high NES. What could I be doing wrong?
Best,
Steven
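In the original paper describing GSEA, the significance cutoff used is actually FDR less than 0.25. Additionally, pay attention to p=0 when you -log(p): -log(0) is infinite, so those genes need special handling before ranking.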
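One way to deal with the p=0 caveat is to replace zero p-values with a small floor before taking the log. This is only a sketch: the function name `neg_log10_p` and the choice of floor (half of the smallest nonzero p-value) are my assumptions, not something GSEA prescribes.

```python
import math

def neg_log10_p(p_values, floor=None):
    """Return -log10(p) for each p-value, replacing p == 0 with a small floor."""
    nonzero = [p for p in p_values if p > 0]
    if floor is None:
        # Assumption: half of the smallest nonzero p-value is used as the floor.
        floor = min(nonzero) / 2 if nonzero else 1e-300
    # max() clamps zeros up to the floor so the logarithm stays finite.
    return [-math.log10(max(p, floor)) for p in p_values]

# Hypothetical p-values, including one exact zero and one p = 1.
scores = neg_log10_p([0.0, 0.01, 0.5, 1.0])
```

The gene with p=0 still gets the highest score, but it is finite, so the ranked list stays usable.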
I do have several instances of p=1, and subsequently instances where genes are ranked the same. How should that be reconciled?
I just recognised an error in your methods. You should combine the sign of the logFC with the -log of the p-value; with -log(p) alone, both up- and down-regulated DE genes end up at the top of the rnk file. Alternatively, rank by another metric, such as logFC or the t statistic. Additionally, GSEA uses all the genes, not just the DE genes.
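A minimal sketch of the signed metric, assuming you have a logFC and a nonzero p-value per gene. The gene names and numbers below are hypothetical, and `signed_score` is just an illustrative helper name.

```python
import math

def signed_score(log_fc, p_value):
    """Direction comes from logFC, magnitude from -log10(p). Requires p_value > 0."""
    sign = (log_fc > 0) - (log_fc < 0)  # +1, 0 or -1
    return sign * -math.log10(p_value)

# Hypothetical (gene, logFC, p-value) rows, not taken from the post.
genes = [
    ("GENE_UP", 2.0, 0.001),
    ("GENE_DOWN", -1.5, 0.0001),
    ("GENE_FLAT", 0.1, 0.9),
]

# Sort from most up-regulated to most down-regulated.
ranked = sorted(
    ((name, signed_score(fc, p)) for name, fc, p in genes),
    key=lambda item: item[1],
    reverse=True,
)
```

With this metric the strongly down-regulated gene lands at the bottom of the list rather than at the top next to the up-regulated ones.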
Hi Zhilong,
Regarding "pay attention to p=0 when you -log(p)", should those genes with p=1 be filtered out prior to the GSEA analysis?
I suggest that all the genes/probes be included in the GSEA input.
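As a rough sketch of what that looks like in practice, the snippet below writes every gene, including the one with a score of 0 (p=1), to a two-column tab-delimited ranked list. The helper name `write_rnk` is my own, and the three scores are copied from the example in the question; I am assuming the usual gene-then-score layout for the rnk file.

```python
import csv
import io

def write_rnk(scores, handle):
    """Write every gene and its score as tab-separated lines, highest score first."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for gene, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        writer.writerow([gene, score])

# No filtering: the gene with score 0 stays in the list.
scores = {"ARHGAP4": 0.928986, "C16orf3": 1.496821, "VWF": 0.0}

buffer = io.StringIO()  # swap for open("ranked.rnk", "w", newline="") to write a file
write_rnk(scores, buffer)
rnk_text = buffer.getvalue()
```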