Question: Need help choosing genes and calculating ranks for GSEA rank file
2
gravatar for zlikowski
9 weeks ago by
zlikowski20
zlikowski20 wrote:

Dear all,

I am trying to do my first pathway enrichment analysis of a ranked gene list using GSEA, as described in the relatively recent protocol published here: https://www.nature.com/articles/s41596-018-0103-9. I have "successfully" completed the entire protocol, meaning that I have, more or less, learned how to perform the DE analysis (using DESeq2), how to format the rank, class, gmt and expression files for GSEA, and use other tools required by the protocol. However, I have put quotations marks around the word "successfully", because I am not sure whether I have used the correct gene set when preparing rank files for GSEA. Here's what's bothering me:

1) As I understand, an input rank file for running GSEA in preranked mode should contain gene IDs in one column, and gene ranks in the second column. However, I am not certain which genes exactly should such a file contain. Before running the DESeq2 functions, I usually pre-filter low count genes, as suggested in the vignette, but the above-mentioned protocol states that "...all (or most) genes in the given genome need to have a score. What does the term "all genes" refers to in the context of GSEA analysis, then? And does that mean that, if I want to run GSEA, I should not pre-filter (remove) low counts genes before running DESeq2 and obtaining the results of DE analysis? So, briefly, which genes should be included in the rank file for properly running GSEA in preranked mode - all annotated genes for a given genome, DESeq2 pre-filtered genes, only genes for which padj was calculated during DE analysis, or something else?

2) Next, once I do obtain the results of DE analysis, with the proper set of genes for GSEA, a certain number of genes will contain padj="NA" value. Since "NA" cannot be used to calculate rank, my next question is how to deal with those genes? Should I remove them before running GSEA (which might be in conflict with the "all genes requirement"), or should I change their padj value to some number, e.g. 1?

3) Finally, what to do when certain number of genes have padj=0. This also complicates rank calculation, and I was wondering whether it would be "fair" to change the padj values for those genes to the smallest non-zero value (which one?) that can be used for calculations on a "standard" 64-bit computer?

Thanks everyone in advance, help is greatly appreciated!

Mirko

ADD COMMENTlink modified 9 weeks ago by alserg250 • written 9 weeks ago by zlikowski20
0
gravatar for alserg
9 weeks ago by
alserg250
alserg250 wrote:

1) I recommend to consider "all genes" as all expressed genes, considering only top 10-12K most expressed genes (by average expression, or baseMean, if it's a result of DESeq2). You can use all the genes as well, but the list will contain a lot of noise.

As for 2) and 3) you can use stat field of the results table for ranking, it is always present (for expressed genes).

Additionally, I would recommend to use fgsea package, it's much faster than Broad's version and compatible with R. See for example: http://bioconductor.org/packages/release/bioc/vignettes/fgsea/inst/doc/fgsea-tutorial.html and https://stephenturner.github.io/deseq-to-fgsea/

ADD COMMENTlink written 9 weeks ago by alserg250

Hello,

my apologies for not responding sooner, I was out of the country. Thank you very much for your suggestions and help. I will try to do GSEA using stat and some other ranking criteria to see how much of a difference that makes, and post the results here once I'm done.

ADD REPLYlink written 7 weeks ago by zlikowski20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1979 users visited in the last hour