I am trying to do my first pathway enrichment analysis of a ranked gene list using GSEA, as described in the relatively recent protocol published here: https://www.nature.com/articles/s41596-018-0103-9. I have "successfully" completed the entire protocol, meaning that I have, more or less, learned how to perform the DE analysis (using DESeq2), how to format the rank, class, gmt and expression files for GSEA, and how to use the other tools required by the protocol. However, I have put quotation marks around the word "successfully" because I am not sure whether I used the correct set of genes when preparing the rank files for GSEA. Here's what's bothering me:
1) As I understand it, an input rank file for running GSEA in preranked mode should contain gene IDs in the first column and gene ranks in the second column. However, I am not certain exactly which genes such a file should contain. Before running the DESeq2 functions, I usually pre-filter low-count genes, as suggested in the vignette, but the above-mentioned protocol states that "...all (or most) genes in the given genome need to have a score". What does the term "all genes" refer to in the context of GSEA, then? Does it mean that, if I want to run GSEA, I should not pre-filter (remove) low-count genes before running DESeq2 and obtaining the DE results? In short, which genes should be included in the rank file to run GSEA properly in preranked mode - all annotated genes for a given genome, the DESeq2 pre-filtered genes, only genes for which padj was calculated during DE analysis, or something else?
2) Next, once I obtain the DE results for the proper set of genes, a certain number of genes will have padj = NA. Since NA cannot be used to calculate a rank, how should I deal with those genes? Should I remove them before running GSEA (which might conflict with the "all genes" requirement), or should I set their padj to some number, e.g. 1?
3) Finally, what should I do when a certain number of genes have padj = 0? This also complicates the rank calculation, and I was wondering whether it would be "fair" to change the padj values for those genes to the smallest non-zero value (which one?) that can be represented on a "standard" 64-bit computer?
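To make questions 2 and 3 concrete, the options can be sketched as follows. This is only a minimal illustration, not code from the protocol or from DESeq2: it assumes the rank score is sign(log2FoldChange) * -log10(p), which is one common choice, and it is written in plain Python purely so that it is self-contained. The function names, the `na_policy` argument and the gene names are hypothetical. The floor used for p = 0 is the smallest positive normalized double (`sys.float_info.min`, about 2.2e-308), which I believe corresponds to `.Machine$double.xmin` in R.

```python
import math
import sys

# Smallest positive normalized 64-bit double (about 2.2e-308).
P_FLOOR = sys.float_info.min


def rank_score(log2fc, p, na_policy="drop"):
    """Return sign(log2fc) * -log10(p), or None if the gene is left out.

    log2fc and p are floats, or None where the DE table has NA.
    na_policy (hypothetical name): "drop" removes genes with p = NA,
    "set_to_one" treats p = NA as p = 1, giving a score of 0.
    """
    if na_policy not in ("drop", "set_to_one"):
        raise ValueError("na_policy must be 'drop' or 'set_to_one'")
    if log2fc is None:
        return None  # no direction available, so no signed score
    if p is None:
        if na_policy == "drop":
            return None
        p = 1.0
    p = max(p, P_FLOOR)  # p = 0 would otherwise give an infinite score
    sign = (log2fc > 0) - (log2fc < 0)
    return sign * -math.log10(p)


def build_ranks(rows, na_policy="drop"):
    """rows: iterable of (gene_id, log2fc, p). Returns (gene_id, score)
    pairs sorted from most up-regulated to most down-regulated."""
    scored = []
    for gene_id, log2fc, p in rows:
        score = rank_score(log2fc, p, na_policy)
        if score is not None:
            scored.append((gene_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up example rows: (gene_id, log2FoldChange, p), with None standing for NA.
rows = [
    ("GENE_A", 2.5, 0.0),
    ("GENE_B", -1.2, 1e-4),
    ("GENE_C", 0.3, None),
    ("GENE_D", -3.0, 0.0),
]

dropped = build_ranks(rows, na_policy="drop")      # GENE_C is left out
kept = build_ranks(rows, na_policy="set_to_one")   # GENE_C gets score 0
```

One consequence of the floor is that every gene with p = 0 ends up tied at the same magnitude (about 307.65 with this floor), which is part of why I am unsure whether this replacement is "fair".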
Thanks everyone in advance, help is greatly appreciated!