Question

GSEA preranked with DESeq2 in an RNA-seq analysis

3.5 years ago

Rafael Soler ★ 1.2k

Hi!

I am trying to perform a GSEA preranked analysis from a paired RNA-seq analysis, and I have 2 questions:

1. When the DESeq2 analysis is performed, many genes have NA values in columns of the data frame such as log2FC, padj, etc. To perform the GSEA we have to use ALL the genes, and I think it is obvious that I have to remove the genes with NA in log2FC (if that is the value used to rank the list), but what happens with the genes that have NA in padj? They have their own log2FC value (although the padj is NA). Should I remove them, or keep them all in the analysis?
2. Which value of the DESeq2 results should I use to prerank the genes? The log2FC? The stat?

Thanks a lot!! :D

DESeq2 GSEA RNA-seq Preranked • 1.8k views

updated 3.5 years ago by ATpoint 81k • written 3.5 years ago by Rafael Soler ★ 1.2k

score 1 · Answer 1 · 2020-10-07

I personally rank by -log10(p) * logFC, where p is the nominal ("raw") p-value and logFC is simply the log fold change. This gives positive values for logFC > 0 and negative values for logFC < 0. The advantage of raw p-values over padj is that they contain fewer ties (e.g. the many NAs or 1s) after independent filtering or when power is low. You do not want ties, since the GSEA methodology relies on a continuous ranking of genes.

You can also rank by logFC alone, e.g. after using lfcShrink. I do not use logFC alone because I use edgeR for DE analysis, which does not explicitly offer fold change shrinkage to correct the logFCs, so you get large FCs when counts are low. I therefore use the p-value to partly correct for this: p-values for large FCs at low counts are often high, so these large but unreliable FCs get penalized.

I have also seen others use the stat column of DESeq2::results(), the F statistic column, etc., depending on what your tool outputs. Technically it must be something that ranks every gene by fold change direction and "significance", where "significance" can be defined by the user: logFC, p-values, or anything else you find reasonable in the context of your experiment.
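To make the ranking concrete, here is a minimal sketch in base R. It assumes the DESeq2 results have been converted to a data frame with the usual `log2FoldChange`, `pvalue` and `padj` columns; the toy data frame `res`, its gene names and its values are made up for illustration. Genes that were removed by independent filtering typically have `padj = NA` but still carry a p-value and a fold change, so this sketch keeps them and only drops genes where the ranking inputs themselves are NA.

```r
# Hypothetical stand-in for as.data.frame(DESeq2::results(dds)); values are made up.
res <- data.frame(
  gene           = c("geneA", "geneB", "geneC", "geneD", "geneE"),
  log2FoldChange = c(2.0, -1.5, 0.3, NA, 4.0),
  pvalue         = c(1e-6, 1e-3, 0.5, NA, 0.2),
  padj           = c(1e-4, 1e-2, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Keep genes with a usable fold change and raw p-value.
# padj is not used for ranking, so genes with padj = NA are retained.
keep   <- !is.na(res$log2FoldChange) & !is.na(res$pvalue)
ranked <- res[keep, ]

# Guard against p = 0, which would give an infinite score (an assumption of
# this sketch, not part of the answer): floor at the smallest positive double.
p <- pmax(ranked$pvalue, .Machine$double.xmin)

# Ranking metric described in the answer: -log10(p) * logFC.
ranked$score <- -log10(p) * ranked$log2FoldChange

# Sort from most up-regulated to most down-regulated.
ranked <- ranked[order(ranked$score, decreasing = TRUE), ]

# Named numeric vector: one score per gene, names are gene identifiers.
ranks <- setNames(ranked$score, ranked$gene)

# Count tied scores; ideally this is zero or close to it.
n_ties <- sum(duplicated(ranks))

print(ranks)
print(n_ties)
```

A named vector like `ranks` is the shape that, for example, the `stats` argument of `fgsea::fgsea()` expects, and it can also be written out as a two-column, tab-separated .rnk file for the GSEA Preranked tool. If you prefer the Wald statistic, the `stat` column could be swapped in for `score` with the same NA handling.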