Question: How many genes to include for a GSEA analysis
3.8 years ago by
moxu470 wrote:

This seems to be a simple question.

You have a list of genes from a DEG analysis, with p-values, FDRs, logFCs, etc. Previously, for GSEA I would keep only genes with FDR < 0.25 or 0.05, rank them by logFC (in other words, pre-rank the genes by logFC), and then run GSEA. Now I am wondering whether this is a good approach:

  • There might be too many genes (typically ~50%). Assuming that usually 4~5 pathways are involved and each pathway has about 500 genes, the top 2,000 genes might be enough to include for GSEA.
  • I am not sure logFC is the best way to rank genes. Maybe use -log(PValue) as the magnitude of the rank score and the sign of logFC as the sign of the score, i.e., use sign(logFC) * (-log(PValue)) as the rank score?

Googled briefly but didn't find a convention.
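For concreteness, the sign(logFC) * (-log(PValue)) idea could be sketched roughly as follows. This is only an illustration: the gene names and values are made up, base-10 log is an assumption (the base does not change the ordering), and the `min_p` floor is a hypothetical guard against p-values reported as exactly 0.

```python
import math

def signed_rank_score(log_fc, p_value, min_p=1e-300):
    """Return sign(logFC) * -log10(PValue).

    min_p is an assumed floor so that a p-value of 0 does not break the log.
    """
    p = max(p_value, min_p)
    sign = (log_fc > 0) - (log_fc < 0)  # +1, 0 or -1
    return sign * -math.log10(p)

# Hypothetical DEG rows: (gene, logFC, PValue)
deg = [
    ("GENE_A", 2.5, 1e-8),
    ("GENE_B", -0.8, 1e-12),
    ("GENE_C", 4.0, 0.04),
    ("GENE_D", -3.1, 0.5),
]

# Rank every gene (no filtering), most positive score first
ranked = sorted(
    ((gene, signed_rank_score(fc, p)) for gene, fc, p in deg),
    key=lambda pair: pair[1],
    reverse=True,
)

for gene, score in ranked:
    print(f"{gene}\t{score:.3f}")
```

With this score, GENE_B (small logFC, very low p-value) ends up at the extreme negative end, while GENE_C and GENE_D (large logFC, weak p-values) move toward the middle.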


rna-seq next-gen gene • 5.6k views
ADD COMMENTlink modified 3.8 years ago by igor12k • written 3.8 years ago by moxu470

Your first point is asking about a good threshold or filter for your gene list. Typically, this would depend on what you're interested in. For example, you could be interested in only the strongest effects and therefore take only the most extreme logFC. I could also imagine situations in which you are only interested in certain categories of genes, maybe because you have some prior knowledge. On the second point, you have to consider what the parameter used for ranking represents: logFC represents the strength of the effect while log(p-value) represents "unexpectedness". To me, effect strength is more relevant than p-value because, without any other information, I wouldn't trust a small variation even if it is associated with a small p-value. Another way of putting it is that statistical significance doesn't imply biological relevance but a strong effect is likely to have some biological impact.

ADD REPLYlink written 3.8 years ago by Jean-Karim Heriche24k

Please see my reply below to igor -- one of my experiences is that "true signals" (low p-values) should be weighted much more than "big signals" (large abs(logFC)s).

ADD REPLYlink written 3.8 years ago by moxu470
3.8 years ago by
United States
igor12k wrote:

According to the GSEA documentation:

The GSEA algorithm does not filter the expression dataset and does not benefit from your filtering of the expression dataset. During the analysis, genes that are poorly expressed or that have low variance across the dataset populate the middle of the ranked gene list and the use of a weighted statistic ensures that they do not contribute to a positive enrichment score. By removing such genes from your dataset, you may actually reduce the power of the statistic.
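To illustrate why genes in the middle of the list contribute so little, here is a rough sketch of the weighted running-sum statistic as I understand it from the original GSEA description: each gene-set member adds a step proportional to |score|^p, and each non-member subtracts a constant step. The function name, the toy data and the choice p = 1 are assumptions for illustration, and this is not the GSEA implementation itself (no permutation testing or normalization).

```python
def enrichment_score(ranked, gene_set, p=1.0):
    """Maximum deviation from zero of a weighted running sum.

    ranked: list of (gene, score) pairs sorted by score, highest first.
    gene_set: set of gene names.
    """
    hit_weights = [abs(s) ** p if g in gene_set else 0.0 for g, s in ranked]
    total_hit_weight = sum(hit_weights)
    n_miss = sum(1 for g, _ in ranked if g not in gene_set)
    if total_hit_weight == 0 or n_miss == 0:
        raise ValueError("need weighted hits and at least one non-member gene")

    running, best = 0.0, 0.0
    for (gene, _), weight in zip(ranked, hit_weights):
        if gene in gene_set:
            running += weight / total_hit_weight  # step up, scaled by |score|^p
        else:
            running -= 1.0 / n_miss               # constant step down
        if abs(running) > abs(best):
            best = running
    return best

# Toy ranked list: gene C is in the set but sits near zero
ranked = [("A", 3.0), ("B", 2.0), ("C", 0.1), ("D", -0.1), ("E", -2.5)]
es = enrichment_score(ranked, {"A", "C"})
print(round(es, 4))
```

In this toy case the score peaks right after gene A; gene C, with a score near zero, adds only a tiny step and does not move the maximum.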

And additionally in the wiki:

We hopefully will be able to devote some time to investigating this, but in the mean time, we are recommending use of the GSEAPreranked tool for conducting gene set enrichment analysis of data derived from RNA-seq experiments. In particular: Prior to conducting gene set enrichment analysis, conduct your differential expression analysis using any of the tools developed by the bioinformatics community (e.g., cuffdiff, edgeR, DESeq, etc). Based on your differential expression analysis, rank your features and capture your ranking in an RNK-formatted file. The ranking metric can be whatever measure of differential expression you choose from the output of your selected DE tool. For example, cuffdiff provides the (base 2) log of the fold change.
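A minimal sketch of capturing a ranking in an RNK-style file, assuming the format is simply two tab-delimited columns (gene identifier, ranking metric). The gene names and scores are hypothetical, and sorting is done only for readability. All genes are kept, in line with the advice against filtering.

```python
import csv
import io

def write_rnk(scores, handle):
    """Write gene/score pairs as tab-delimited lines, highest score first."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    for gene, score in ordered:
        writer.writerow([gene, f"{score:.6g}"])

# Hypothetical ranking metric per gene (e.g. logFC from a DE tool)
scores = {"GENE_A": 1.8, "GENE_B": -2.4, "GENE_C": 0.05}

buffer = io.StringIO()  # swap for open("ranked_genes.rnk", "w", newline="") to write a file
write_rnk(scores, buffer)
rnk_text = buffer.getvalue()
print(rnk_text, end="")
```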

ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by igor12k

The reason I created this post is that my recent experience hinted that -log(P) is more trustworthy than logFC. One of my projects is to find the pathways differentially regulated by a certain compound in a drug-resistant cell line. We treated the sensitive and resistant cell lines with two different compounds at two different concentrations, in addition to DMSO. GSEA found the same up-regulated pathway for the two compounds: TNFA signaling via NFKB. This makes a lot of sense to us -- and maybe to you as well, because TNFA and NFKB are well-known tumor-related genes. What is interesting is that genes of this pathway occur frequently among the genes with extremely low p-values (7 out of 20), while the genes with extreme logFCs do not include a single gene of this pathway (0/20), even after FDR < 0.05 filtering. Genes with extreme logFCs usually have relatively high p-values. This makes me think that maybe we should weigh "true signals" much more than "big signals".
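The kind of comparison described above could be checked with something like the sketch below, assuming the 7/20 and 0/20 counts refer to the top 20 genes under each ranking. The table, the pathway membership and the helper name `top_n_overlap` are all made up for illustration, and n is set to 2 only because the toy table is tiny.

```python
def top_n_overlap(deg, gene_set, key, n=20):
    """Count gene-set members among the first n rows when sorted by `key`."""
    top = sorted(deg, key=key)[:n]
    return sum(1 for row in top if row["gene"] in gene_set)

# Hypothetical DEG table
deg = [
    {"gene": "G1", "logFC": 0.9, "PValue": 1e-10},
    {"gene": "G2", "logFC": 1.1, "PValue": 1e-9},
    {"gene": "G3", "logFC": 5.0, "PValue": 0.01},
    {"gene": "G4", "logFC": -4.5, "PValue": 0.02},
    {"gene": "G5", "logFC": 0.2, "PValue": 0.6},
]
pathway = {"G1", "G2"}  # hypothetical pathway membership

# Smallest p-values first vs. largest |logFC| first
by_p = top_n_overlap(deg, pathway, key=lambda r: r["PValue"], n=2)
by_fc = top_n_overlap(deg, pathway, key=lambda r: -abs(r["logFC"]), n=2)
print(by_p, by_fc)
```

On a real table the same call with n=20 would give the two counts to compare.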

ADD REPLYlink written 3.8 years ago by moxu470

It depends on how you are calculating fold changes. If you have an extreme outlier, it can push the fold change up or down by a lot. DESeq2, for example, performs "shrinkage" of the fold change to account for variance. In a way, the fold change has the significance built in in that case. I am not sure which other packages perform the same type of adjustment.

ADD REPLYlink written 3.8 years ago by igor12k

I used edgeR, which does Bayes shrinkage as well. But, still ...

Too bad this website does not host images; otherwise I would be happy to upload some to demonstrate.

ADD REPLYlink written 3.8 years ago by moxu470

Actually, it does host images.

ADD REPLYlink modified 20 months ago • written 20 months ago by Ömer An220