Question: GSEA Preranked analysis on RNA-seq data
4.8 years ago by
United States
Ron990 wrote:

Hi all,



I have run GSEA Preranked on genes differentially expressed between tumors and normals. The genes are ordered by log fold change, since GSEA Preranked requires a ranked list of genes.

The results are reported with the phenotypes na_pos and na_neg. I am not sure what these mean: how does GSEA distinguish two phenotypes from a single ranked list of genes?

I understand that standard GSEA, run on tumor and normal expression values, reports enrichment for each of those two phenotypes, but I am not sure what the phenotypes mean in GSEA Preranked.

In my case, the input to GSEA Preranked is the gene symbols together with their fold change values, which are both positive and negative.





modified 4.8 years ago by poisonAlien2.8k • written 4.8 years ago by Ron990

The direction of gene expression is critical. It is very likely that you have different gene sets in either direction.

Ranking by significance is better than ranking by fold change; just think about lowly expressed genes with extreme fold changes and high p-values. Check out this NAR paper, which discusses the difference.

You can generate a rank file with a simple awk script.
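
The awk script itself is not shown in this thread. As a rough illustration of the idea, here is a minimal Python sketch that builds a rank list from differential expression results. It assumes the ranking metric is the sign of the log fold change multiplied by -log10 of the p-value, which is one common way to combine direction and significance; that metric, the function names, and the example gene names are all assumptions for illustration, not something stated above.

```python
import csv
import io
import math


def signed_significance(log_fc, p_value, floor=1e-300):
    """Score = sign(log_fc) * -log10(p_value).

    `floor` (an arbitrary choice) avoids log10(0) when a tool reports p = 0.
    """
    if log_fc == 0:
        return 0.0
    magnitude = -math.log10(max(p_value, floor))
    return math.copysign(magnitude, log_fc)


def build_rank_table(rows):
    """rows: iterable of (gene, log_fc, p_value) tuples.

    Returns (gene, score) pairs sorted from most positive to most negative.
    """
    ranked = [(gene, signed_significance(lfc, p)) for gene, lfc, p in rows]
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


def write_rnk(ranked, handle):
    """Write two tab-separated columns (gene, score), one gene per line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for gene, score in ranked:
        writer.writerow([gene, f"{score:.6g}"])


if __name__ == "__main__":
    # Hypothetical input: GENE_B has a huge fold change but a weak p-value.
    example = [
        ("GENE_A", 2.0, 1e-10),
        ("GENE_B", 8.0, 0.5),
        ("GENE_C", -1.5, 1e-6),
    ]
    buffer = io.StringIO()
    write_rnk(build_rank_table(example), buffer)
    print(buffer.getvalue(), end="")
```

With this metric, a gene with an extreme fold change but a weak p-value ends up near the middle of the list rather than at the top, which is the point made above.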

written 4.8 years ago by mark.ziemann1.2k

I am wondering if ranking genes by p-values gives rise to another nasty bias: genes with higher read counts (either because they are large or more highly expressed) yield lower p-values, simply because any statistical test will have more power with a larger number of reads. This would lead to artificially high enrichments of gene sets containing either large or highly expressed genes (or both). From my own GSEA results using p-value pre-ranked gene lists, I think that I indeed observe this trend, although I do not have hard data yet.

Thus, both solutions -- ranking by fold change and by p-value -- are probably not perfect. Any suggestions to do better?

modified 6 weeks ago by RamRS25k • written 4.2 years ago by Christian2.8k

I agree that neither option is ideal, but significance-based ranking at least has some statistical basis. Fold change is too susceptible to noise for lowly expressed genes. Another approach could be to rank by the lower bound of the confidence interval of the fold change. These approaches all need to be compared head to head, IMO.
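
To make the confidence-interval idea concrete, here is a minimal Python sketch under stated assumptions: each gene has a log2 fold change and a standard error for it (for example, the `lfcSE` column that DESeq2 reports), and a normal-approximation interval of estimate ± z × SE is acceptable. The function name and the choice to score a gene as 0 when its interval spans zero are illustrative assumptions, not a tested method.

```python
def conservative_lfc(log_fc, lfc_se, z=1.96):
    """Return the confidence-interval bound closest to zero.

    Uses a normal approximation: log_fc +/- z * lfc_se (z=1.96 for ~95%).
    If the interval spans zero, the gene gets a score of 0.0.
    """
    lower = log_fc - z * lfc_se
    upper = log_fc + z * lfc_se
    if lower > 0:
        return lower   # confidently up-regulated
    if upper < 0:
        return upper   # confidently down-regulated
    return 0.0         # direction not resolved at this confidence level


if __name__ == "__main__":
    # Hypothetical genes: (name, log2 fold change, standard error)
    genes = [
        ("GENE_A", 3.0, 0.5),   # moderate change, precise estimate
        ("GENE_B", 6.0, 4.0),   # extreme change, very noisy estimate
        ("GENE_C", -2.0, 0.5),  # down-regulated, precise estimate
    ]
    scored = sorted(
        ((name, conservative_lfc(lfc, se)) for name, lfc, se in genes),
        key=lambda item: item[1],
        reverse=True,
    )
    for name, score in scored:
        print(f"{name}\t{score:.3f}")
```

One caveat with this sketch: every gene whose interval spans zero ties at 0.0, and how ties are ordered in the ranked list may affect the enrichment result.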

written 4.1 years ago by mark.ziemann1.2k

That sounds worth a try. How would you compute a CI for RNA-seq fold changes? I have not seen them in the output of, e.g., edgeR or DESeq2.

written 4.1 years ago by Christian2.8k

Some thoughts from Gordon Smyth on the issue. Very useful.

written 4.1 years ago by mark.ziemann1.2k
4.8 years ago by
poisonAlien2.8k wrote:

Actually, GSEA Preranked does not know which phenotype is which, because no .cls file is provided as it would be for standard GSEA; hence the phenotype label is "na". But, as you mentioned, based on the log fold changes it treats genes with positive fold changes as phenotype 1 (na_pos) and those with negative fold changes as phenotype 2 (na_neg). (I think you can change this by reversing the rank order, but I am not sure.)

written 4.8 years ago by poisonAlien2.8k