Question

GSEA Preranked analysis on RNA-SEQ

10.3 years ago

Ron ★ 1.2k

Hi all,

I have done GSEA pre-ranked on differentially expressed genes between tumors and normals. The genes have been ordered by log fold change since GSEA pre-ranked requires an ordered list of genes.

The results are reported with the phenotypes labeled na_pos and na_neg. I am not sure what these phenotypes mean: how does GSEA distinguish between the two phenotypes when it is given only an ordered list of genes?

I understand that standard GSEA, when run on tumor and normal expression values, reports enrichment for each of these two phenotypes, but I am not sure what the phenotype means in GSEA pre-ranked.

In my case, the input to GSEA pre-ranked is the gene symbols together with their fold change values, which are both positive and negative.

-Ron

GSEA differential-expression next-gen rna-seq • 19k views

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by Ron ★ 1.2k

The direction of gene expression is critical. It is very likely that you have different gene sets in either direction.

Ranking by significance is better than ranking by fold change; just think about lowly expressed genes with extreme fold changes and high p-values. Check out this NAR paper, which discusses the difference.

http://nar.oxfordjournals.org/content/38/17/e169.long

You can generate a rank file with a simple awk script.

http://genomespot.blogspot.com.au/2015/01/how-to-generate-rank-file-from-gene.html
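As a rough illustration of significance-based ranking, here is a minimal Python sketch. It assumes one common scoring choice, the sign of the log fold change multiplied by -log10(p-value), and writes a two-column, tab-delimited rank file. The function names (`build_rank_lines`, `write_rnk`) and the example genes are hypothetical, and this is not necessarily the exact procedure of the linked awk script.

```python
import csv
import io
import math


def build_rank_lines(rows):
    """Score each (gene, log2fc, pvalue) row and sort by score, descending.

    Assumed score: sign(log2fc) * -log10(pvalue), so strongly significant
    up-regulated genes land at the top and strongly significant
    down-regulated genes at the bottom.
    """
    scored = []
    for gene, log2fc, pvalue in rows:
        # Guard against p == 0, which would otherwise give an infinite score.
        p = max(pvalue, 1e-300)
        sign = 1.0 if log2fc >= 0 else -1.0
        scored.append((gene, sign * -math.log10(p)))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored


def write_rnk(scored, handle):
    """Write gene<TAB>score lines to an open text handle."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for gene, score in scored:
        writer.writerow([gene, f"{score:.4f}"])


# Made-up example rows: (gene, log2 fold change, p-value).
example = [
    ("GENE_A", 2.5, 1e-8),
    ("GENE_B", -1.2, 1e-4),
    ("GENE_C", 6.0, 0.5),   # extreme fold change, but not significant
    ("GENE_D", -0.3, 0.9),
]
ranked = build_rank_lines(example)

buffer = io.StringIO()
write_rnk(ranked, buffer)
print(buffer.getvalue())
```

Note how GENE_C, despite having the largest fold change, ends up near the middle of the list because its p-value is high.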

ADD REPLY • link 10.3 years ago by mark.ziemann ★ 2.0k

I am wondering if ranking genes by p-values gives rise to another nasty bias: genes with higher read counts (either because they are large or more highly expressed) yield lower p-values, simply because any statistical test will have more power with a larger number of reads. This would lead to artificially high enrichments of gene sets containing either large or highly expressed genes (or both). From my own GSEA results using p-value pre-ranked gene lists, I think that I indeed observe this trend, although I do not have hard data yet.

Thus, both solutions -- ranking by fold change and by p-value -- are probably not perfect. Any suggestions to do better?

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 9.7 years ago by Christian ★ 3.1k

I agree that neither option is ideal, but significance-based ranking at least has some statistical basis. Fold change is too susceptible to noise for lowly expressed genes. Another approach could be to rank by the lower bound of the confidence interval of the fold change. These all need to be compared in a bake-off, IMO.

ADD REPLY • link 9.6 years ago by mark.ziemann ★ 2.0k

That sounds worth a try. How would you compute a CI for RNA-seq fold changes? I have not seen them in the output of, e.g., edgeR or DESeq2.
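One possible way to approximate such an interval, sketched in Python under stated assumptions: a per-gene standard error of the log2 fold change is available (DESeq2 results include an `lfcSE` column, for example), and a normal (Wald-type) approximation with z = 1.96 is acceptable. The function name `conservative_lfc` and the example values are hypothetical. "Lower confidence interval" is interpreted here as the interval bound closest to zero, with genes whose interval spans zero scored as 0.

```python
def conservative_lfc(log2fc, lfc_se, z=1.96):
    """Return the confidence-interval bound closest to zero.

    Assumes an approximate 95% interval of log2fc +/- z * lfc_se.
    If the interval contains zero, the gene gets a score of 0.
    """
    lower = log2fc - z * lfc_se
    upper = log2fc + z * lfc_se
    if lower > 0:
        return lower   # confidently up-regulated
    if upper < 0:
        return upper   # confidently down-regulated
    return 0.0         # direction not resolved at this confidence level


# Made-up values: gene -> (log2 fold change, standard error).
genes = {
    "GENE_A": (3.0, 0.5),
    "GENE_B": (-2.0, 0.4),
    "GENE_C": (5.0, 4.0),   # large but very noisy fold change
}
scores = {gene: conservative_lfc(lfc, se) for gene, (lfc, se) in genes.items()}
for gene, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(gene, round(score, 3))
```

A side effect of this scoring is that many genes tie at 0, which may or may not be desirable in a ranked list; that would be part of what a comparison of ranking schemes has to evaluate.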

ADD REPLY • link 9.6 years ago by Christian ★ 3.1k

Some thoughts from Gordon Smyth on the issue. Very useful.

https://support.bioconductor.org/p/61640/

ADD REPLY • link 9.6 years ago by mark.ziemann ★ 2.0k

Ram · Answer 1 · 2015-04-02

Actually, it does not know which phenotype is which: since we do not provide a cls file as one would for standard GSEA, the phenotype label is na. But, as you mentioned, based on the log fold changes it assumes that genes with positive fold changes belong to phenotype 1 (na_pos) and those with negative fold changes to phenotype 2 (na_neg). (I think you can swap these by reversing the rank order, but I am not sure.)
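To make the na_pos / na_neg split concrete, here is a simplified Python sketch of a GSEA-style running-sum enrichment score. It is a toy version of the published statistic, not the GSEA software's implementation: it omits permutation testing and normalization, and the names `enrichment_score` and `report_label` are hypothetical. The point it illustrates is that a gene set concentrated at the positive end of the ranked list gets a positive score (na_pos), one concentrated at the negative end gets a negative score (na_neg), and flipping the sign of the ranking metric swaps the two.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """Signed maximum deviation of a weighted running sum.

    ranked: list of (gene, score) tuples sorted by score, descending.
    Hits step the sum up by |score|**weight (normalized over all hits);
    misses step it down by 1 / (number of genes not in the set).
    """
    hits = [abs(score) ** weight if gene in gene_set else 0.0
            for gene, score in ranked]
    total_hit = sum(hits)
    n_miss = sum(1 for gene, _ in ranked if gene not in gene_set)
    if total_hit == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for (gene, _), hit in zip(ranked, hits):
        if gene in gene_set:
            running += hit / total_hit
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def report_label(es):
    """Map the sign of the enrichment score to the pre-ranked report name."""
    if es > 0:
        return "na_pos"
    if es < 0:
        return "na_neg"
    return "undetermined"


# Toy ranked list: positive scores at the top, negative at the bottom.
ranked = [("G1", 3.0), ("G2", 2.0), ("G3", 1.0),
          ("G4", -1.0), ("G5", -2.0), ("G6", -3.0)]
up_set = {"G1", "G2"}
down_set = {"G5", "G6"}

print(report_label(enrichment_score(ranked, up_set)))
print(report_label(enrichment_score(ranked, down_set)))

# Reversing the rank order (negating every score) swaps the labels.
flipped = sorted(((g, -s) for g, s in ranked), key=lambda x: x[1], reverse=True)
print(report_label(enrichment_score(flipped, up_set)))
```

So if the ranking metric is log fold change of tumor versus normal, na_pos corresponds to gene sets enriched among genes higher in tumor, and na_neg to gene sets enriched among genes higher in normal.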