Question

topGO: which statistical test (Fisher or KS) to use?

1

7.0 years ago

mary ▴ 10

Dear Everyone,

I am using differentially expressed genes from an RNA-seq dataset, tested with edgeR. From my "gene universe" (circa 800 genes) I picked the most significantly up-regulated ones (circa 300 at the 1% threshold), which I use as the "geneList" object for the enrichment analysis with topGO. The vast majority of genes in the "gene universe" are annotated with 1 to 8 GO terms.

Now that the topGOdata object is set up, I am struggling with the choice of test statistic (either KS or Fisher). I have gone through the topGO manual so many times that I know the related paragraphs almost by heart, and I have visited every blog I could find on the topic.

To summarize, as I understand it, I should choose my test among the following (from p.5 of the manual):

- Fisher test: for count data
- KS test: for modified data (e.g. p-values)

So I concluded that, with my dataset, it seemed logical to use the KS test. But in all the literature with the same kind of dataset, people publish results of the Fisher test.

The choice of statistical test in topGO is still very cryptic to me. I probably do not understand well enough how the different tests run and produce their outputs.

Could anyone give me some advice on how to choose the test with topGO and/or how to set up the data (geneList) accordingly?

Thank you very much for any comment or link to clarify this!

P.S. In more technical detail: my "geneList" data is the logFC-pruned FDR values output by the DE analysis with edgeR. From this I separate the most significant (by FDR) down-regulated genes from the up-regulated ones, and I want to perform the GO enrichment on each of these subsets.

RNA-Seq topGO statistics geneList • 9.1k views

updated 6.9 years ago by nitsuaq ▴ 110 • written 7.0 years ago by mary ▴ 10

0


You use Fisher's exact test on a contingency table, i.e. a table of counts, which is normally what you get when doing a GO term analysis. The Kolmogorov-Smirnov test compares two distributions. What makes you think that your data is unsuitable for an enrichment test using Fisher's exact test?

7.0 years ago by Jean-Karim Heriche 27k

0


Most tools use Fisher's exact (hypergeometric) test to assess GO term enrichment.

6.9 years ago by cpad0112 21k

score 8 · Answer 1 · 2017-06-03

First, is there a reason your gene universe is so small? Bigger gene universes make enrichment analyses more stable and reproducible. Depending on where you are getting your data, I would say 800 genes is very low for a gene universe. For human tissue I expect at least several thousand, otherwise I start questioning the quality of the data.

Fisher and KS are just two ways of answering the same question: are the most significant genes enriched for any particular GO term annotations?

Fisher's exact test compares the expected number of significant genes at random to the observed number of significant genes to arrive at a probability.
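To make the counting concrete, here is a minimal sketch in plain Python of the one-sided Fisher (hypergeometric) calculation for a single GO term. It is only an illustration of the idea, not topGO's code: the function name and the example counts (a term annotating 40 of the 800 universe genes, 25 of them among the 300 significant genes) are made up for the example.

```python
from math import comb


def fisher_enrichment_p(n_universe, n_significant, n_annotated, n_sig_annotated):
    """One-sided Fisher's exact p-value for over-representation.

    Probability of seeing at least `n_sig_annotated` annotated genes among
    the significant genes if the significant set were drawn at random
    from the universe (hypergeometric upper tail).
    """
    total = comb(n_universe, n_significant)
    upper = min(n_significant, n_annotated)
    tail = sum(
        comb(n_annotated, k) * comb(n_universe - n_annotated, n_significant - k)
        for k in range(n_sig_annotated, upper + 1)
    )
    return tail / total


# Hypothetical counts, chosen only for illustration.
universe, significant, annotated, observed = 800, 300, 40, 25

# Number of annotated genes expected among the significant ones by chance.
expected = significant * annotated / universe
p_value = fisher_enrichment_p(universe, significant, annotated, observed)
print(f"expected {expected:.1f}, observed {observed}, p = {p_value:.3g}")
```

Note that the only per-gene input is whether the gene passed the significance cutoff; the actual p-values are not used once the cutoff has been applied.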

The KS test compares the distribution of gene p-values expected at random to the observed distribution of the gene p-values to arrive at a probability. KS is theoretically the better choice because it does not require an arbitrary p-value threshold.
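As a rough sketch of what "comparing distributions" means here, the snippet below computes the two-sample KS statistic D (the largest gap between two empirical CDFs) for the p-values of genes annotated to a term versus all other genes. This is an assumption about how to illustrate the idea, not a copy of topGO's implementation; the function name and the p-values are invented, and turning D into a p-value is left to a statistics library.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic D.

    D is the maximum absolute difference between the empirical CDFs
    of the two samples, evaluated at every observed value.
    """
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical gene-level p-values, for illustration only.
term_pvalues = [0.001, 0.004, 0.01, 0.03]      # genes annotated to the term
other_pvalues = [0.2, 0.35, 0.5, 0.7, 0.9]     # all remaining genes

d_stat = ks_statistic(term_pvalues, other_pvalues)
print(f"KS statistic D = {d_stat}")
```

Here every gene's score contributes to the result, which is why no significance cutoff is needed.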

Based on my most recent project, however, the Fisher test with p<0.01 and the weight01 algorithm seemed to identify the informative GO terms, whereas KS with weight01 tended to identify very basic GO terms like biological process or cellular process. This could be particular to my dataset, so I would suggest trying both and seeing which gives you more informative GO terms.