Question: Fisher's exact test gives p-value 0

Hello,

My situation is similar to the one described in this post: Hypergeometric Test On Gene Set

I have 2 microarrays for 2 different conditions, which give me 2 different gene sets of differentially expressed transcripts.

Diff in Condition 1: 738

Diff in Condition 2: 1090

Overlap Condition 1 & 2: 453

Total Genes in array: 30941

I want to test the significance of the overlap between the 2 conditions. I use:

phyper(452, 738, 30203, 1090, lower.tail=FALSE)

[1] 0

Any idea why the p-value is 0? I tried based on this post "http://stats.stackexchange.com/questions/16247/calculating-the-probability-of-gene-list-overlap-between-an-rna-seq-and-a-chip-c"

phyper(overlap, list1, PopSize - list1, list2, lower.tail = FALSE)

Thanks

written 4.5 years ago by Adrian Pelin

You should try using log.p = TRUE

I get:

phyper(452, 738, 30203, 1090, lower.tail=FALSE, log.p = TRUE)

[1] -1140.21

Any idea what that means? p-value = 1E-1140?

It means p = e^-1140.21, since the log here is the natural log. In base 10 that is about 10^-495, not 1E-1140.
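To read the log p-value in base 10, something like the following might help. It is only a sketch: the variable names are mine, and it reuses the numbers from the question. The idea is to divide the natural-log value by log(10) and split it into a mantissa and an exponent, so the tiny number never has to exist as an ordinary double.

```r
# Natural-log p-value, exactly as computed above.
log_p <- phyper(452, 738, 30203, 1090, lower.tail = FALSE, log.p = TRUE)

# Convert from natural log to log base 10.
log10_p <- log_p / log(10)

# Split into mantissa and exponent without ever forming the number itself.
exponent <- floor(log10_p)
mantissa <- 10^(log10_p - exponent)
print(sprintf("p ~ %.2fe%d", mantissa, as.integer(exponent)))

# Smallest normalized positive double on this machine, for comparison.
# Anything far below it underflows to 0, which is why phyper printed 0.
print(.Machine$double.xmin)
print(exp(log_p) == 0)
```

So the 0 is a limit of floating-point representation rather than an exact zero probability.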

That number is still 0 when using any calculator. My question is, why is the p-value so low? The overlap is not that great, it is ~50-70% of genes. Is the 2x2 table constructed correctly?


You're calculating the probability of the following scenario:

• You have a jar of 30203 black balls and 738 white balls
• You draw 1090 of them randomly without replacement
• You count the number of white balls you have drawn and compare it with 452 (the observed overlap of 453, minus one)
• The probability of drawing more than 452 white balls (that is, 453 or more) given your conditions is virtually zero
• Conversely, the probability of drawing 452 or fewer white balls given your conditions is virtually one
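If the "overlap minus one" convention looks suspicious, it can be checked on a toy jar small enough that nothing underflows. The numbers below are made up for illustration and are not from the arrays in the question.

```r
# Toy jar (hypothetical numbers): 8 white balls, 12 black balls, draw 6.
white <- 8
black <- 12
drawn <- 6
observed <- 4  # white balls seen in the draw

# Upper tail as phyper computes it: P(X > observed - 1) = P(X >= observed).
upper_tail <- phyper(observed - 1, white, black, drawn, lower.tail = FALSE)

# The same probability, summed explicitly over every outcome >= observed.
explicit <- sum(dhyper(observed:min(white, drawn), white, black, drawn))

# The complementary lower tail: P(X <= observed - 1).
lower_tail <- phyper(observed - 1, white, black, drawn)

print(all.equal(upper_tail, explicit))
print(all.equal(upper_tail + lower_tail, 1))
```

Both checks should agree, which is the same relationship the bullets describe for 452 and 453.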

In a jar where ~2.4% of the balls are white, a random draw of 1090 balls would be expected to contain only about 26 white ones (1090 × 738 / 30941). Drawing 453 white balls by chance alone would be extraordinarily rare, which is why your p-value is so low.
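For the question about the 2x2 table, here is a rough sketch of how I would lay it out from the four numbers in the question, together with the overlap expected by chance. The variable names and the row/column labels are my own.

```r
# Numbers from the question.
total   <- 30941  # genes on the array
set1    <- 738    # differential in condition 1
set2    <- 1090   # differential in condition 2
overlap <- 453    # differential in both

# Overlap expected if the two lists were independent, and the fold enrichment.
expected <- set1 * set2 / total
fold     <- overlap / expected
print(c(expected = expected, fold = fold))

# 2x2 table: rows = condition 2, columns = condition 1 (filled column-wise).
tab <- matrix(
  c(overlap,                           # differential in both
    set1 - overlap,                    # condition 1 only
    set2 - overlap,                    # condition 2 only
    total - set1 - set2 + overlap),    # differential in neither
  nrow = 2,
  dimnames = list(cond2 = c("diff", "not diff"),
                  cond1 = c("diff", "not diff"))
)
print(tab)
```

The margins of this table are the same 738, 30203 and 1090 passed to phyper, so the call in the question matches it. A one-sided fisher.test(tab, alternative = "greater") tests the same hypothesis, so I would expect its p-value to underflow in the same way.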


The overlap is not that great, it is ~50-70% of genes

That's why I think p-values in genomics are often meaningless. You get very small p-values even if the effect size is small, and this is a consequence of the large size of the data sets available (thousands of genes, millions of SNPs, etc.). By the way, I wouldn't say ~50-70% is a small overlap...