Question

Extremely large difference between the two sample sizes in a Mann–Whitney U test

0

9.4 years ago

billzt ▴ 20

I need to compare the derived allele frequency spectrum of my studied mutations with that of synonymous SNPs. I have only 14 studied mutations, while there are up to 1 million genome-wide synonymous SNPs. The two sample sizes are therefore very different, and applying the Mann-Whitney U test to them directly shows, unsurprisingly, no significant difference. How do I know whether the non-significant result is due to the small sample size or because the two samples truly do not differ?

sample-size SNP Mann-Whitney-U-test • 6.1k views

Updated 2.1 years ago by Ram 43k • written 9.4 years ago by billzt ▴ 20

Ram · Answer 1 · 2014-12-30

2

9.4 years ago

mikhail.shugay 3.5k

You could perform permutation testing: select 14 SNPs at random from those 1 million, say 10,000 times, and build a histogram of the means or medians of their allele frequencies. The number of times a random draw gives a mean (or median) allele frequency larger than the one observed for your 14 mutations, divided by 10,000, will be the P-value.
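A minimal R sketch of this resampling idea follows. The vectors `background_daf` and `studied_daf` are simulated placeholders (an assumption, not real data) standing in for the derived allele frequencies of the synonymous SNPs and of the 14 studied mutations. The comparison is one-sided, and the `+ 1` correction is a common convention I have added so the estimate is never exactly zero; it is not part of the description above.

```r
# Hypothetical placeholder data: replace with the real derived allele frequencies.
set.seed(42)
background_daf <- rbeta(1e5, 0.5, 2)  # stand-in for the ~1 million synonymous SNPs
studied_daf    <- rbeta(14, 0.5, 1)   # stand-in for the 14 studied mutations

n_resamples   <- 10000
observed_mean <- mean(studied_daf)

# Mean DAF of 14 SNPs drawn at random from the background, repeated n_resamples times.
null_means <- replicate(
  n_resamples,
  mean(sample(background_daf, size = length(studied_daf)))
)

# One-sided empirical p-value: fraction of random draws at least as large as observed.
p_empirical <- (sum(null_means >= observed_mean) + 1) / (n_resamples + 1)
p_empirical

# Histogram of the resampled means with the observed mean marked in red.
hist(null_means, breaks = 50, xlab = "Mean DAF of 14 random SNPs",
     main = "Resampled background means")
abline(v = observed_mean, col = "red", lwd = 2)
```

Swapping `mean` for `median` in both places gives the median-based version.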

You can also build the allele frequency distribution for all your SNPs and see how large the allele frequencies of your 14 SNPs are with respect to it.

Visualizing allele frequency distributions could give you some insight on what is happening in your data.
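One way to do this in R, as a rough sketch: compute the empirical CDF of the background frequencies, read off where each of the 14 values falls, and mark them on a histogram. Again, `background_daf` and `studied_daf` are simulated placeholders (an assumption), not real data.

```r
# Hypothetical placeholder data: replace with the real derived allele frequencies.
set.seed(42)
background_daf <- rbeta(1e5, 0.5, 2)  # stand-in for the synonymous SNPs
studied_daf    <- rbeta(14, 0.5, 1)   # stand-in for the 14 studied mutations

# Empirical CDF of the background; evaluating it at each studied value gives
# the fraction of background SNPs with a frequency at or below that value.
background_ecdf     <- ecdf(background_daf)
studied_percentiles <- background_ecdf(studied_daf)
round(sort(studied_percentiles), 3)

# Background distribution with the 14 studied values marked along the x axis.
hist(background_daf, breaks = 50, xlab = "Derived allele frequency",
     main = "Background DAF with studied mutations marked")
rug(studied_daf, col = "red", lwd = 2)
```

If most of the 14 percentiles sit near one end of the range, that points to a shift even when a formal test lacks the power to show it.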

Updated 2.1 years ago by Ram 43k • written 9.4 years ago by mikhail.shugay 3.5k

0

Thank you, I'll try it.

Updated 2.1 years ago by Ram 43k • written 9.4 years ago by billzt ▴ 20

Ram · Answer 2 · 2014-12-31

"Therefore these two sample size are largely different and directly applied Mann-Whitney U test to them is of course no significant difference."

I don't think the lack of a significant difference is due to the different sample sizes. Why should that be the case? (In fact, I don't see the point in resampling SNPs from the large set.) Rather, the small dataset reduces power so much that the difference you see is not significant.

"How do I know that the non-significant result is due to small sample size or due to that these two samples are truly no difference?"

These are two sides of the same coin. The difference you observe is not significant because the sample size is not large enough. With huge sample sizes, even tiny differences would produce very small p-values; in that case the question would be "Is this difference biologically meaningful?"

This is to illustrate the point. Produce two sets that differ by a small amount. The difference is highly significant because the sample sizes are large. If you downsample one set, the difference is no longer significant:

set.seed(1)
set1<- rbeta(n= 10000, 10, 10)

set.seed(2)
set2<- rbeta(n= 10000, 10, 9.5)

# The difference between set1 and set2 is significant even though it is small:

mean(set1); mean(set2)
# [1] 0.5006102
# [1] 0.5112549
wilcox.test(set1, set2)
# p-value = 1.023e-11

# Now reduce one set to 14 obs:
set.seed(3)
wilcox.test(sample(set1, size= 14), set2)
# p-value = 0.4413
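A single downsample gives only one p-value. To get a rough estimate of power, the downsampling can be repeated many times and the fraction of significant results counted. This is a sketch that reuses the same simulated sets as above; the number of repetitions (1000) and the 0.05 threshold are my own assumptions.

```r
# Same simulated sets as in the example above.
set.seed(1)
set1 <- rbeta(n = 10000, 10, 10)
set.seed(2)
set2 <- rbeta(n = 10000, 10, 9.5)

# Repeat the 14-observation downsample and collect the p-value each time.
set.seed(3)
n_reps   <- 1000
p_values <- replicate(
  n_reps,
  wilcox.test(sample(set1, size = 14), set2)$p.value
)

# Estimated power: fraction of downsamples that reach p < 0.05.
mean(p_values < 0.05)
```

With an effect this small and only 14 observations, I would expect that fraction to be low, which is the point: a non-significant result from 14 observations says little about whether the two sets truly differ.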

Ram · Answer 3 · 2014-12-30

0

9.4 years ago

Asaf 10k

You can select a small random set of SNPs from the million and run them against your test set.

1

Just be sure to repeat this random selection to see how variable it is. Set up a loop to repeat 100 times and make a histogram of the results. Then set it to 10,000 and go to lunch.
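Such a loop could look roughly like this in R. The data vectors are simulated placeholders and `subset_size` is a hypothetical choice for the "small random set" (all assumptions, not values from the thread).

```r
# Hypothetical placeholder data: replace with the real derived allele frequencies.
set.seed(7)
background_daf <- rbeta(1e5, 0.5, 2)  # stand-in for the synonymous SNPs
studied_daf    <- rbeta(14, 0.5, 1)   # stand-in for the 14 studied mutations

n_reps      <- 100   # raise to 10000 for the long run
subset_size <- 100   # hypothetical size of each random background subset

# Test the studied set against a fresh random background subset on each pass.
p_values <- numeric(n_reps)
for (i in seq_len(n_reps)) {
  random_subset <- sample(background_daf, size = subset_size)
  p_values[i]   <- wilcox.test(studied_daf, random_subset)$p.value
}

# How much do the p-values vary from one random subset to the next?
summary(p_values)
hist(p_values, breaks = 20, xlab = "Mann-Whitney p-value",
     main = "p-values across random background subsets")
```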

Updated 2.1 years ago by Ram 43k • written 9.4 years ago by karl.stamm 4.1k

0

Thank you. I'll try it.

Updated 2.1 years ago by Ram 43k • written 9.4 years ago by billzt ▴ 20