Question: Extremely large difference between the two sample sizes in a Mann–Whitney U test
4.9 years ago by
billzt20
Australia
billzt20 wrote:

I need to compare the derived allele frequency spectrum of my studied mutations with that of synonymous SNPs. The number of studied mutations is very small, only 14, while the number of genome-wide synonymous SNPs is up to 1 million. The two sample sizes therefore differ enormously, and applying the Mann–Whitney U test to them directly of course shows no significant difference. How can I tell whether the non-significant result is due to the small sample size or because the two samples truly do not differ?

4.9 years ago by
mikhail.shugay3.4k
Czech Republic, Brno, CEITEC
mikhail.shugay3.4k wrote:

You could perform permutation testing: select 14 SNPs at random from the 1 million, say 10,000 times, and build a histogram of their allele frequency means or medians. The number of times a random set gives a mean (or median) at least as large as that of your 14 SNPs, divided by 10,000, is the one-sided P-value.
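The resampling procedure described here could be sketched in R roughly as follows. The vectors `background_daf` and `observed_daf` are hypothetical stand-ins for the genome-wide synonymous SNP frequencies and the 14 studied mutations; they are simulated from beta distributions purely for illustration. Adding 1 to the numerator and denominator is a common convention that keeps a Monte Carlo P-value away from exactly zero, and is an assumption not stated in the answer.

```r
set.seed(42)

# Hypothetical data: replace with real derived allele frequencies.
background_daf <- rbeta(1e5, 0.5, 2)  # stand-in for the synonymous SNPs
observed_daf   <- rbeta(14, 1, 2)     # stand-in for the 14 studied mutations

n_obs    <- length(observed_daf)
n_iter   <- 10000
obs_stat <- mean(observed_daf)

# Null distribution: mean of n_obs SNPs drawn at random from the background.
null_stats <- replicate(n_iter, mean(sample(background_daf, n_obs)))

# One-sided Monte Carlo P-value: fraction of random draws whose mean is
# at least as large as the observed one (+1 correction avoids p = 0).
p_value <- (sum(null_stats >= obs_stat) + 1) / (n_iter + 1)
p_value

# Histogram of the null distribution with the observed mean marked.
hist(null_stats, breaks = 50,
     main = "Means of 14 random background SNPs",
     xlab = "Mean derived allele frequency")
abline(v = obs_stat, col = "red")
```

Swapping `mean` for `median` gives the median-based version; for a two-sided test, one option is to count draws at least as far from the centre of `null_stats` as the observed value.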

You can also build the allele frequency distribution for all your SNPs and see where the allele frequencies of your 14 SNPs fall relative to it.

Visualizing the allele frequency distributions could give you some insight into what is happening in your data.
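One possible way to do both things in R is an empirical CDF plus a histogram with the 14 values marked on it. This is only a sketch: `background_daf` and `observed_daf` are hypothetical names, and the data are simulated for illustration.

```r
set.seed(7)

# Hypothetical data: replace with real derived allele frequencies.
background_daf <- rbeta(1e5, 0.5, 2)
observed_daf   <- rbeta(14, 1, 2)

# Empirical CDF of the background: maps a frequency to the fraction of
# background SNPs with a frequency less than or equal to it.
background_ecdf <- ecdf(background_daf)
percentiles <- background_ecdf(observed_daf)
round(sort(percentiles), 3)

# Background distribution with the 14 observed values marked on the axis.
hist(background_daf, breaks = 50, freq = FALSE,
     main = "Background DAF with studied SNPs marked",
     xlab = "Derived allele frequency")
rug(observed_daf, col = "red", lwd = 2)
```

If most of the 14 percentiles sit near 0 or near 1, the studied SNPs are concentrated in one tail of the background distribution.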


Thank you, I'll try it.

Reply written 4.9 years ago by billzt20
4.9 years ago by
dariober10k
WCIP | Glasgow | UK
dariober10k wrote:

The two sample sizes therefore differ enormously, and applying the Mann–Whitney U test to them directly of course shows no significant difference.

I don't think the lack of a significant difference is due to the difference in sample sizes; why should that be the case? (In fact, I don't see the point in resampling SNPs from the large set.) Rather, the small dataset reduces power so much that the difference you see is not significant.

How can I tell whether the non-significant result is due to the small sample size or because the two samples truly do not differ?

These are two sides of the same coin. The difference you observe is not significant because the sample size is not large enough to detect it. With huge sample sizes, even tiny differences would produce very small p-values; in that case the question would be "Is this difference biologically meaningful?"

To illustrate the point, produce two sets that differ by a small amount. With large sample sizes, the difference is highly significant. If you downsample one set to 14 observations, the difference is no longer significant:

set.seed(1)
set1<- rbeta(n= 10000, 10, 10)

set.seed(2)
set2<- rbeta(n= 10000, 10, 9.5)

# The difference between set 1 and set 2 is significant even though it is small:

mean(set1); mean(set2)
# [1] 0.5006102
# [1] 0.5112549
wilcox.test(set1, set2)
# p-value = 1.023e-11

# Now reduce one set to 14 obs:
set.seed(3)
wilcox.test(sample(set1, size= 14), set2)
# p-value = 0.4413 
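
A single downsample gives only one p-value. To put a rough number on the loss of power, the downsampling can be repeated many times and the fraction of significant results counted. The following is a sketch under assumptions of its own: the seed, the 500 repetitions and the 0.05 threshold are arbitrary choices, not part of the example above.

```r
set.seed(4)
set1 <- rbeta(n = 10000, 10, 10)
set2 <- rbeta(n = 10000, 10, 9.5)

# Repeat the downsampling to 14 observations and record each p-value.
n_rep <- 500
p_values <- replicate(n_rep,
                      wilcox.test(sample(set1, size = 14), set2)$p.value)

# Estimated power: fraction of replicates significant at alpha = 0.05.
power_14 <- mean(p_values < 0.05)
power_14
```

With a difference this small, the estimated power at 14 observations is expected to be low, which is the point made in the text: a non-significant result at this sample size says little about whether a real difference exists.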
4.9 years ago by
Asaf6.5k
Israel
Asaf6.5k wrote:

You can select a small random set of SNPs from the million and run them against your test set.


Just be sure to repeat this random selection to see how variable it is. Set up a loop to repeat it 100 times and make a histogram of the results. Then set it to 10,000 and go to lunch.
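
A loop of this kind might look like the sketch below. The data are simulated and the names `background_daf`, `test_daf` and `one_run` are hypothetical; the subset size of 1000 is an arbitrary assumption, since the thread does not say how small the random set should be.

```r
set.seed(11)

# Hypothetical data: replace with real derived allele frequencies.
background_daf <- rbeta(1e5, 0.5, 2)  # stand-in for the synonymous SNPs
test_daf       <- rbeta(14, 1, 2)     # stand-in for the 14 studied mutations

# Draw one small random background subset and test it against the test set.
one_run <- function(subset_size = 1000) {
  bg_subset <- sample(background_daf, subset_size)
  wilcox.test(test_daf, bg_subset)$p.value
}

n_rep <- 100  # raise to 10000 for the long run
p_values <- replicate(n_rep, one_run())

# How variable is the result across random subsets?
summary(p_values)
hist(p_values, breaks = 20,
     main = "P-values across random background subsets",
     xlab = "Mann-Whitney p-value")
```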

Reply written 4.9 years ago by karl.stamm3.5k

Thank you. I'll try it

Reply written 4.9 years ago by billzt20