Comparing Allele Frequency Between 1000 Genomes And NHLBI
10.5 years ago
User 1933 ▴ 340

I have a set of variants. These variants are also reported in the 1000 Genomes Project (summary table) as well as by the NHLBI (National Heart, Lung, and Blood Institute). I wanted to see the frequencies of these variants (allele frequencies) in these two projects and check whether they agree with each other. For this comparison I used the Mann-Whitney test.

Here are my questions:

  1. For such a comparison I expect to get a non-significant p-value. Is the Mann-Whitney test the right test?
  2. Is my expectation logical?

Thank you,


I think it is not the right approach, but it is hard to put a finger on anything because your post is very unclear. What is "genome 1000"? What is "NHLBI"? How do you generate your count table? Are you looking for a count difference for each allele? What is the question, after all? "I want to see the frequency of these variants in genome 1000 and NHLBI" is not a valid question for a statistical test, because you can easily extract the allele frequencies (I guess that is what you mean by "frequency of these variants").


Thanks. I have updated the question to try to clarify the points you raised.

10.5 years ago
Michael 54k

I think that either Pearson's chi-squared test for independence or Fisher's exact test will be appropriate. In the chi-squared test, the null hypothesis is that the allele counts are independent of the data set (that is, 1000 Genomes and NHLBI share the same allele frequency), and the alternative hypothesis is that allele and data set are associated (the frequencies differ). A non-significant p-value therefore only means that the data give no evidence of a difference; it does not prove that the frequencies agree. You will have to check whether you can formulate your research question in terms of the null and alternative hypotheses, and try to format your data to fit the test (e.g. Fisher's test requires counts, not frequencies).
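As a rough illustration of the two tests mentioned above, here is a minimal Python sketch for a single variant, using only the standard library. The allele counts are hypothetical (made up for illustration, not taken from 1000 Genomes or NHLBI), and the function names `chi2_2x2` and `fisher_exact_2x2` are my own. The chi-squared version applies no continuity correction.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) on [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test: sum over tables no more likely than the observed one."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = math.comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x alternate alleles in the first data set.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)

# Hypothetical counts for one variant: (alternate alleles, reference alleles).
kg_alt, kg_ref = 120, 4888        # assumed 1000 Genomes counts
nhlbi_alt, nhlbi_ref = 310, 12690  # assumed NHLBI counts

stat, p_chi2 = chi2_2x2(kg_alt, kg_ref, nhlbi_alt, nhlbi_ref)
p_fisher = fisher_exact_2x2(kg_alt, kg_ref, nhlbi_alt, nhlbi_ref)
print(f"chi-squared statistic = {stat:.4f}, p = {p_chi2:.4f}")
print(f"Fisher's exact test p = {p_fisher:.4f}")
```

This tests one variant at a time; with many variants, the resulting p-values would need a multiple-testing correction.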


Why the Mann-Whitney U test (Wilcoxon rank-sum test) might not be appropriate: the MWU test tests the null that two populations are the same against the alternative that the populations are different, without making an assumption about the distribution. The only requirement is that the sampling is done from two populations, where I interpret a "population" as being generated by a single random process, by repeated sampling of the same random variable. That is not the case for allele frequencies of different SNPs. (We cannot count repetitions per individual, because each individual sample contributes 0 or 1 to the MAF.) In other words, you would be comparing apples and oranges. I think that this is also a reasonable assumption for real allele frequencies of SNPs.

An example: imagine our test set consists of two SNPs for which we know the true MAF in the whole population, one 0.1 and the other 0.4. If you put them together you might get a sample vector of, e.g., x = (0.08, 0.45). However, we know a priori that the values in this vector were not sampled from the same random process: each SNP has its own "allele-generating process" with its own variance.
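To make the "own variance for each SNP" point concrete, here is a small Python sketch that computes the binomial standard error of the estimated allele frequency for the two SNPs in the example. It assumes diploid individuals whose alleles are sampled independently, and the sample size of 100 individuals is an arbitrary choice for illustration; `maf_standard_error` is a hypothetical helper name.

```python
import math

def maf_standard_error(true_maf, n_individuals):
    """Binomial standard error of an allele-frequency estimate from 2N chromosomes."""
    n_chromosomes = 2 * n_individuals  # assumption: diploid, two alleles per individual
    return math.sqrt(true_maf * (1 - true_maf) / n_chromosomes)

# The two SNPs from the example, with an assumed sample of 100 individuals.
for maf in (0.1, 0.4):
    se = maf_standard_error(maf, n_individuals=100)
    print(f"true MAF = {maf}: standard error of the estimate = {se:.4f}")
```

Under these assumptions the two estimates are centred on different values and have different spreads, so pooling them does not give repeated draws of one random variable.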

It is a bit harder to argue why a test is not appropriate, so if someone has a more well-founded argument for or against that will be welcome.


I am going to update my question with a comparison of the chi-squared test and MWU, plus an additional plot of these distributions. I found this R function, http://stat.ethz.ch/R-manual/R-patched/library/stats/html/chisq.test.html; however, its documentation says it is for count data. Do you recommend this function for such a comparison, or is the chi-squared test you are referring to designed for frequency distributions?


chisq.test is suitable for contingency tables of counts. I would recommend using allele counts, not the normalized frequencies, if possible, because the frequencies contain less information.


We don't have allele count information for these two projects (1000 Genomes / NHLBI). I am looking at these distributions through Ingenuity, and it only provides the frequencies. I would be happy to do statistics on frequencies.

10.5 years ago

The first thing to do is to plot the two distributions (the site frequency spectrum) and compare them:

[Figure: example plot of two distributions; image not available]

Then, a Mann-Whitney test is a good option to compare the two distributions. However, if you have a large number of individuals, it is very likely that the Mann-Whitney test, or any other test, will give you a significant p-value, even if the two means are close.
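The sample-size effect can be sketched in a few lines of Python. The example below uses a chi-squared test on a 2x2 table of allele counts rather than Mann-Whitney, purely because it is easy to compute with the standard library; the frequencies (0.100 versus 0.105) and the sample sizes are invented for illustration, and `p_value_for_scale` is a hypothetical helper.

```python
import math

def chi2_p_value(a, b, c, d):
    """P-value of Pearson's chi-squared test (1 df, no correction) on [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return math.erfc(math.sqrt(stat / 2))

def p_value_for_scale(k):
    """Two data sets of 1000 * k chromosomes each, with fixed frequencies 0.100 and 0.105."""
    n = 1000 * k
    alt1, alt2 = 100 * k, 105 * k
    return chi2_p_value(alt1, n - alt1, alt2, n - alt2)

# The frequency difference stays at 0.005; only the number of chromosomes changes.
for k in (1, 10, 100, 1000):
    print(f"{1000 * k:>9} chromosomes per data set: p = {p_value_for_scale(k):.3g}")
```

Because the difference in frequency is the same in every row, it can help to report that difference (an effect size) alongside the p-value rather than relying on significance alone.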


I have added an attempt to argue why the MWU test is not appropriate in this case; maybe we can discuss this? I appreciate the attempt to plot the distribution of MAF. Will it look like the ones you are showing?


That is the case: the means are close, and I see the impact of the number of samples on my p-value. Is there any treatment you can recommend?
