Question: How Does Average Heterozygosity Relate To Allelic Frequency Data
7.5 years ago by
Duarte Molha230
Oxford, UK
Duarte Molha230 wrote:

Forgive me if this is a dumb question, but I assumed that the average heterozygosity was related to the distribution of frequencies seen for each allele of a given variation, i.e. an avHet close to 0.5 and an avHetSE lower than 0.1 would probably mean that a variation with 2 detected alleles has relatively balanced allele frequencies, such as 0.45 for allele A and 0.55 for allele B.

Is my thinking flawed? I ask because I filtered dbSNP 137 using avHet >= 0.4 and avHetSE < 0.1, and I am getting loads of variations where one allele is clearly dominant, with a frequency above 0.8.

I've tried to get my head around the maths for the AvHet calculation in but I admit defeat. I am not a Mathematician by training and could not make sense of it.

allele • 15k views
ADD COMMENTlink modified 2.5 years ago by Shicheng Guo8.5k • written 7.5 years ago by Duarte Molha230

Thank you for the detailed answers guys. They do make it much clearer but I am still perplexed why I am seeing such high allelic frequencies for the values of AvHet I had used to filter the dataset.

ADD REPLYlink written 7.5 years ago by Duarte Molha230

Take as an example this SNP:

ID: rs112111814 Alleles:C/T AlleleCounts:2137, 49 allele_frequencies:0.977585,0.022415 avHet:0.5 avHetSE:0

How can such a variation have an avHet of 0.5 when 97% of the observed alleles are C and less than 3% are T?
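
As a rough cross-check of the record above, the heterozygosity expected under Hardy-Weinberg proportions can be computed directly from the two allele counts. This is only a sketch: the function name `expected_het_from_counts` is made up for illustration, it assumes a biallelic site and random mating, and it is not claimed to be how dbSNP derives avHet.

```python
def expected_het_from_counts(count_a, count_b):
    """Expected heterozygosity (2pq) for a biallelic site from raw allele counts."""
    total = count_a + count_b
    if total == 0:
        raise ValueError("no alleles observed")
    p = count_a / total
    q = count_b / total
    return 2 * p * q


# Allele counts quoted above for rs112111814 (C: 2137, T: 49)
het = expected_het_from_counts(2137, 49)
print(f"expected heterozygosity: {het:.4f}")  # about 0.044, far from 0.5
```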

ADD REPLYlink modified 7.5 years ago • written 7.5 years ago by Duarte Molha230

It depends what you mean by 97% and 3%. Are these based on the 2137 counts? If yes, 3% amounts to about 60 sequences. I don't know how many individuals you have, nor your criteria for calling heterozygotes, but if you have 20 individuals, it is possible that in a few cases 50% of them will be called heterozygous (based on an average of 6 reads). In cases like this, I would suspect the presence of paralogs in your data set. For instance, you think that you are observing a single locus, but in fact the data from 2 different loci get combined. Locus 1 is 100% allele A and locus 2 is 100% allele B. This would give you high heterozygosities. In fact, when you do have paralogs, removing SNPs where Het is greater than 0.5 or 0.6 may help remove those paralogs.
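
The paralog scenario can be illustrated with a toy example. The sketch below assumes a hypothetical case in which reads from two fixed paralogous loci (one all A, one all B) are merged into a single apparent locus, so that every individual is called A/B. The function names and the sample size of 20 are invented for illustration.

```python
def observed_het(genotypes):
    """Fraction of individuals whose two called alleles differ."""
    return sum(a != b for a, b in genotypes) / len(genotypes)


def hwe_expected_het(genotypes):
    """2pq under Hardy-Weinberg, with p estimated from the called genotypes."""
    alleles = [allele for genotype in genotypes for allele in genotype]
    p = alleles.count("A") / len(alleles)
    return 2 * p * (1 - p)


# Hypothetical: 20 individuals, each carrying locus 1 (fixed for A) and
# locus 2 (fixed for B); merging the two loci makes everyone look A/B.
merged_calls = [("A", "B")] * 20

h_obs = observed_het(merged_calls)      # every individual looks heterozygous
h_exp = hwe_expected_het(merged_calls)  # p = 0.5, so 2pq = 0.5
print(h_obs, h_exp)
```

An observed heterozygosity well above the 2pq expectation is the kind of signal that a "Het greater than 0.5" filter is meant to catch.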

ADD REPLYlink written 7.5 years ago by Eric Normandeau10k

See... I am now sure I am completely ignorant, because I cannot understand your explanation :(

The link to the variation is here:;source=dbSNP;v=rs112111814;vdb=variation;vf=25173810

It contains 1092 individuals from the 1000 Genomes Project with genotype calls: 1044 (C|C) / 48 (C|T)

The way I would look at this: in the population tested, about 96% of individuals are homozygotes (C|C), so according to your own graph the AvHet should be below 0.1. Or am I just completely misunderstanding the calculations?

Thank you for your patience Eric

PS: this variation is a single locus.
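
For reference, the genotype calls quoted above can be turned into an observed and an expected heterozygosity. This is a minimal sketch assuming a biallelic site; `het_from_genotype_counts` is a hypothetical helper, not part of any dbSNP or Ensembl tooling.

```python
def het_from_genotype_counts(n_hom_ref, n_het, n_hom_alt):
    """Return (p, observed heterozygosity, expected 2pq) from genotype counts."""
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        raise ValueError("no genotypes supplied")
    # Each individual carries two alleles; heterozygotes contribute one copy.
    p = (2 * n_hom_ref + n_het) / (2 * n)
    observed = n_het / n
    expected = 2 * p * (1 - p)
    return p, observed, expected


# 1044 (C|C), 48 (C|T), and no (T|T) calls, as listed above
p, h_obs, h_exp = het_from_genotype_counts(1044, 48, 0)
print(f"p(C) = {p:.4f}, observed het = {h_obs:.4f}, expected het = {h_exp:.4f}")
```

Both values come out near 0.04, which supports the expectation that a heterozygosity of 0.5 does not fit these genotype calls.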

ADD REPLYlink modified 7.0 years ago • written 7.5 years ago by Duarte Molha230
7.5 years ago by
confusedious420 wrote:

On this one, just go back to Hardy-Weinberg equilibrium to calculate what you might expect.

p^2 + 2pq + q^2 = 1

So if you have a heterozygosity of almost 0.5 (which is the maximum expected heterozygosity for a biallelic locus), it would mean that almost half of the individuals in the sample were heterozygotes. In this case, you could assume that the allele frequencies p and q are both close to 0.5. Any other allele frequency would result in less heterozygosity.

Do be careful, however, when you use the word dominant. An allele frequency of 0.8 does not always mean that the allele is dominant. If a population recently underwent a bottleneck for example, there is a chance that a recessive allele could have been pushed to near fixation by drift.

For the sake of having an example, let's begin with a biallelic system where p = 0.8 and q = 0.2. Let's use this to calculate heterozygosity.

1 = 0.8^2 + 2 x 0.8 x 0.2 + 0.2^2
1 = 0.64 + 0.32 + 0.04

So in this case, heterozygosity would be 0.32

So that's the relationship between allele frequencies and heterozygosity out of the way.
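
The same arithmetic can be written as a few lines of Python. This is a small sketch assuming a biallelic locus in Hardy-Weinberg equilibrium; `hardy_weinberg` is just an illustrative name.

```python
def hardy_weinberg(p):
    """Genotype frequencies (p^2, 2pq, q^2) for allele frequency p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    q = 1.0 - p
    return p * p, 2 * p * q, q * q


hom_a, het, hom_b = hardy_weinberg(0.8)
print(round(hom_a, 2), round(het, 2), round(hom_b, 2))  # 0.64 0.32 0.04
```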

It is my understanding that average heterozygosity, as an average, must be taken across many loci. So if there is an average heterozygosity of 0.25, for example, you could theoretically have quite diverse heterozygosities from locus to locus. As such, you should not infer too much about the allele frequency of a given locus from an average heterozygosity that is taken across many loci.
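
To make that concrete, here is a sketch with five invented loci whose expected heterozygosities differ widely but average out to roughly 0.26. The allele frequencies are hypothetical and chosen only to illustrate the point.

```python
from statistics import mean


def locus_het(p):
    """Expected heterozygosity (2pq) at one biallelic locus."""
    return 2 * p * (1 - p)


# Hypothetical major-allele frequencies at five loci (invented values)
allele_freqs = [0.5, 0.95, 0.7, 0.99, 0.85]

per_locus = [locus_het(p) for p in allele_freqs]
average = mean(per_locus)

for p, h in zip(allele_freqs, per_locus):
    print(f"p = {p:.2f}  het = {h:.3f}")
print(f"average heterozygosity = {average:.3f}")
```

The per-locus values range from about 0.02 to 0.5, so the average alone says little about any single locus.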

Does this help?

ADD COMMENTlink modified 7.5 years ago • written 7.5 years ago by confusedious420

Average heterozygosity can be taken for one locus across many individuals.

ADD REPLYlink written 7.5 years ago by Eric Normandeau10k

Curious: For a biallelic marker, a diploid individual is either a heterozygote or they are not. Therefore, if you were to encode it, it would be binary. Would that then mean you would be taking a mean of a whole pile of ones and zeros? I could see that making some sense. If you are determining what portion of individuals are heterozygotes at a single locus, is this not just traditional heterozygosity? I don't mean to sound in any way cheeky or facetious here, as a newcomer myself I would just like to hear how it is done, and if so why it is useful.
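
The "mean of ones and zeros" idea can be sketched as follows. The five genotypes are made up for illustration, and each individual is encoded as 1 if heterozygous and 0 otherwise.

```python
def is_heterozygote(genotype):
    """Return 1 if the two alleles differ, otherwise 0."""
    allele_1, allele_2 = genotype
    return 1 if allele_1 != allele_2 else 0


# Hypothetical genotype calls at a single locus for five diploid individuals
genotypes = [("C", "C"), ("C", "T"), ("C", "C"), ("T", "C"), ("C", "C")]

indicators = [is_heterozygote(g) for g in genotypes]
observed_heterozygosity = sum(indicators) / len(indicators)
print(indicators, observed_heterozygosity)  # [0, 1, 0, 1, 0] 0.4
```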

ADD REPLYlink written 7.5 years ago by confusedious420

I see what you mean. From Van Dyke, F. 2002. Conservation Biology: Foundations, Concepts, Applications. 2nd ed. Springer. 477 p.: heterozygosity: carrying different alleles for a particular genetic locus, as opposed to homozygous (having the same alleles) or hemizygous (having one allele). Average heterozygosity is a measure of genetic diversity at the population scale and indicates the average proportion of individuals that are heterozygous for a given trait.

ADD REPLYlink modified 7.5 years ago • written 7.5 years ago by Eric Normandeau10k

Thank you for that Eric. It is good to clarify what is meant by this - I suppose one must assume that a sample one takes represents something like an average of the entire population, as sampling the whole population is usually impossible.

ADD REPLYlink written 7.5 years ago by confusedious420

In this case I believe the avHet value reported in dbSNP is calculated for that locus across many samples, so I believe the allele frequency should be directly related to the avHet according to the graph given by @Eric Normandeau

ADD REPLYlink written 7.5 years ago by Duarte Molha230

Oh I see. Multiple samples meaning multiple groups of individuals (populations perhaps)?

ADD REPLYlink written 7.5 years ago by confusedious420
7.5 years ago by
Quebec, Canada
Eric Normandeau10k wrote:

If p is the frequency of allele A and q = 1 - p is the frequency of allele B, then the probability that an individual is heterozygous in a population with random mating is equal to 2pq. The relationship between p and AvHet is thus the following:

[Figure: expected heterozygosity (AvHet = 2pq) plotted against allele frequency p]
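
The relationship can also be tabulated numerically. This is a sketch assuming a biallelic locus under random mating; `het_curve` is a hypothetical helper name.

```python
def het_curve(steps=10):
    """Return (p, 2pq) pairs for p evenly spaced from 0 to 1."""
    points = []
    for i in range(steps + 1):
        p = i / steps
        points.append((p, 2 * p * (1 - p)))
    return points


curve = het_curve()
for p, het in curve:
    print(f"p = {p:.1f}  2pq = {het:.2f}")
```

The values rise from 0 at p = 0 to a maximum of 0.5 at p = 0.5 and fall symmetrically back to 0 at p = 1.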

ADD COMMENTlink modified 7.5 years ago • written 7.5 years ago by Eric Normandeau10k

Thanks... the visuals do help :)

ADD REPLYlink written 7.5 years ago by Duarte Molha230

I still do not understand why I am getting variations with a much more dominant allele when filtering using avHet >= 0.4 and avHetSE < 0.1. Following your chart, those parameters should give me an allele frequency for allele A (on a biallelic variation) between roughly 0.25 and 0.75. However, many variations outside these limits are still passing the filter. :S
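
The allele frequency interval implied by a heterozygosity cutoff can be worked out by solving 2p(1 - p) >= h for p, which gives p = (1 ± sqrt(1 - 2h)) / 2. The sketch below assumes a biallelic locus in Hardy-Weinberg equilibrium; `allele_freq_interval` is an illustrative name.

```python
import math


def allele_freq_interval(min_het):
    """Range of p for which 2p(1 - p) is at least min_het."""
    if not 0.0 <= min_het <= 0.5:
        raise ValueError("min_het must lie between 0 and 0.5 for a biallelic locus")
    # Roots of p^2 - p + min_het / 2 = 0, symmetric around p = 0.5
    half_width = math.sqrt(max(0.0, 1 - 2 * min_het)) / 2
    return 0.5 - half_width, 0.5 + half_width


low, high = allele_freq_interval(0.4)
print(f"{low:.3f} <= p <= {high:.3f}")  # 0.276 <= p <= 0.724
```

For a cutoff of 0.4 the interval is slightly narrower than the 0.25 to 0.75 read off the chart, so an allele frequency above 0.8 is not expected to pass under these assumptions.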

ADD REPLYlink written 7.5 years ago by Duarte Molha230
2.5 years ago by
Shicheng Guo8.5k
Shicheng Guo8.5k wrote:

avHet: Average heterozygosity from all observations. Note: may be computed on a small number of samples.
avHetSE: Standard error for the average heterozygosity.
Average heterozygosity should not exceed 0.5 for bi-allelic single-base substitutions.

ADD COMMENTlink written 2.5 years ago by Shicheng Guo8.5k