Question: How Does Average Heterozygosity Relate To Allelic Frequency Data
2
Duarte Molha210 wrote:

Forgive me if this is a dumb question, but I assumed that the average heterozygosity was somehow related to the distribution of allele frequencies seen in any given variation, i.e. an AvHet close to 0.5 and an avHetSE lower than 0.1 would probably mean that a variation with 2 detected alleles has relatively balanced allele frequencies, such as 0.45 for allele A and 0.55 for allele B.

Is my thinking flawed? I ask this because I filtered dbsnp137 using avHet >= 0.4 and avHetSE < 0.1, and I am getting loads of variations where 1 allele is clearly dominant, with a frequency above 0.8.

I've tried to get my head around the maths for the AvHet calculation in http://www.ncbi.nlm.nih.gov/projects/SNP/Hetfreq.html but I admit defeat. I am not a Mathematician by training and could not make sense of it.

allele • 13k views
modified 16 months ago by Shicheng Guo7.8k • written 6.4 years ago by Duarte Molha210

Thank you for the detailed answers guys. They do make it much clearer but I am still perplexed why I am seeing such high allelic frequencies for the values of AvHet I had used to filter the dataset.

Take as an example this SNP:

ID: rs112111814 Alleles:C/T AlleleCounts:2137, 49 allele_frequencies:0.977585,0.022415 avHet:0.5 avHetSE:0

How can such a variation have an AvHet of 0.5 when 97% of seen alleles are C and less than 3% are allele T?

It depends what you mean by 97% and 3%. Are these based on the 2137 counts? If yes, 3% amounts to 60 sequences. I don't know how many individuals you have nor your criteria for calling heterozygotes, but if you have 20 individuals, it is possible that in a few cases 50% of them will be called hetero (based on an average of 6 reads). In cases like this, I would suspect the presence of paralogs in your data set. For instance, you think that you are observing a single locus, but in fact the data from 2 different loci get combined. Locus 1 is 100% allele A and locus 2 is 100% allele B. This would give you high heterozygosities. In fact, when you do have paralogs, removing SNPs where Het is greater than 0.5 or 0.6 may help remove those paralogs.

See... I am now sure I am completely ignorant because I cannot understand your explanation :(

The link to the variation is here: http://www.ensembl.org/Homo_sapiens/Variation/Explore?r=3:197694045-197695045;source=dbSNP;v=rs112111814;vdb=variation;vf=25173810

It contains 1092 individuals from the 1K genome project with genotype calls: 1044 (C|C) / 48 (C|T)

The way I would look at this, in the population tested there are 97% homozygotes (C|C), so according to your own graph the AvHet should be below 0.1. Or am I just completely misunderstanding the calculations?

Thank you for your patience Eric

PS: this variation is a single locus.
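For what it is worth, the genotype counts quoted above can be turned into heterozygosity figures directly. This is only a minimal sketch: the function name is hypothetical, the counts are the ones quoted in this thread, and it is not a reproduction of how dbSNP computes avHet.

```python
def heterozygosity_from_genotypes(n_hom_ref, n_het, n_hom_alt):
    """Return (observed, expected) heterozygosity for a biallelic locus.

    observed: fraction of individuals that are heterozygous.
    expected: 2pq under Hardy-Weinberg, with p estimated from the counts.
    """
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        raise ValueError("at least one genotype is required")
    observed = n_het / n
    # Each individual carries two alleles, so there are 2n alleles in total.
    p = (2 * n_hom_ref + n_het) / (2 * n)
    expected = 2 * p * (1 - p)
    return observed, expected


# Counts quoted above for rs112111814: 1044 C|C, 48 C|T, no T|T.
observed, expected = heterozygosity_from_genotypes(1044, 48, 0)
print(observed, expected)
```

Both values come out well below 0.1 for these counts, which agrees with the intuition in the post rather than with the reported avHet of 0.5.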

5
confusedious420 wrote:

On this one, just go back to Hardy-Weinberg equilibrium to calculate what you might expect.

p^2 + 2pq + q^2 = 1

So if you have a heterozygosity of almost 0.5 (which is generally the maximum heterozygosity that you can have at a biallelic locus), it would mean that almost half of the individuals in the sample were heterozygotes. In this case, you could assume that the allele frequencies p and q are both close to 0.5. Any other allele frequency would result in lower heterozygosity.

Do be careful, however, when you use the word dominant. An allele frequency of 0.8 does not always mean that the allele is dominant. If a population recently underwent a bottleneck for example, there is a chance that a recessive allele could have been pushed to near fixation by drift.

For the sake of having an example, let's begin with a biallelic system where p = 0.8 and q = 0.2. Let's use this to calculate heterozygosity.

1 = 0.8^2 + 2 x 0.8 x 0.2 + 0.2^2
1 = 0.64 + 0.32 + 0.04

So in this case, heterozygosity would be 0.32

So that's the relationship between allele frequencies and heterozygosity out of the way.
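The same arithmetic can be checked in a few lines of Python. This is just a sketch of the Hardy-Weinberg expectation for a biallelic locus; the function name is made up for illustration.

```python
def hwe_genotype_frequencies(p):
    """Expected (AA, AB, BB) genotype frequencies under Hardy-Weinberg."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    q = 1.0 - p
    # p^2 + 2pq + q^2 = 1; the middle term is the expected heterozygosity.
    return p * p, 2.0 * p * q, q * q


hom_a, het, hom_b = hwe_genotype_frequencies(0.8)
print(round(hom_a, 2), round(het, 2), round(hom_b, 2))  # 0.64 0.32 0.04
```

Trying other values of `p` shows that the middle term never goes above 0.5, which it reaches at `p = 0.5`.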

It is my understanding that average heterozygosity, as an average, must be taken from across many loci. So if there is an average heterozygosity of 0.25 for example, you could theoretically have quite diverse heterozygosities from locus to locus. As such, you should not infer too much about the allele frequency of a given locus from an average heterozygosity that is taken from across many loci.

Does this help?

2

Average heterozygosity can be taken for one locus across many individuals.

Curious: For a biallelic marker, a diploid individual is either a heterozygote or they are not. Therefore, if you were to encode it, it would be binary. Would that then mean you would be taking a mean of a whole pile of ones and zeros? I could see that making some sense. If you are determining what portion of individuals are heterozygotes at a single locus, is this not just traditional heterozygosity? I don't mean to sound in any way cheeky or facetious here, as a newcomer myself I would just like to hear how it is done, and if so why it is useful.
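To make the "mean of ones and zeros" idea concrete, here is a small sketch. The genotypes are invented for illustration and the function name is hypothetical; it only shows the encoding described above, not any particular database's method.

```python
def proportion_heterozygous(genotypes):
    """Mean of a 0/1 heterozygote indicator over diploid genotypes."""
    if not genotypes:
        raise ValueError("at least one genotype is required")
    # 1 if the two alleles differ (heterozygote), 0 otherwise.
    flags = [1 if a != b else 0 for a, b in genotypes]
    return sum(flags) / len(flags)


# Made-up sample of four individuals at one biallelic locus.
sample = [("C", "C"), ("C", "T"), ("C", "C"), ("T", "C")]
print(proportion_heterozygous(sample))  # 0.5
```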

1

I see what you mean. From Van Dyke, F. 2002. Conservation Biology: Foundations, Concepts, Applications. 2nd ed. Springer. 477 p.: heterozygosity: carrying different alleles for a particular genetic locus, as opposed to homozygous (having the same alleles) or hemizygous (having one allele). Average heterozygosity is a measure of genetic diversity at the population scale and indicates the average proportion of individuals that are heterozygous for a given trait.

Thank you for that Eric. It is good to clarify what is meant by this - I suppose one must assume that a sample one takes represents something like an average of the entire population, as sampling the whole population is usually impossible.

1

In this case I believe the avHet value reported in dbSNP is calculated for that locus across many samples, so I believe the allele frequency should be directly related to the avHet according to the graph given by @Eric Normandeau

Oh I see. Multiple samples meaning multiple groups of individuals (populations perhaps)?

3
Eric Normandeau10k wrote:

If `p` is the frequency of allele `A` and `q = 1 - p` is the frequency of allele `B`, then the chance of having a heterozygous individual in a population with random mating is equal to `2pq`. The relationship between `p` and `AvHet` is thus the following:

[Figure: AvHet (2pq) plotted against allele frequency p]

Thanks... the visuals do help :)

I still do not understand why I am getting variations with a much more dominant allele when filtering using avHet >= 0.4 and avHetSE < 0.1. Following your chart, those parameters would give me an allelic frequency interval for allele A (on a biallelic variation) between 0.25 and 0.75. However, many variations outside these limits are still passing the filter. :S
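The interval read off the chart can also be worked out by solving `2p(1 - p) >= h` for `p`, which gives `p = (1 ± sqrt(1 - 2h)) / 2`. A minimal sketch, assuming a biallelic locus in Hardy-Weinberg proportions; the function name is hypothetical.

```python
import math


def allele_frequency_interval(min_het):
    """Range of p for which 2p(1 - p) >= min_het at a biallelic locus."""
    if not 0.0 <= min_het <= 0.5:
        raise ValueError("min_het must be between 0 and 0.5")
    # Roots of 2p(1 - p) = min_het, i.e. p^2 - p + min_het / 2 = 0.
    root = math.sqrt(1.0 - 2.0 * min_het)
    return (1.0 - root) / 2.0, (1.0 + root) / 2.0


low, high = allele_frequency_interval(0.4)
print(round(low, 3), round(high, 3))  # 0.276 0.724
```

So under these assumptions a cutoff of 0.4 corresponds to `p` between roughly 0.28 and 0.72, slightly narrower than the 0.25 to 0.75 estimated from the chart. An allele at frequency 0.8 or higher would not be expected to pass.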

0
Shicheng Guo7.8k wrote:

Average heterozygosity from all observations. Note: may be computed on small number of samples. Standard Error for the average heterozygosity. Average heterozygosity should not exceed 0.5 for bi-allelic single-base substitutions. https://www.ncbi.nlm.nih.gov/SNP/Hetfreq.html