I don't understand your plot. Perhaps a legend would help? I also don't know what you mean by "genotype frequency"; is that the ratio of het/homo alleles? Or do you mean population-wide? Or... looking at it again, I think you mean that there are zero variants marked heterozygous on your graph? Those fit lines are ridiculous, by the way; the graph would be clearer without them. There are 3 straight lines (y=2x, y=0, and y=1+(-2x)) with some outliers. But I don't know why you have nothing with an AF>0.6; looks like some cutoff parameter.
But anyway, it's logical that certain low coverage ranges will have outsized numbers of het SNPs; that's because if you have a read depth of 3 and 1 read has a sequence error, presto, you get a het SNP called... depending on your settings. And depth. False homozygous SNP calls are much less likely unless you fail to do duplicate removal or call variants at depth=1.
You can get rid of many of these false het calls by sequencing at a higher depth, doing duplicate removal, doing quality-score recalibration, flowcell-position-sensitive read-filtering, using a better aligner that doesn't call indels as substitutions, and setting your minimum variant-calling depth to above whatever these spurious SNPs are called at. Hmm, a using better and non-human-specific variant-caller is a good idea too. Specifically for variant-calling from noisy Novaseq data, it's also useful to cull particularly error-prone reads when doing population studies or very low sequencing depths.
Basically, I'd be interested in:
- The depth at which these false calls occur;
- The sequencing platform;
- The preprocessing steps employed;
- The aligner;
- The variant-caller;
- The variant-caller's parameters;
- Whether the DNA was amplified, and how much;
- Whether you are talking about individual or population studies;
- And a legend for graph.
You're right, I wasn't explaining the problem clearly. Thanks for the directions!
The graph is based on per locus allele frequency (x axis) in relation to the 3 genotypes frequencies possible at each loci (red = AltAlt, blue = RefAlt, green = RefRef). This graph is basically showing the expected genotype frequencies (Hardy-Weinberg; HW, as the lines in the plot), with the 'real' genotype frequencies found in the population for each loci. This is way off what is expected and almost all sites are significantly different from what is expected assuming HW. I expect to find some sites that are not in HW but not all of them! (see something like this: https://gcbias.org/2011/10/13/population-genetics-course-resources-hardy-weinberg-eq/, note that to get AF>0.5, they randomly selected sites to be on the other side, but you are right, I don't know on top of my head, why there is no AF >0.6 here...)
The thing is that we looked at lcWGS data that was already published and found the same excess heterozygosity.