I don't understand your plot. Perhaps a legend would help? I also don't know what you mean by "genotype frequency"; is that the ratio of het/homo alleles? Or do you mean population-wide? Or... looking at it again, I think you mean that there are zero variants marked heterozygous on your graph? Those fit lines are ridiculous, by the way; the graph would be clearer without them. There are 3 straight lines (y=2x, y=0, and y=1-2x) with some outliers. But I don't know why you have nothing with an AF>0.6; that looks like the result of some cutoff parameter.
But anyway, it's logical that certain low-coverage ranges will have outsized numbers of het SNPs: if you have a read depth of 3 and 1 read has a sequencing error, presto, you get a het SNP called, depending on your settings and your depth. False homozygous SNP calls are much less likely unless you fail to do duplicate removal or call variants at depth=1.
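To put rough numbers on that, here is a minimal sketch of the chance that sequencing error alone puts enough discordant reads on a site to trigger a het call. It assumes errors are independent between reads and counts any error as support for the same alternate base, so it overestimates somewhat. The function name, the 1% per-base error rate, and the `min_alt_reads` threshold are all illustrative assumptions, not values from any particular caller.

```python
from math import comb

def prob_at_least_k_errors(depth, error_rate, min_alt_reads=1):
    """Probability that at least `min_alt_reads` of `depth` reads carry an
    error at a given site, assuming independent errors (binomial model)."""
    # Sum the binomial probabilities of seeing fewer than min_alt_reads errors,
    # then take the complement.
    p_fewer = sum(
        comb(depth, k) * error_rate**k * (1 - error_rate) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_fewer

if __name__ == "__main__":
    assumed_error_rate = 0.01  # hypothetical per-base error rate
    for depth in (3, 10, 30):
        for min_alt in (1, 2):
            p = prob_at_least_k_errors(depth, assumed_error_rate, min_alt)
            print(f"depth={depth:2d} min_alt_reads={min_alt} p={p:.6f}")
```

Even a small per-site probability adds up over a whole genome, and requiring a second supporting read cuts it sharply, which is why the caller's settings matter so much here.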
You can get rid of many of these false het calls by sequencing at a higher depth, doing duplicate removal, doing quality-score recalibration, doing flowcell-position-sensitive read filtering, using a better aligner that doesn't call indels as substitutions, and setting your minimum variant-calling depth above whatever depth these spurious SNPs are called at. Using a better, non-human-specific variant caller is a good idea too. Specifically for variant calling from noisy NovaSeq data, it's also useful to cull particularly error-prone reads when doing population studies or working at very low sequencing depths.
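If re-calling isn't an option, the minimum-depth idea can also be applied after the fact. The following is a rough sketch that drops records from VCF text whose INFO `DP` value is below a threshold. The function names and the threshold of 8 are hypothetical, and it assumes your caller writes total depth to the standard `DP` INFO key; records with no `DP` are kept because their depth is unknown.

```python
def info_depth(info_field):
    """Return the integer DP value from a VCF INFO string, or None if absent."""
    for entry in info_field.split(";"):
        key, _, value = entry.partition("=")
        if key == "DP" and value.isdigit():
            return int(value)
    return None

def filter_vcf_by_depth(lines, min_depth):
    """Keep header lines and records whose INFO DP is >= min_depth.

    Records without a DP entry are kept, since their depth is unknown.
    """
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):  # meta-information and column header lines
            kept.append(line)
            continue
        fields = line.split("\t")
        # INFO is the 8th tab-separated column of a VCF record.
        depth = info_depth(fields[7]) if len(fields) > 7 else None
        if depth is None or depth >= min_depth:
            kept.append(line)
    return kept

if __name__ == "__main__":
    example = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "chr1\t100\t.\tA\tG\t50\tPASS\tDP=3;AF=0.5",
        "chr1\t200\t.\tC\tT\t50\tPASS\tDP=12;AF=0.5",
    ]
    # min_depth=8 is an arbitrary illustrative threshold.
    for record in filter_vcf_by_depth(example, min_depth=8):
        print(record)
```

Pick the threshold by looking at the depth distribution of the suspicious calls rather than using a fixed number.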
Basically, I'd be interested in:
- The depth at which these false calls occur;
- The sequencing platform;
- The preprocessing steps employed;
- The aligner;
- The variant-caller;
- The variant-caller's parameters;
- Whether the DNA was amplified, and how much;
- Whether you are talking about individual or population studies;
- And a legend for the graph.