Question

Low coverage whole genome sequencing reveals excess heterozygosity for multiple SNPs. How to filter?

4 months ago

beausoleilmo ▴ 580

I've been talking to other researchers who are using low-coverage whole genome sequencing and found that a couple had the same problem: many SNPs (a lot more than expected) show higher heterozygosity than expected. I heard this mentioned for Salmonids, Lupine, Birds, etc.

Is this a widespread phenomenon?
What is causing it?
How can we 'filter' to remove this excess heterozygosity?

The plot below shows the genotype frequency (y) as a function of allele frequency (x).

[Figure: genotype frequency (y) versus allele frequency (x)]
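For readers comparing a plot like this against a baseline, here is a minimal sketch of the Hardy-Weinberg expectation, assuming the plot shows per-SNP genotype frequencies across a population sample. The function names are hypothetical and the counts are made up for illustration. Under Hardy-Weinberg the expected heterozygote frequency is 2p(1-p); a SNP where every carrier is heterozygous has a heterozygote frequency of 2p, which sits above that curve.

```python
def hwe_expected(p):
    """Expected genotype frequencies under Hardy-Weinberg for alt-allele frequency p."""
    q = 1.0 - p
    return {"hom_ref": q * q, "het": 2.0 * p * q, "hom_alt": p * p}


def inbreeding_coefficient(n_hom_ref, n_het, n_hom_alt):
    """F = 1 - H_obs / H_exp from genotype counts; negative F means excess hets."""
    n = n_hom_ref + n_het + n_hom_alt
    p = (2 * n_hom_alt + n_het) / (2 * n)   # alt-allele frequency from counts
    exp_het = 2 * p * (1 - p)
    if exp_het == 0:
        return 0.0                          # monomorphic site, F is undefined
    return 1.0 - (n_het / n) / exp_het


# Hypothetical counts: a site where every sampled individual is called het.
print(hwe_expected(0.5))
print(inbreeding_coefficient(0, 10, 0))
```

A strongly negative F at many sites is one way to quantify the excess described above; what threshold to filter on is a judgment call that depends on the species and sample size.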

lcWGS WGS heterozygosity • 478 views

score 0 · Answer 1 · 2023-12-01

I don't understand your plot. Perhaps a legend would help? I also don't know what you mean by "genotype frequency"; is that the ratio of het/homo alleles? Or do you mean population-wide? Or... looking at it again, I think you mean that there are zero variants marked heterozygous on your graph? Those fit lines are ridiculous, by the way; the graph would be clearer without them. There are 3 straight lines (y=2x, y=0, and y=1+(-2x)) with some outliers. But I don't know why you have nothing with an AF>0.6; looks like some cutoff parameter.

But anyway, it's logical that low-coverage data will have an outsized number of het SNPs: if you have a read depth of 3 and 1 read has a sequencing error, presto, you get a het SNP called, depending on your settings and depth. False homozygous SNP calls are much less likely unless you skip duplicate removal or call variants at depth=1.
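To put a rough number on that, here is a back-of-the-envelope sketch. It assumes independent per-base errors and a naive caller that reports a het whenever a single read disagrees with the rest; real callers weigh base qualities and priors, so treat this as an upper-bound illustration only. The function name and the error rates are hypothetical.

```python
def prob_at_least_one_error(depth, per_base_error):
    """Chance that at least one of `depth` reads carries an error at a given site,
    assuming errors are independent between reads (a simplification)."""
    return 1.0 - (1.0 - per_base_error) ** depth


# Hypothetical per-base error rates; substitute values that fit your platform.
for err in (0.001, 0.01):
    for depth in (1, 3, 5, 10):
        p = prob_at_least_one_error(depth, err)
        print(f"error={err} depth={depth} P(>=1 erroneous read)={p:.4f}")
```

Even a small per-site probability adds up to a large number of spurious het calls once it is multiplied across every site in a genome.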

You can get rid of many of these false het calls by sequencing at a higher depth, doing duplicate removal, doing quality-score recalibration, flowcell-position-sensitive read-filtering, using a better aligner that doesn't call indels as substitutions, and setting your minimum variant-calling depth above whatever depth these spurious SNPs are called at. Hmm, using a better and non-human-specific variant-caller is a good idea too. Specifically for variant-calling from noisy NovaSeq data, it's also useful to cull particularly error-prone reads when doing population studies or working at very low sequencing depths.
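As an illustration of the minimum-depth idea only, here is a minimal sketch that sets calls below a depth threshold to missing. It assumes you have already extracted per-sample genotype strings and depths for a site into plain lists; the function name, the data, and the threshold are all hypothetical, and in practice you would do this with your variant caller's or VCF toolkit's own filtering options.

```python
def mask_low_depth(genotypes, depths, min_depth):
    """Return a copy of `genotypes` with calls below `min_depth` set to None (missing)."""
    return [gt if dp >= min_depth else None for gt, dp in zip(genotypes, depths)]


# Hypothetical calls for three samples at one site.
genotypes = ["0/1", "0/0", "1/1"]
depths = [3, 12, 8]

# The threshold should sit above the depth at which the spurious hets appear.
print(mask_low_depth(genotypes, depths, min_depth=8))
```

With very low coverage a hard depth cutoff can discard most of the data, so it is worth checking how many calls survive before settling on a threshold.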

Basically, I'd be interested in:

The depth at which these false calls occur;
The sequencing platform;
The preprocessing steps employed;
The aligner;
The variant-caller;
The variant-caller's parameters;
Whether the DNA was amplified, and how much;
Whether you are talking about individual or population studies;
And a legend for the graph.