Low coverage whole genome sequencing reveal excess heterozygosity for multiple SNPs. How to filter?
Entering edit mode
3 months ago
beausoleilmo ▴ 580

I've been talking to other researchers that are using low-coverage whole genome sequencing and found that a couple had the same problem: it appears that many SNPs (a lot more than expected) have higher heterozygosity. I heard this mentioned for Salmonids, Lupine, Birds, etc.

  • Is is a widespread phenomena?
  • What is causing this?
  • How to 'filter' so that we remove this excess heterozygosity?

The plot below shows the genotype frequency (y) as a function of allele frequency (x).

enter image description here

lcWGS WGS heterozygosity • 386 views
Entering edit mode
3 months ago

I don't understand your plot. Perhaps a legend would help? I also don't know what you mean by "genotype frequency"; is that the ratio of het/homo alleles? Or do you mean population-wide? Or... looking at it again, I think you mean that there are zero variants marked heterozygous on your graph? Those fit lines are ridiculous, by the way; the graph would be clearer without them. There are 3 straight lines (y=2x, y=0, and y=1+(-2x)) with some outliers. But I don't know why you have nothing with an AF>0.6; looks like some cutoff parameter.

But anyway, it's logical that certain low coverage ranges will have outsized numbers of het SNPs; that's because if you have a read depth of 3 and 1 read has a sequence error, presto, you get a het SNP called... depending on your settings. And depth. False homozygous SNP calls are much less likely unless you fail to do duplicate removal or call variants at depth=1.

You can get rid of many of these false het calls by sequencing at a higher depth, doing duplicate removal, doing quality-score recalibration, flowcell-position-sensitive read-filtering, using a better aligner that doesn't call indels as substitutions, and setting your minimum variant-calling depth to above whatever these spurious SNPs are called at. Hmm, a using better and non-human-specific variant-caller is a good idea too. Specifically for variant-calling from noisy Novaseq data, it's also useful to cull particularly error-prone reads when doing population studies or very low sequencing depths.

Basically, I'd be interested in:

  • The depth at which these false calls occur;
  • The sequencing platform;
  • The preprocessing steps employed;
  • The aligner;
  • The variant-caller;
  • The variant-caller's parameters;
  • Whether the DNA was amplified, and how much;
  • Whether you are talking about individual or population studies;
  • And a legend for graph.
Entering edit mode

You're right, I wasn't explaining the problem clearly. Thanks for the directions!

  • The depth; coverage ~3.00X ± 2.50 SD
  • The sequencing platform; NovaSeq 6000 S4 (Illumina, CA)
  • The preprocessing steps employed; https://bitbucket.org/wegmannlab/atlas-pipeline/wiki/Home with Atlas
  • The aligner; BWA
  • The variant-caller; https://bitbucket.org/wegmannlab/atlas-pipeline/wiki/Home with Atlas
  • The variant-caller's parameters; Have to check ... There was an MAF of 0.05.
  • Whether the DNA was amplified, and how much; PCR, 13 cycles
  • but other manipulations were done because of bubbles Whether you are
  • talking about individual or population studies; These are individuals
  • And a legend for graph. red = AltAlt, blue = RefAlt, green = RefRef

The graph is based on per locus allele frequency (x axis) in relation to the 3 genotypes frequencies possible at each loci (red = AltAlt, blue = RefAlt, green = RefRef). This graph is basically showing the expected genotype frequencies (Hardy-Weinberg; HW, as the lines in the plot), with the 'real' genotype frequencies found in the population for each loci. This is way off what is expected and almost all sites are significantly different from what is expected assuming HW. I expect to find some sites that are not in HW but not all of them! (see something like this: https://gcbias.org/2011/10/13/population-genetics-course-resources-hardy-weinberg-eq/, note that to get AF>0.5, they randomly selected sites to be on the other side, but you are right, I don't know on top of my head, why there is no AF >0.6 here...)

The thing is that we looked at lcWGS data that was already published and found the same excess heterozygosity.


Login before adding your answer.

Traffic: 1563 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6