Question: Relationship Between Dp Score And Coverage Visualized By Igv
gravatar for Tancata
5.6 years ago by
Newcastle, UK
Tancata200 wrote:


I have been analyzing some resequencing data (8 genomes from a non-model organism) and I've been trying to filter SNPs by quality. The organism is haploid (I'm pretty sure), but I noticed a small proportion of heterozygous SNP calls after bwa alignment to the reference genome and fairly standard variant calling with samtools mpileup. When I looked at the putative heterozygous positions in IGV, I saw an (apparently) very obvious explanation for them: they were all called at positions with incredibly high coverage (e.g. 2000X, compared to 50-100X across other contigs). I guess these are reads from repetitive loci, of which there are quite a few in the genome. (duplicate reads were removed as part of the initial samtools pipeline).

I wanted to filter out these cases to identify a set of high-quality SNPs for variant score recalibration. However, the DP scores (and DP4 scores) for these spurious "heterozygous" sites were barely any higher than for other homozygous sites which had much lower coverage as visualized by IGV. (mean DP = 32 for "heterozygous", 24 for "homozygous).

Using bedtools coverage on the initial VCF files, I got coverage values corresponding to the massive peaks seen in IGV (2000+). There was a massive difference in bedtools coverage scores between hetero and homo calls - median 1664 (hetero) vs 260 (homo). So I was able to use these values to easily filter out the spurious high-coverage cases.

My question: why were the DP values so low for these sites, and is there a way to access this info. (which turned out to be very useful) for filtering SNPs using mpileup?

A follow-up question: I noticed that the coverage for homozygous SNPs (median 260) was a good deal higher than for positions matching the reference assembly (which were in the 50-100x range) [and the heterozygous SNPs were even higher]. Is this worrisome? The organism is a crazy parasite with low GC content - about 25-30% at most. Could this result from less obvious cases of reads from other loci mapping to the wrong place and causing false-positive SNP calls?

variant-calling ngs samtools • 1.9k views
ADD COMMENTlink modified 5.6 years ago by Chris Miller20k • written 5.6 years ago by Tancata200
gravatar for Chris Miller
5.6 years ago by
Chris Miller20k
Washington University in St. Louis, MO
Chris Miller20k wrote:

Were the DP scores filtered to include only high-quality reads? If this is a repetitive region, you might expect to see a lot of poorly mapping reads.

ADD COMMENTlink written 5.6 years ago by Chris Miller20k

Yes, they were filtered. Perhaps this is why they were so low.

ADD REPLYlink written 5.6 years ago by Tancata200
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1300 users visited in the last hour