Next-generation DNA sequencing: How to tell systematic errors from real SNVs?
2
0
Entering edit mode
8.7 years ago

Hi all,

I hope you're well. I have one question regarding variant calling in NGS data.

We have sequenced a number of human germline exomes, captured with Agilent SureSelect 50mb All Exon kit, in the Illumina GAII platform, paired-end, 75bp reads. We have aligned with BWA and called variants with Samtools mpileup.

I was looking at one particular variant, located at GRCh37 7:124499165. This variant is called heterozygously with OK SNP quality (QUAL 20-41) in 6/41 exomes, and with low qualities in a further 16 exomes. Sequencing depths for the exomes where variant has been called with OK quality ranges from 61 to 139 HQ reads, however, although the reference allele (A) is supported by reads in both directions (in the majority of cases 2-4 fold more reads in the forward direction though), the alternative allele is supported, in all cases, by only reads in the reverse direction (range 17 to 45 per exome). Of course, these variants will not pass the strand-bias test, so will be filtered out during analysis.

However, this variant is present in the exac dataset with a high AF: http://exac.broadinstitute.org/variant/7-124499165-A-C, which makes me think that maybe it's not being dropped by exome post-filtering. I believe this variant might be a systematic mapping/calling error because 1. the reads supporting the variant are in only one direction in all cases (at least in one batch of data), suggesting a sequence-specific error, and 2. because other people seem to have observed this variant, which apparently passed all their filters, and attempted Sequenom validation but failed (Bass AJ et al. Genomic sequencing of colorectal adenocarcinomas identifies a recurrent VTI1A-TCF7L2 fusion. Nat Genet 43, 964–968 (2011)).

However, the question becomes, is there any way to tell a (possibly) systematic error of this kind from a real variant? The reason I'm interested in this question is because we are doing a new experiment in which we are sequencing over this region with Fluidigm technology, in which we get reads in only one direction so we lose the ability to do strand-bias testing.

Thank you very much,

Daniela

SNP next-gen sequencing alignment sequence • 2.9k views
ADD COMMENT
0
Entering edit mode

If you suspect something is a mapping/calling artifact, I suggest you try a different aligner (such as BBMap) and possibly a different variant-caller (such as GATK) to see if the variation remains. You could even try a different reference (say, HG20).

ADD REPLY
0
Entering edit mode

Thanks very much for your suggestion Brian. I used a different caller (GATK HaplotypeCaller) and indeed it seems the variant is not being called. However I'm a bit puzzled as to why it is so common in the exac data if I suspect loads of those exomes used GATK. Maybe they used GATK UnifiedGenotyper, which does call it. I will ask over there. Thanks again! :)

ADD REPLY
0
Entering edit mode
8.6 years ago
Christian ★ 3.0k
Check if the variant overlaps with any known repetitive element, segmental duplication, or CNV. If it does, it is less likely true. You can get this information for example from UCSC tracks.
ADD COMMENT
0
Entering edit mode

Thank you for your useful suggestion, Christian. I have activated all repeat tracks in the UCSC Genome Browser for this variant but none seem to overlap it... will keep looking for the solution.

ADD REPLY

Login before adding your answer.

Traffic: 2662 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6