Question: GATK HC not calling variants at the edges after clipping probes
0
gravatar for nbhardwaj
6 days ago by
nbhardwaj80
United States
nbhardwaj80 wrote:

Hi, I am facing a strange issue with GATK 3.4. I have a set of PE fastq files. I first ran the variant calling pipeline using the following steps:

bwa mem --> sam to bam --> mark duplicates using picard --> RealignerTargetCreator GATK 3.4 --> IndelRealigner --> HaplotypeCaller --> GenotypeGVCFs

Then I saw that the probe sequences that were used to design the region of interest (its a custom-designed panel) overlapped with the some variants of interest. So, the genotypes of some of these variants was 0/1 (due to probes seqs in those positions) instead of the the true 1/1 (I am using a gold standard sample).

So, then I clipped the first 27 bases of each read (the probes are all around that range 22-31 nts with 27 being the most common length). And then I ran the same above pipeline but some of the variants were not called in spite of all reads having those variants.

I am attaching a snapshot of the bam file alignment with the variant in orange in the center. Top track is the first case (before clipping the probes) where the probes are present in the upper reference block without the variant and the lower block with the insert that carry the variant and hence the 0/1 call.

https://ibb.co/dTpRfw

Snapshot

Lower track is after the clipping of the probe sequences and so only the insert with the variant are left. But the variant is still not called in the VCF file.

I read in another thread that this happens because the tool needs 50bp on either side to do proper reassembly. Is that correct? If so, how do I get around it? Should I not use the IndelRealignment? Any other suggestion to solve this issue?

Thanks a bunch in advance!!

sequencing gatk • 87 views
ADD COMMENTlink modified 6 days ago by Kevin Blighe6.6k • written 6 days ago by nbhardwaj80
0
gravatar for Kevin Blighe
6 days ago by
Kevin Blighe6.6k
Republic of Ireland (Éire)
Kevin Blighe6.6k wrote:

I imagine that the variant is remaining undetected due to 'read position' bias, because it's now only being called at the end of one set of reads and lacks support from other mappings. Take a look at ReadPosRankSumTest:

Seeing an allele only near the ends of reads is indicative of error, because that is where sequencers tend to make the most errors. However, some variants located near the edges of sequenced regions will necessarily be covered by the ends of reads, so we can't just set an absolute "minimum distance from end of read" threshold. That is why we use a rank sum test to evaluate whether there is a difference in how well the reference allele and the alternate allele are supported.

I think that you will find it difficult to call this. If you managed to reduce the read end bias, you would then call false-positives in other reads.

One method that I would encourage you to try is to 'reshuffle' your reads and then re-call variants. Use Picard DownsampleSam on your aligned BAM and extract reads 'randomly' at frequencies of 0.9, 0.8, 0.7, etc., and then call variants on the outputs. I would expect the variant to be called correctly in one of these as, by 'reshuffling', the probabilities of calling the variant are also altered.

ADD COMMENTlink modified 6 days ago • written 6 days ago by Kevin Blighe6.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1468 users visited in the last hour