I noticed that GATK sometimes calls two consecutive indels like the two below. The first, at position 3479486, is a change from AAG to A. The second, at 3479487, is a change from AG to A. Both indels survived quite strict quality filtering, both are homozygous, and both are supported by 54 reads. The two records are shown below.
chr13 3479486 . AAG A 1640.73 PASS AC=2;AF=1.00;AN=2;DP=56;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=53.87;MQ0=0;QD=29.30;SOR=0.767 GT:AD:DP:GQ:PL 1/1:0,54:56:99:1678,151,0
chr13 3479487 . AG A 1448.73 PASS AC=2;AF=1.00;AN=2;DP=56;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=53.87;MQ0=0;QD=25.87;SOR=0.767 GT:AD:DP:GQ:PL 1/1:0,54:56:99:1486,160,0
My reference in the region is as follows
This convinced me that GATK somehow got confused, and is calling two different variants for the same event. Realignment near indels has already been performed.
For downstream analysis I want a general way of dealing with this issue by removing one of the two calls.
Are you aware of any solution for this?
EDIT (Sept 2nd):
I found that the solution provided in a Biostars post might work for me (so maybe my question is a duplicate?)
bcftools filter --IndelGap 3 infile.vcf > outfile.vcf
I will stick to it, but two minor improvements would be great!
1) I would like to remove indels that overlap, irrespective of the distance between them.
2) I would like to choose which indel to remove based on some quality information (it looks like bcftools always removes the second instance).
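For what it's worth, here is a minimal Python sketch of what I have in mind for both points. It is not part of GATK or bcftools; the function names (`ref_span`, `is_indel`, `drop_overlapping_indels`) are my own, hypothetical ones. It assumes a coordinate-sorted, tab-separated VCF, treats two indels as overlapping when their REF alleles share at least one reference base (anchor base included), and uses the QUAL column as the quality criterion. Indels that overlap in a chain are grouped into one cluster, and only the record with the highest QUAL in that cluster is kept. Symbolic alleles such as `<DEL>` are not handled.

```python
def ref_span(pos, ref):
    """Return the 1-based inclusive reference interval covered by REF."""
    return pos, pos + len(ref) - 1


def is_indel(ref, alt):
    """True if any ALT allele differs in length from REF."""
    return any(len(a) != len(ref) for a in alt.split(","))


def drop_overlapping_indels(lines):
    """Keep only the highest-QUAL record in each cluster of overlapping indels.

    Assumption: `lines` are tab-separated VCF lines sorted by chromosome
    and position. Header lines and non-indel records pass through unchanged.
    """
    lines = list(lines)
    keep = [True] * len(lines)
    best = None          # index of the best record in the current cluster
    best_qual = 0.0
    cl_chrom, cl_end = None, -1

    for i, line in enumerate(lines):
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        if not is_indel(ref, alt):
            continue
        # A missing QUAL (".") is treated as 0 so that it loses any comparison.
        qual = float(fields[5]) if fields[5] != "." else 0.0
        start, end = ref_span(pos, ref)

        if best is not None and chrom == cl_chrom and start <= cl_end:
            # Overlaps the current cluster: keep whichever has the higher QUAL.
            if qual > best_qual:
                keep[best] = False
                best, best_qual = i, qual
            else:
                keep[i] = False
            cl_end = max(cl_end, end)
        else:
            # Start a new cluster with this indel.
            best, best_qual = i, qual
            cl_chrom, cl_end = chrom, end

    return [line for line, k in zip(lines, keep) if k]
```

Called with the lines of a file, for example `drop_overlapping_indels(open("infile.vcf"))`, it returns the lines to write out. On the two records above it would keep the call at 3479486 (QUAL 1640.73) and drop the one at 3479487 (QUAL 1448.73). Another criterion, such as QD from the INFO field, could be swapped in where QUAL is parsed.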