I was analysing sequences for two different samples, and after creating the VCF files we noticed that insertions are sometimes expressed differently. This happens even though we used exactly the same process to create both files.
Here are three examples (lines have been truncated for brevity):
10 11805838 . C CT 211.50 . AC=11;AF1=1;AN=12;CI95=1,1;DP4=0,0,19,11;DP=38;FQ=-125;G3=7.906e-63,1e-27,1;INDEL;MQ=49;PV4=1,1,0.21,1;SF=0,1,2,3,4,5
10 11805838 . CG CTG 200.33 . AC=8;AF1=0.5;AN=12;CI95=0.5,0.5;DP4=7,7,10,4;DP=36;FQ=194;G3=7.924e-47,1,5e-50;INDEL;MQ=49;PV4=0.44,1,0.077,1;SF=0,1,2,3,4,5

X 122318386 . A AG 214.00 . AC=12;AF1=1;AN=12;CI95=1,1;DP4=0,0,10,12;DP=24;FQ=-101;G3=3.147e-53,1.585e-20,1;INDEL;MQ=48;SF=0,1,2,3,4,5
X 122318386 . AC AGC 214.00 . AC=12;AF1=1;AN=12;CI95=1,1;DP4=0,0,15,12;DP=29;FQ=-116;G3=3.147e-59,5.012e-25,1;INDEL;MQ=45;SF=0,1,2,3,4,5

11 118889247 . AT AGT 209.00 . AC=12;AF1=1;AN=12;CI95=1,1;DP4=0,0,8,3;DP=16;FQ=-67.5;G3=1.252e-43,6.308e-14,1;INDEL;MQ=36;SF=0,1,2,3,4,5
11 118889247 . A AG 214.00 . AC=12;AF1=1;AN=12;CI95=1,1;DP4=0,0,7,6;DP=14;FQ=-73.5;G3=3.147e-50,2.512e-16,1;INDEL;MQ=47;SF=0,1,2,3,4,5
Unless I'm missing something obvious, each pair describes the same insertion written in two different ways. In every case the longer form only carries one extra trailing reference base (for example, CG to CTG is C to CT followed by an unchanged G).
This ambiguity makes it harder to compare results. We used BWA + Samtools + VcfTools to create both files (using exactly the same parameters).
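To double-check that the pairs above really are equivalent, here is a minimal Python sketch that trims the bases shared by REF and ALT. The function name `trim_alleles` is just something made up for this post, and this is only allele trimming: it does not left-align indels against the reference genome, so it should not be read as a full normalization.

```python
def trim_alleles(pos, ref, alt):
    """Trim bases shared by REF and ALT, keeping at least one base in each.

    Hypothetical helper for illustration only; it does not left-align
    against a reference sequence.
    """
    # Drop shared trailing bases first; the position does not change.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then drop shared leading bases, shifting the position right each time.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# POS, REF, ALT taken from the six records above, grouped in pairs.
pairs = [
    ((11805838, "C", "CT"), (11805838, "CG", "CTG")),
    ((122318386, "A", "AG"), (122318386, "AC", "AGC")),
    ((118889247, "AT", "AGT"), (118889247, "A", "AG")),
]

for first, second in pairs:
    # Each pair should collapse to the same trimmed representation.
    print(trim_alleles(*first), trim_alleles(*second))
```

With the records above, both members of each pair trim down to the same POS/REF/ALT (for instance, CG to CTG becomes C to CT at 11805838), which is why I believe they describe the same insertion.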
My questions are:
1-) Shouldn't the VCF standard specify that the representation should be minimal to avoid this kind of confusion?
2-) Is there an easy way to fix the VCF files to avoid this?