VCF Format: Different Ways To Express The Same Variant Information
10.4 years ago
Pablo ★ 1.9k

I was analysing sequences for two different samples, and after creating the VCF files we noticed that insertions are sometimes expressed differently. This happens even though we used exactly the same process to create both files.

Here are three examples (lines have been truncated for brevity):

10  11805838    .   C   CT  211.50  .   AC=11;AF1=1;AN=12;CI95=1,1;DP4=0,0,19,11;DP=38;FQ=-125;G3=7.906e-63,1e-27,1;INDEL;MQ=49;PV4=1,1,0.21,1;SF=0,1,2,3,4,5
10  11805838    .   CG  CTG 200.33  .   AC=8;AF1=0.5;AN=12;CI95=0.5,0.5;DP4=7,7,10,4;DP=36;FQ=194;G3=7.924e-47,1,5e-50;INDEL;MQ=49;PV4=0.44,1,0.077,1;SF=0,1,2,3,4,5

X   122318386   .   A   AG  214.00  .   AC=12;AF1=1;AN=12;CI95=1,1;DP4=0,0,10,12;DP=24;FQ=-101;G3=3.147e-53,1.585e-20,1;INDEL;MQ=48;SF=0,1,2,3,4,5
X   122318386   .   AC  AGC 214.00  .   AC=12;AF1=1;AN=12;CI95=1,1;DP4=0,0,15,12;DP=29;FQ=-116;G3=3.147e-59,5.012e-25,1;INDEL;MQ=45;SF=0,1,2,3,4,5

11  118889247   .   AT  AGT 209.00  .   AC=12;AF1=1;AN=12;CI95=1,1;DP4=0,0,8,3;DP=16;FQ=-67.5;G3=1.252e-43,6.308e-14,1;INDEL;MQ=36;SF=0,1,2,3,4,5
11  118889247   .   A   AG  214.00  .   AC=12;AF1=1;AN=12;CI95=1,1;DP4=0,0,7,6;DP=14;FQ=-73.5;G3=3.147e-50,2.512e-16,1;INDEL;MQ=47;SF=0,1,2,3,4,5

Unless I'm missing something obvious, these are three equivalent variants expressed differently (the same insertion is expressed in two different ways).
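A quick way to check the equivalence is to apply both records of each pair to the same stretch of reference and compare the resulting sequences. This is only a sketch: `apply_variant` is a hypothetical helper, and the reference context for each site is assumed to be the longer REF allele of the pair (the only reference bases visible in the records above).

```python
def apply_variant(ref_context, ref, alt):
    """Replace the REF allele at the start of ref_context with ALT."""
    if not ref_context.startswith(ref):
        raise ValueError("REF allele does not match the reference context")
    return alt + ref_context[len(ref):]


# (assumed reference context, (REF, ALT) of one record, (REF, ALT) of the other)
pairs = [
    ("CG", ("C", "CT"), ("CG", "CTG")),   # 10:11805838
    ("AC", ("A", "AG"), ("AC", "AGC")),   # X:122318386
    ("AT", ("A", "AG"), ("AT", "AGT")),   # 11:118889247
]

results = []
for context, (ref1, alt1), (ref2, alt2) in pairs:
    hap1 = apply_variant(context, ref1, alt1)
    hap2 = apply_variant(context, ref2, alt2)
    results.append(hap1 == hap2)

print(results)
```

If both records of a pair produce the same sequence, they describe the same insertion and differ only in how many trailing reference bases are included.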

This ambiguity makes it harder to compare results. We used BWA + Samtools + VcfTools to create both files (using exactly the same parameters).

My questions are:

1-) Shouldn't VCF standard specify that the representation should be minimal to avoid this kind of confusion?

2-) Is there an easy way to fix the VCF files to avoid this?

Thank you.

vcf vcftools samtools variant • 3.6k views
10.4 years ago
Laura ★ 1.7k

I think the longer (padded) record in each pair is wrong, and VCF should not be used in that way.

Have a look at

I think that when expressing a variant which is longer than a single base pair, the position and the first base of both the reference and alternative allele strings should always be the last base that is common to the two alleles in the reference genome. You are not meant to give the first base after the event as well; otherwise you get the confusion you described above.

E.g., by eye at least, I would interpret:

11 118889247 . AT AGT 209.00 . AC=12

as a 1-to-2 base substitution, with T being replaced by GT, so in the two genomes you would get ATN or AGTN, whereas


11 118889247 . A AG 214.00 . AC=12;

would be a 1-base insertion, so you would have AN or AGN.
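The rule above can be sketched as a small trimming step: drop bases shared at the end of REF and ALT while keeping at least one base in each allele. This is a minimal illustration, not a full normalizer. `trim_alleles` is a hypothetical name, and the sketch does not left-align indels within repeats, which requires the reference sequence (dedicated tools such as `bcftools norm` or `vt normalize` handle that case).

```python
def trim_alleles(pos, ref, alt):
    """Return (pos, ref, alt) with shared flanking bases removed.

    Each allele keeps at least one base, so an insertion is still
    anchored on the last reference base before the event.
    """
    # Trim shared trailing bases; POS is unaffected.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Trim shared leading bases; POS moves right by one per base removed.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# The padded records from the question reduce to the short form.
print(trim_alleles(11805838, "CG", "CTG"))
print(trim_alleles(122318386, "AC", "AGC"))
print(trim_alleles(118889247, "AT", "AGT"))
```

After trimming, both records of each pair carry the same POS, REF and ALT, so the two files can be compared directly on those three columns.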


I do agree with you that it "should" be incorrect. But I've read the spec, and I don't see where it says that this is actually incorrect (maybe I missed it?).

It does say that you have to express the "alternate non-reference alleles", but it doesn't say they must be expressed in a minimal way. Again, I agree it should say that; otherwise you could write out the whole chromosome starting from that position and still comply with the spec (which is ridiculous).

The other problem is that the same software creates both forms.

10.0 years ago
Allpowerde ★ 1.2k

Taking up this issue again: is there a way to fix malformed VCF files (e.g. from Complete Genomics or Illumina) that allow the reference allele to be empty for insertions?


This is not a forum. Please post a new question.

