Question: Why do indel variants need a leading base according to the VCF spec?
William (4.6k) wrote 3.1 years ago:

According to the VCF spec, indel variants need a leading non-polymorphic nucleotide for all alleles:

For simple insertions and deletions in which either the REF or one of the ALT alleles would otherwise be null/empty, the REF and ALT Strings must include the base before the event (which must be reflected in the POS field), unless the event occurs at position 1 on the contig in which case it must include the base after the event;

Why is this? Other systems accept or require variants without the leading base.

Simple deletion:


Simple insertion:


The starting positions of the variants of course also differ by 1 between these two notation forms.

Is one notation form better than the other?

Is lossless conversion always possible between these two notation forms?

I.e. just add or remove the leading base and increment or decrement POS by 1? Is there a script/tool/code snippet that already does this?

modified 3.1 years ago by RamRS (27k) • written 3.1 years ago by William (4.6k)

There's no simple bidirectional conversion, because indels at the beginning of a contig need a different padding base (and position) than those elsewhere, and going from the other notation to VCF also requires knowing the reference. But VCF->Other should be lossless. With multi-allelic sites all on the same line (say, an insertion and a deletion at the same location), though, I don't even know what you're supposed to do, or whether it can even be represented. In short, the VCF version of indels is a weird design that causes a lot of headaches. The "Other" version is much easier to work with.
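To make the reference dependence concrete, here is a minimal sketch of the Other->VCF direction. It is only an illustration under stated assumptions: positions are 1-based, an empty string stands for the empty allele, an insertion's position names the base the new sequence goes in front of, and the whole contig is available as a String. The names `PadToVcf`, `Padded` and `pad` are made up for this example.

```java
final class PadToVcf {
    /** VCF-style allele pair: 1-based POS with padded REF and ALT (hypothetical helper type). */
    static final class Padded {
        final int pos;
        final String ref;
        final String alt;
        Padded(int pos, String ref, String alt) {
            this.pos = pos;
            this.ref = ref;
            this.alt = alt;
        }
        @Override public String toString() { return pos + " " + ref + " " + alt; }
    }

    /**
     * Adds the padding base that VCF requires for simple indels.
     * Assumption: ref/alt are unpadded and "" means an empty allele.
     */
    static Padded pad(String contig, int pos, String ref, String alt) {
        if (!ref.isEmpty() && !alt.isEmpty()) {
            return new Padded(pos, ref, alt);            // no empty allele: nothing to pad
        }
        if (pos > 1) {
            // Usual case: prepend the reference base just before the event.
            String before = contig.substring(pos - 2, pos - 1);
            return new Padded(pos - 1, before + ref, before + alt);
        }
        // Event at position 1: append the reference base just after the event.
        int next = ref.length();                          // 0-based index of that base
        String after = contig.substring(next, next + 1);
        return new Padded(1, ref + after, alt + after);
    }

    public static void main(String[] args) {
        System.out.println(pad("ACGT", 3, "G", ""));     // deletion of G in the middle
        System.out.println(pad("ACGT", 1, "A", ""));     // deletion at the contig start
    }
}
```

Both branches read a base from `contig`, which is why this direction cannot be done from the variant record alone.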

That said, assuming one variant per line, and assuming VCF->Other (which is much easier than the reverse), here's a snippet of my code (note that my internal format is 0-based half-open):

// Indel (ALT length differs from REF length): drop ALT's first, shared leading base.
if(alt.length!=reflen && alt.length>0){
    alt=Arrays.copyOfRange(alt, 1, alt.length);
}
modified 3.1 years ago • written 3.1 years ago by Brian Bushnell (17k)
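For readers who want the VCF->Other direction as a standalone piece, the following is a rough sketch rather than the code from the answer above. It assumes one ALT allele per record, plain base strings (no symbolic alleles such as `<DEL>` or `*`), 1-based positions instead of the 0-based half-open format mentioned above, and an empty string for the empty allele. `StripPadding`, `Trimmed` and `strip` are hypothetical names.

```java
final class StripPadding {
    /** Unpadded allele pair: 1-based position, "" for an empty allele (hypothetical helper type). */
    static final class Trimmed {
        final int pos;
        final String ref;
        final String alt;
        Trimmed(int pos, String ref, String alt) {
            this.pos = pos;
            this.ref = ref;
            this.alt = alt;
        }
        @Override public String toString() { return pos + " '" + ref + "' '" + alt + "'"; }
    }

    /** Removes the VCF padding base from an indel; same-length alleles pass through unchanged. */
    static Trimmed strip(int pos, String ref, String alt) {
        if (ref.length() == alt.length()) {
            return new Trimmed(pos, ref, alt);            // SNV/MNV: no padding base involved
        }
        if (ref.charAt(0) == alt.charAt(0)) {
            // Usual case: shared leading base, so drop it and shift the position by one.
            return new Trimmed(pos + 1, ref.substring(1), alt.substring(1));
        }
        int r = ref.length() - 1;
        int a = alt.length() - 1;
        if (pos == 1 && ref.charAt(r) == alt.charAt(a)) {
            // Contig start: the padding base follows the event, so drop the last base instead.
            return new Trimmed(1, ref.substring(0, r), alt.substring(0, a));
        }
        throw new IllegalArgumentException("no shared padding base: " + ref + ">" + alt);
    }

    public static void main(String[] args) {
        System.out.println(strip(2, "CG", "C"));          // padded deletion of G
        System.out.println(strip(2, "C", "CTT"));         // padded insertion of TT
    }
}
```

No reference sequence is needed here, which is what makes this direction the easy one.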
RamRS (27k, Houston, TX) wrote 3.1 years ago:

I'd say convention is the principal reason. Ideally, you're looking at semantics where the bases REF, found at position POS, are substituted by the bases ALT. For indels written without a base for REF, these semantics break down.
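One way to picture that "REF at POS is replaced by ALT" reading is the small sketch below. It is an illustration only, assuming 1-based POS and a contig held in a String; `ApplyVariant` and `apply` are made-up names.

```java
final class ApplyVariant {
    /** Replaces REF at 1-based POS with ALT, after checking that REF matches the contig. */
    static String apply(String contig, int pos, String ref, String alt) {
        int start = pos - 1;                              // 0-based start of REF
        if (!contig.startsWith(ref, start)) {
            throw new IllegalArgumentException("REF does not match the reference at POS");
        }
        int end = start + ref.length();                   // 0-based exclusive end of REF
        return contig.substring(0, start) + alt + contig.substring(end);
    }

    public static void main(String[] args) {
        System.out.println(apply("ACGT", 2, "CG", "C"));  // padded deletion of G
        System.out.println(apply("ACGT", 2, "C", "CTT")); // padded insertion of TT
        System.out.println(apply("ACGT", 2, "C", "T"));   // plain substitution
    }
}
```

With padded alleles, substitutions, insertions and deletions all go through the same check-and-replace step. With an empty REF the check would compare nothing against the reference, so the position would be the only thing tying the record to the contig.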

BTW, I'm sure this has been often debated - I recall an article I read from way back when on the flaws in the VCF format, including the inconsistent representation of indels.
