Why do indel variants need a leading base according to the VCF spec?
7.0 years ago
William ★ 5.3k

According to the VCF spec, indel variants need a leading non-polymorphic nucleotide for all alleles.

https://samtools.github.io/hts-specs/VCFv4.2.pdf

For simple insertions and deletions in which either the REF or one of the ALT alleles would otherwise be null/empty, the REF and ALT Strings must include the base before the event (which must be reflected in the POS field), unless the event occurs at position 1 on the contig in which case it must include the base after the event;

Why is this? Other systems accept or require variants without the leading base.

Simple deletion:

VCF: CAGTAGTGA/C
Other: AGTAGTGA/-

Simple insertion:

VCF: C/CAGTAGTGA 
Other: -/AGTAGTGA

The starting positions of the variants of course also differ by 1 between these two notation forms.

Is one notation form better than the other?

Is lossless conversion always possible between these two notation forms?

I.e. just add or remove the leading base and increment or decrease POS by 1? Is there a script/tool/code snippet that already does this?


There's no simple bidirectional conversion, because indels at the beginning of a contig need a different base (and position) than those elsewhere, and you also need to know the reference. But VCF->Other should be lossless. With multi-allelic sites all on the same line (say, an insertion and a deletion at the same location), though, I don't even know what you're supposed to do, or if it can even be represented. In short, the VCF version of indels is a weird design that causes a lot of headaches. The "Other" version is much easier to work with.

That said, assuming one variant per line, and assuming VCF->Other (which is much easier than the reverse), here's a snippet of my code (note that my internal format is 0-based half-open):

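// pos is the 1-based VCF POS; start is 0-based. Needs java.util.Arrays.
if(alt.length!=reflen && alt.length>0){
    // Length change: drop the leading anchor base, so the event starts one base later
    alt=Arrays.copyOfRange(alt, 1, alt.length);
    start=pos;
}else{
    // Otherwise (e.g. same-length substitution): only convert 1-based POS to 0-based
    start=pos-1;
}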
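
For anyone who wants both directions in one place, here is a rough, self-contained sketch (Java 16+ because of the record). It is not the snippet above extended, just an illustration under some assumptions: one biallelic variant per record, 1-based positions on both sides, "-" for an empty allele, and an "Other" insertion position that names the base the inserted sequence goes in front of. The names IndelNotation, Variant, vcfToOther and otherToVcf are made up for this example. Note that only the Other->VCF direction needs the contig sequence, which is passed in as a plain String.

```java
// Sketch only: class, record, and method names are hypothetical.
class IndelNotation {
    /** 1-based position plus alleles; an empty allele is written as "-". */
    record Variant(int pos, String ref, String alt) {}

    private static String orDash(String s) { return s.isEmpty() ? "-" : s; }

    /** VCF -> "Other" for one biallelic record; no reference sequence needed. */
    static Variant vcfToOther(int pos, String ref, String alt) {
        boolean simpleIndel = ref.length() != alt.length()
                && Math.min(ref.length(), alt.length()) == 1;
        if (!simpleIndel) {
            return new Variant(pos, ref, alt); // SNVs, MNVs, complex events pass through
        }
        if (ref.charAt(0) == alt.charAt(0)) {
            // Usual case: shared leading anchor base; the event starts at POS + 1.
            return new Variant(pos + 1, orDash(ref.substring(1)), orDash(alt.substring(1)));
        }
        if (pos == 1 && ref.charAt(ref.length() - 1) == alt.charAt(alt.length() - 1)) {
            // Start of contig: the anchor is the base after the event; POS stays 1.
            return new Variant(1, orDash(ref.substring(0, ref.length() - 1)),
                                  orDash(alt.substring(0, alt.length() - 1)));
        }
        return new Variant(pos, ref, alt); // no shared anchor: leave unchanged
    }

    /** "Other" -> VCF; needs the contig sequence to look up the anchor base. */
    static Variant otherToVcf(int pos, String ref, String alt, String contig) {
        String r = ref.equals("-") ? "" : ref;
        String a = alt.equals("-") ? "" : alt;
        if (!r.isEmpty() && !a.isEmpty()) {
            return new Variant(pos, ref, alt); // not a simple indel: unchanged
        }
        if (pos > 1) {
            String anchor = contig.substring(pos - 2, pos - 1); // base before the event
            return new Variant(pos - 1, anchor + r, anchor + a);
        }
        // Position 1: use the base after the event, as the spec requires.
        String anchor = contig.substring(r.length(), r.length() + 1);
        return new Variant(1, r + anchor, a + anchor);
    }
}
```

With the deletion from the question placed at a made-up POS of 100, vcfToOther(100, "CAGTAGTGA", "C") gives position 101 with AGTAGTGA/-. Multi-allelic lines would have to be split into one record per ALT allele first, which this sketch does not attempt.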
7.0 years ago
Ram 43k

I'd say convention is the principal reason. Ideally, you're looking at semantics where the base REF, found at position POS, is substituted by the base ALT. For indels, these semantics break down if no base is specified for REF.

BTW, I'm sure this has been often debated - I recall an article I read from way back when on the flaws in the VCF format, including the inconsistent representation of indels.
