Question: Why do indel variants need a leading base according to the VCF spec?
William wrote:

According to the VCF spec, indel variants need a leading non-polymorphic nucleotide on all alleles.

https://samtools.github.io/hts-specs/VCFv4.2.pdf

For simple insertions and deletions in which either the REF or one of the ALT alleles would otherwise be null/empty, the REF and ALT Strings must include the base before the event (which must be reflected in the POS field), unless the event occurs at position 1 on the contig in which case it must include the base after the event;

Why is this? Other systems accept or require variants without the leading base.

Simple deletion:

VCF: CAGTAGTGA/C
Other: AGTAGTGA/-

Simple insertion:

VCF: C/CAGTAGTGA 
Other: -/AGTAGTGA

The starting positions of the variants of course also differ by 1 between these two notation forms.

Is one notation form better than the other?

Is lossless conversion always possible between these two notation forms?

I.e. just add or remove the leading base and increment or decrease POS by 1? Is there a script/tool/code snippet that already does this?

modified 17 months ago by RamRS • written 17 months ago by William

There's no simple bidirectional conversion: indels at the beginning of a contig need a different anchor base (and position) than those elsewhere, and to add the anchor base you need to know the reference as well. But VCF->Other should be lossless. With multi-allelic sites all on the same line (say, an insertion and a deletion at the same location), though, I don't even know what you're supposed to do, or whether it can even be represented. In short, the VCF version of indels is a weird design that causes a lot of headaches. The "Other" version is much easier to work with.
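To illustrate why the reverse direction needs the reference, here is a minimal sketch of Other->VCF. Everything in it is an assumption made for illustration: the class and method names are hypothetical, the contig is held as an in-memory string, positions are 1-based, an insertion's position is taken to be the base immediately after the insertion point, and the result is returned as a plain "POS:REF/ALT" string.

```java
class OtherToVcf {
    /**
     * contig: reference sequence of the contig (hypothetical in-memory string).
     * pos:    1-based position of the first deleted base, or of the base right
     *         after the insertion point (an assumed convention).
     * ref/alt: alleles in "Other" notation, with "-" for the empty allele.
     * Returns "POS:REF/ALT" in VCF-style notation.
     */
    static String toVcf(String contig, int pos, String ref, String alt) {
        String r = ref.equals("-") ? "" : ref;
        String a = alt.equals("-") ? "" : alt;
        if (!r.isEmpty() && !a.isEmpty()) {
            // Neither allele is empty, so no anchor base is required.
            return pos + ":" + r + "/" + a;
        }
        if (pos > 1) {
            // Usual case: prepend the base before the event, shift POS left by one.
            char before = contig.charAt(pos - 2);
            return (pos - 1) + ":" + before + r + "/" + before + a;
        }
        // Event at position 1: append the base after the event, POS stays 1.
        char after = contig.charAt(r.length());
        return 1 + ":" + r + after + "/" + a + after;
    }

    public static void main(String[] args) {
        String contig = "GCAGTAGTGAT";
        System.out.println(toVcf(contig, 3, "AGTAGTGA", "-")); // 2:CAGTAGTGA/C
        System.out.println(toVcf(contig, 3, "-", "AGTAGTGA")); // 2:C/CAGTAGTGA
        System.out.println(toVcf(contig, 1, "GC", "-"));       // 1:GCA/A
    }
}
```

The only place the two branches differ is which reference base gets looked up, which is exactly the part that cannot be done from the variant alone.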

That said, assuming one variant per line, and assuming VCF->Other (which is much easier than the reverse), here's a snippet of my code (note that my internal format is 0-based half-open):

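// pos is the 1-based VCF POS; start is the 0-based start in the internal format.
if(alt.length!=reflen && alt.length>0){
    // Indel: drop the leading anchor base from ALT (assumes the anchor precedes the event).
    alt=Arrays.copyOfRange(alt, 1, alt.length);
    // The event begins one base after the anchor, which is index pos in 0-based terms.
    start=pos;
}else{
    // Otherwise (e.g. a SNP): just convert the 1-based POS to 0-based.
    start=pos-1;
}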
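For readers who want something they can run directly, here is a self-contained sketch of the same VCF->Other direction. It is not the code above: the class and method names are hypothetical, alleles are strings rather than byte arrays, positions stay 1-based, and "-" stands for the empty allele. It assumes one simple insertion or deletion per line (no symbolic alleles, no complex indels), and it guesses the position-1 case from the leading bases not matching.

```java
class VcfToOther {
    /** Returns "pos:ref/alt" with a 1-based pos and "-" for an empty allele. */
    static String toOther(int pos, String ref, String alt) {
        if (ref.length() == alt.length()) {
            // SNP or MNP: nothing to trim.
            return pos + ":" + ref + "/" + alt;
        }
        if (ref.charAt(0) == alt.charAt(0)) {
            // Usual case: drop the shared leading base and move POS right by one.
            return (pos + 1) + ":" + dash(ref.substring(1)) + "/" + dash(alt.substring(1));
        }
        // Assumed position-1 case: the anchor base follows the event, so drop the last base.
        return pos + ":" + dash(ref.substring(0, ref.length() - 1))
                + "/" + dash(alt.substring(0, alt.length() - 1));
    }

    private static String dash(String s) {
        return s.isEmpty() ? "-" : s;
    }

    public static void main(String[] args) {
        System.out.println(toOther(2, "CAGTAGTGA", "C")); // 3:AGTAGTGA/-
        System.out.println(toOther(2, "C", "CAGTAGTGA")); // 3:-/AGTAGTGA
    }
}
```

No reference lookup is needed in this direction, since the anchor base is simply discarded.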
modified 17 months ago • written 17 months ago by Brian Bushnell
RamRS wrote:

I'd say convention is the principal reason. Ideally, you're looking at semantics where the base REF, found at position POS, is substituted by the base ALT. For indels, these semantics break down unless a base is specified for REF.

BTW, I'm sure this has been often debated - I recall an article I read from way back when on the flaws in the VCF format, including the inconsistent representation of indels.

modified 17 months ago • written 17 months ago by RamRS