Both bcftools norm and vt normalization failing for the same variant?
4 months ago
jpuntomarcos ▴ 40

Hi,

I want to left-normalize (5') all genomic variants in my pipeline, but I get an unexpected result for the 1:17371287 GAGGT/- variant. If I use this VCF as input:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
1   17371286    1:17371287_GAGGT/-  TGAGGT  T   .   PASS


The output for both vt normalization and bcftools norm is

1   17371285    1:17371287_GAGGT/-  ATGAGG  A


That is, the variant has been moved 1 position to the left. However, if we check the reference, there seems to be no repeat pattern to justify that shift.

It seems that the input VCF, TGAGGT / T, is ambiguous and makes both normalizers treat the deletion as running from the first T to the last G (TGAGG) instead of from the G to the last T (GAGGT). So I tried a more explicit variant description as VCF input:

1   17371283    1:17371287_GAGGT/-  ATATGAGGTTTGTCT ATATTTGTCT


However, the result is the same, the variant is again moved to the left:

1   17371285    1:17371287_GAGGT/-  ATGAGG  A


Am I missing something? Any help would be very welcome :)

Note: Websites refer to the rs786202100 indel with both coordinates, 1:17371286-17371290 and 1:17371287-17371291 (ex1, ex2), which makes it all a bit more confusing.

4 months ago

To me the behavior looks correct.

Let's check it manually.

This is the sequence we have:

CATATGAGGTTTGTC

The VCF variant description does this:

CATA TGAGGT TTGTC --> CATA T TTGTC

bcftools norm reports this:

CAT ATGAGG TTTGTC --> CAT A TTTGTC

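So the resulting sequence is the same, but the second representation starts one base further towards 5'. The shift is possible because the base right before the deleted GAGGT is a T, the same as its last base, so deleting TGAGG one position to the left gives an identical sequence. It cannot move any further, because the next base to the left is an A while TGAGG ends in G.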
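
The manual check above can also be reproduced with a small script. This is only a sketch of the usual left-normalization procedure (trim shared trailing bases, extend to the left when an allele becomes empty, then trim shared leading bases); it is not the actual code of bcftools norm or vt. The function names and the assumption that the 15-base window starts at 1:17371282 (derived from the coordinates in the question) are mine.

```python
REF_WINDOW = "CATATGAGGTTTGTC"   # reference sequence from the manual check above
WINDOW_START = 17371282          # assumed 1-based coordinate of the first base (C)


def left_normalize(ref_seq, seq_start, pos, ref, alt):
    """Left-align and trim one REF/ALT pair against a reference window."""
    changed = True
    while changed:
        changed = False
        # Drop the last base while both alleles end with the same base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, prepend the preceding reference base.
        if not ref or not alt:
            pos -= 1
            idx = pos - seq_start
            if idx < 0:
                raise ValueError("variant shifted past the start of the window")
            ref, alt = ref_seq[idx] + ref, ref_seq[idx] + alt
            changed = True
    # Drop shared leading bases, keeping at least one base per allele.
    while len(ref) >= 2 and len(alt) >= 2 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


def apply_variant(ref_seq, seq_start, pos, ref, alt):
    """Return the window sequence after substituting REF with ALT at pos."""
    i = pos - seq_start
    if ref_seq[i:i + len(ref)] != ref:
        raise ValueError("REF allele does not match the reference window")
    return ref_seq[:i] + alt + ref_seq[i + len(ref):]


# First VCF record from the question.
print(left_normalize(REF_WINDOW, WINDOW_START, 17371286, "TGAGGT", "T"))
# (17371285, 'ATGAGG', 'A')

# The longer record from the question ends at the same representation.
print(left_normalize(REF_WINDOW, WINDOW_START, 17371283,
                     "ATATGAGGTTTGTCT", "ATATTTGTCT"))
# (17371285, 'ATGAGG', 'A')

# Both representations produce the same sequence.
print(apply_variant(REF_WINDOW, WINDOW_START, 17371286, "TGAGGT", "T"))
print(apply_variant(REF_WINDOW, WINDOW_START, 17371285, "ATGAGG", "A"))
```

Both input records collapse to the same record because writing more flanking bases does not change which sequence is deleted; the shared bases are simply trimmed away again before the left shift is evaluated.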