Interpreting Gaps at Pos 0 in Terms of VCF
7.3 years ago
pld 5.0k

I'm writing a Python script to convert Clustal-formatted alignments into VCF files. I'm stuck on how to interpret a gap at the start of an alignment:

ENG1-REF-K      ATTTAAGTGAATAGCTTGGCTATCTCACTTCCCCTCGTTCTCTTGCAGAACTTTGATTTT
MERS_EMC_V      ---------------------------------------------CAGAACTTTGATTTT
                                                             ***************

The VCF format seems to assume that there is a base upstream of the deletion. For example, if I have ACGT and A-GT, the VCF record should be REF: AC, ALT: A. The deleted base is at position 2, but the record's POS is 1, because POS refers to the first base of REF.
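A minimal Python sketch of that anchoring rule for an interior deletion. The function name `interior_deletion_record` is hypothetical, and the sketch assumes the reference row has no gaps, the gap run does not start at column 1, and the base before the gap is identical in both rows:

```python
def interior_deletion_record(ref_aln, alt_aln):
    """Return (POS, REF, ALT) for the first gap run in alt_aln.

    Assumes ref_aln contains no gaps and that the base just before
    the gap run matches between the two rows.
    """
    start = alt_aln.index("-")  # 0-based column of the first deleted base
    if start == 0:
        raise ValueError("gap starts at column 1; no preceding base to anchor on")
    end = start
    while end < len(alt_aln) and alt_aln[end] == "-":
        end += 1
    anchor = start - 1  # 0-based index of the base before the event
    pos = anchor + 1    # VCF POS is 1-based and points at the anchor base
    return pos, ref_aln[anchor:end], ref_aln[anchor]


print(interior_deletion_record("ACGT", "A-GT"))  # (1, 'AC', 'A')
```

The leading-gap case is deliberately rejected here, since there is no preceding base to include.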

http://samtools.github.io/hts-specs/VCFv4.2.pdf

How are terminal deletions represented in VCF?

clustal vcf alignment msa
7.3 years ago
Zhaorong ★ 1.3k

From the VCF (Variant Call Format) version 4.1 specification (and also the 4.2):

"the REF and ALT Strings must include the base before the event (which must be reflected in the POS field), unless the event occurs at position 1 on the contig in which case it must include the base after the event".

POS = 1

REF = ATTTAAGTGAATAGCTTGGCTATCTCACTTCCCCTCGTTCTCTTGC

ALT = C
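The same rule can be sketched in Python for a deletion that starts at position 1. This is only an illustration: the function name `leading_deletion_record` is hypothetical, and it assumes the reference row has no gaps and that the first base after the gap run is identical in both rows.

```python
def leading_deletion_record(ref_aln, alt_aln):
    """Return (POS, REF, ALT) for a gap run at the start of alt_aln.

    Per the quoted rule, an event at position 1 is anchored on the
    base AFTER the event, so POS stays 1.
    """
    gap_len = len(alt_aln) - len(alt_aln.lstrip("-"))  # length of leading gap run
    if gap_len == 0:
        raise ValueError("alignment does not start with a gap")
    if gap_len >= len(ref_aln):
        raise ValueError("no base after the gap to anchor on")
    following = ref_aln[gap_len]          # first base after the deleted run
    ref = ref_aln[:gap_len] + following   # deleted bases plus the following base
    return 1, ref, following


ref_row = "ATTTAAGTGAATAGCTTGGCTATCTCACTTCCCCTCGTTCTCTTGCAGAACTTTGATTTT"
alt_row = "-" * 45 + "CAGAACTTTGATTTT"
pos, ref, alt = leading_deletion_record(ref_row, alt_row)
print(pos, len(ref), alt)  # 1 46 C
```

For the alignment in the question this gives the same POS, REF, and ALT as listed above.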