Greetings,
I'm trying to build a sequence for each sample in a multi-sample VCF by applying the variants from the VCF to a reference sequence.
The problem is that a few variants overlap with indels. Some are correctly (I believe) denoted by the "*" symbol as per the VCF 4.3 specification, but others are not: there are lines where the ALT allele is "*" but the record does not seem to overlap with any other variant. Below is an example of such an entry. The variant at position 14586 is called as "*" even though it overlaps neither the previous nor the subsequent variant in this sample.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CAR10
MT 14559 . TAAA TA,T,TAA 4.35685e+06 . AC=1,0,0
MT 14586 . AATATATATATATATATATATAT AATATATAT,AATATATATAT,*,AATATATATATATAT,AATATATATATATATAT,AATAT 881243 . AC=0,0,1,0,0,0
MT 15129 . C A 2.69538e+06 . AC=1
Is this a problem with the VCF file, or am I misunderstanding the format?
I'm using bcftools to do the following:
- subset the multi-sample VCF to get a single-sample VCF (bcftools view)
- normalise (bcftools norm)
- combine reference fasta and the vcf (bcftools consensus)
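Roughly, the steps above correspond to something like the following sketch. The file names (cohort.vcf.gz, reference.fa, CAR10.fa) are placeholders, and the options shown, including the extra indexing step before consensus, are assumptions rather than my exact command lines.

```python
import subprocess


def build_pipeline(multi_vcf, ref_fasta, sample, out_fasta):
    """Return the bcftools commands for one sample as argument lists.

    All intermediate file names are hypothetical and derived from the sample name.
    """
    single_vcf = f"{sample}.vcf.gz"
    norm_vcf = f"{sample}.norm.vcf.gz"
    return [
        # 1. subset the multi-sample VCF to a single sample
        ["bcftools", "view", "-s", sample, "-Oz", "-o", single_vcf, multi_vcf],
        # 2. normalise against the reference
        ["bcftools", "norm", "-f", ref_fasta, "-Oz", "-o", norm_vcf, single_vcf],
        # index the normalised file so that consensus can read it (assumed step)
        ["bcftools", "index", norm_vcf],
        # 3. combine the reference FASTA and the VCF
        ["bcftools", "consensus", "-f", ref_fasta, "-o", out_fasta, norm_vcf],
    ]


def run_pipeline(commands):
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


commands = build_pipeline("cohort.vcf.gz", "reference.fa", "CAR10", "CAR10.fa")
for cmd in commands:
    print(" ".join(cmd))
```

Calling run_pipeline(commands) would execute the steps; the sketch only prints them so it can be checked without bcftools installed.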
bcftools consensus skips the variants that overlap with the previous variant. I believe this is done by comparing the position of a variant with the end position of the previous variant.
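To make the comparison I have in mind concrete, here is a minimal sketch of that overlap check applied to the three example records above. This is only my understanding of the logic, not the actual bcftools implementation, and the function name find_overlaps is made up for illustration.

```python
def find_overlaps(records):
    """Return the positions of records that start at or before the end of
    the previously kept record.

    records: list of (pos, ref) tuples, sorted by 1-based position.
    """
    overlaps = []
    prev_end = 0  # last reference position covered by the previous kept record
    for pos, ref in records:
        if pos <= prev_end:
            # this record starts inside the previous one, so it would be skipped
            overlaps.append(pos)
            continue
        prev_end = pos + len(ref) - 1
    return overlaps


# POS and REF copied from the example records in this post
records = [
    (14559, "TAAA"),
    (14586, "AATATATATATATATATATATAT"),
    (15129, "C"),
]
print(find_overlaps(records))  # prints []
```

By this check none of the three records overlaps the one before it, which is why I did not expect a "*" at 14586.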
In the case above it does not skip the variant at position 14586 but instead writes a "*" symbol into the output FASTA file. Should it just use the reference sequence at that site instead?
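The behaviour I would have expected looks roughly like the sketch below: a "*" allele leaves the reference bases untouched instead of being written to the output. The toy reference and variants are invented for illustration, apply_variants is a hypothetical helper, and this is not meant to describe how bcftools consensus works internally.

```python
def apply_variants(ref_seq, variants):
    """Apply (pos, ref, alt) variants to ref_seq, with 1-based positions.

    Records whose ALT is "*" and records that overlap an already applied
    variant are skipped, so the reference bases are kept at those sites.
    """
    out = []
    cursor = 0  # 0-based index of the next unconsumed reference base
    for pos, ref, alt in sorted(variants):
        start = pos - 1
        if ref_seq[start:start + len(ref)] != ref:
            raise ValueError(f"REF mismatch at position {pos}")
        if alt == "*" or start < cursor:
            continue  # keep the reference here
        out.append(ref_seq[cursor:start])  # unchanged bases before the variant
        out.append(alt)
        cursor = start + len(ref)
    out.append(ref_seq[cursor:])  # remaining reference after the last variant
    return "".join(out)


# Toy data (hypothetical): a deletion, a "*" record and a SNP
toy_ref = "ACGTACGT"
toy_variants = [(2, "CG", "C"), (5, "A", "*"), (7, "G", "T")]
print(apply_variants(toy_ref, toy_variants))  # prints ACTACTT
```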
Any help would be greatly appreciated.