Hi all,
The VCF 4.1 file format specification states that the POS field is required. But suppose that you compare two genomes, a reference and an assembly, and find a big insertion in the assembly that you can't map to the reference unambiguously. Let me show an example. Let's denote large conserved regions (synteny blocks) by signed numbers.
Reference is: +1 +2 +3 +2 +4
Assembly is: -1 +3 +4 -2 +5 -2
You see that +5 is a unique sequence that is not homologous to any sequence in the reference. But because of the rearrangements, it's very hard to assign +5 an actual position in the reference. This situation is very common in bacteria, even between different strains of the same species. What is the proper way to report it in VCF?
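To make the example concrete, here is a minimal Python sketch of the ambiguity. It assumes the synteny blocks are given as space-separated signed integers, as in the example above. The function names (`parse_blocks`, `assembly_only_blocks`) are made up for illustration and do not come from any existing tool.

```python
def parse_blocks(text):
    """Parse a string such as '+1 -2 +3' into a list of signed integers."""
    return [int(token) for token in text.split()]


def assembly_only_blocks(reference, assembly):
    """Find blocks whose ID (ignoring orientation) is absent from the reference.

    Returns a list of (block, left_neighbour, right_neighbour) tuples, where
    the neighbours are taken from the assembly (None at either end).
    """
    reference_ids = {abs(block) for block in reference}
    result = []
    for index, block in enumerate(assembly):
        if abs(block) not in reference_ids:
            left = assembly[index - 1] if index > 0 else None
            right = assembly[index + 1] if index + 1 < len(assembly) else None
            result.append((block, left, right))
    return result


reference = parse_blocks("+1 +2 +3 +2 +4")
assembly = parse_blocks("-1 +3 +4 -2 +5 -2")

novel = assembly_only_blocks(reference, assembly)
print(novel)  # [(5, -2, -2)]

# Count how often the flanking block ID occurs in the reference.
# More than one occurrence means the flank gives no unique anchor.
flank_id = abs(novel[0][1])
flank_occurrences = sum(1 for block in reference if abs(block) == flank_id)
print(flank_occurrences)  # 2
```

Block +5 is flanked on both sides by -2 in the assembly, and block 2 occurs twice in the reference, so the flanks alone do not give a single candidate POS.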
P.S. The VCF validator from vcftools doesn't permit '.' in the POS column.
I don't think VCF is designed for your use case.
Which solution would you suggest? A custom file format for such cases? Or would it be reasonable to extend the VCF format for this?
fastg? http://fastg.sourceforge.net/FASTG_Spec_v1.00.pdf
Thank you, Jeremy, it is a very interesting link!