Hi vg team,
I'm curious whether vg call can accommodate some level of fuzzy alignment, identifying a fuzzily aligned sequence as a known SV within the pangenome. If this is possible, which parameters can be adjusted to set the threshold?
A large SV spanning hundreds or even thousands of bases is unlikely to be identical at every base across samples if it first appeared in a population a long time ago, because of genetic drift. How would vg call handle a sequence that mostly aligns to a path except for a few bases?
Maxine
You said:
Does this imply that a variant calling pipeline that doesn't perform augmentation cannot identify new loci, but can still assign new alleles? For instance, for a bi-allelic locus (0/1) in the reference pangenome, a sequence that doesn't match either allele 0 or 1 would be assigned allele 2. Is that what you are suggesting?
The criterion that vg call uses to assign alleles is exact sequence identity. If the graph has nested small variants within the SV, they can lead to distinct alleles for the SV in the VCF that vg call creates, regardless of whether you are using augmentation.
What about a sequence that is 99% identical to a certain path in the graph? Despite being so similar, there are a few base mismatches. How would vg handle this situation?
They would be reported as separate alleles.
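To make the exact-identity criterion concrete, here is a minimal Python sketch. It is a toy illustration only, not vg's implementation: the function assign_allele and the example sequences are hypothetical. It also assumes the differing sequence is already available as a candidate allele, which in vg requires those bases to be represented in the graph.

```python
def assign_allele(sequence, alleles):
    """Return the allele index of `sequence`, appending it if it is new.

    Matching is by exact string equality, so a single mismatching base
    is enough to produce a separate allele. (Hypothetical helper, not vg code.)
    """
    if sequence in alleles:
        return alleles.index(sequence)
    alleles.append(sequence)
    return len(alleles) - 1


# Hypothetical site: allele 0 is the reference, allele 1 is a known insertion.
alleles = ["A", "A" + "ACGT" * 25]

# The same insertion with its last base changed (99 of 100 inserted bases match).
near_copy = "A" + "ACGT" * 24 + "ACGA"

print(assign_allele(alleles[1], alleles))  # exact match to the known insertion
print(assign_allele(near_copy, alleles))   # one mismatch, so a new allele index
```

Under this criterion there is no similarity threshold to tune: two sequences are either the same allele or different alleles.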
That's great news. May I ask if there are any rules for determining this sequence? For instance, a sequence with less than 80% similarity is considered a mismatch, while one with more than 80% similarity is assigned a separate allele symbol. Perhaps the rules are complex, but if there are any documents or articles that mention this, please let me know. Thank you!
Ah, sorry, I think I misunderstood your question. I think there are two situations that we need to distinguish, and I'm not fully sure which one you expect:
vg call will only call the reference allele or the SV allele without the nested variants, so the variant will appear to be biallelic. If you want to call the nested variants, you can use vg augment to discover small variants from the reads. If you augment, the site can be reported as a multiallelic SV, where some alleles have very similar sequences.
The vg call algorithm was originally published in this paper, but I don't think there's detailed documentation there. You might also be interested in this tutorial.