2.5 years ago
cmirchan

Hello vg-team,

I have a graph that I created and indexed using:

vg construct -v vars -r ref -a >graph.vg
vg index -x graph.xg graph.vg
vg index -G graph.gbwt -v vars graph.vg


The VCF used for construction has phased genotypes for all 7 chromosomes, so I would expect 14 haplotype threads. However, vg paths reveals many more than that: 945.

 vg paths -g graph.gbwt -x graph.xg -E
...


I see there are two 'main' threads:

_thread_sample_contig_0_x


What are the other threads? And what does the 'x' represent? Are they just parts of the collective thread?

2.5 years ago
glenn.hickey

Ambiguities, conflicts or missing data in the phasing information in the VCF will cause the haplotype threads to be broken up. Adding the -P option to your index -G command to force phasing at unphased genotypes may resolve this.
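To see which haplotypes were broken up, and into how many pieces, the thread names can be grouped by everything except their last field. This is only a rough sketch. It assumes thread names follow the pattern _thread_<sample>_<contig>_<phase>_<fragment>, so that the trailing 'x' is a fragment counter; the function name count_fragments and the sample names in the example are made up for illustration.

```shell
# Count how many fragments each haplotype thread was split into.
# Assumption: names look like _thread_<sample>_<contig>_<phase>_<fragment>.
count_fragments() {
    cut -f1 |                    # keep the name column if a length column follows
    grep '^_thread_' |           # ignore paths that are not haplotype threads
    sed 's/_[0-9][0-9]*$//' |    # drop the trailing fragment index
    sort | uniq -c |             # one line per sample/contig/phase
    awk '{print $2 "\t" $1}'     # print: thread prefix, fragment count
}

# Example run on made-up thread names.
printf '%s\n' \
    '_thread_s1_chr1_0_0' \
    '_thread_s1_chr1_0_1' \
    '_thread_s1_chr1_1_0' | count_fragments
```

On real data the listing from the question would be piped in instead, for example: vg paths -g graph.gbwt -x graph.xg -E | count_fragments. A haplotype with a count above 1 has at least one phase break.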

I remade my index with the -P option, but it still resulted in 945 paths. Is there anything else I could try?

Sometimes haplotypes contain alternate alleles of overlapping variants that make no sense together (under the vg interpretation of the VCF). By default, this causes a phase break in GBWT construction. With option -o, the construction will use the reference allele for the variant that occurs later in the file in such cases. Together with -P, this option will guarantee haplotype paths spanning the entire contig. However, in some cases the paths will end up using edges that do not exist in the graph.
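Put together, the rebuild might look like the sketch below. It reuses the file names from the original post and assumes a vg release whose vg index still accepts -G together with -P and -o; newer releases may handle GBWT construction differently, so check vg index --help first. The GBWT_OPTS variable and the existence checks are only there for illustration.

```shell
# Options discussed above: -P forces phasing, -o resolves overlapping variants.
GBWT_OPTS="-P -o"

# File names (vars, graph.vg, graph.xg) are taken from the original post.
if command -v vg >/dev/null 2>&1 && [ -f graph.vg ] && [ -f graph.xg ]; then
    # Rebuild the GBWT, then list the haplotype threads again.
    vg index -G graph.gbwt -v vars $GBWT_OPTS graph.vg
    vg paths -g graph.gbwt -x graph.xg -E
else
    echo "vg or the graph files were not found; nothing was rebuilt" >&2
fi
```

If the thread count is still higher than two per sample and contig after this, the remaining breaks are probably not caused by unphased genotypes or overlapping variants.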

3 months ago

Hello, I'm leaving this answer here for people in the future who may have the same problem.

I solved it by adding the "--discard-overlaps --force-phasing" arguments to the GBWT construction, as I had an unphased VCF file (Documentation here).

vg paths then showed 20 haplotypes for my 10 samples.