Question

vg gbwt missing sample and haplotype metadata when created using minigraph-cactus with vg construct

0

Entering edit mode

10 weeks ago

cassiwatt ▴ 10

Hello, I've found strange behavior surrounding the vg's gbwt index and wanted to report it and potentially figure out why it occurs.

When I create a genome graph using minigraph-cactus, then use vg construct to create a vg graph and gbwt index, the sample and haplotype data is missing.

Graph creation:

cactus-minigraph ./js graph.txt graph-base.sv.gfa --reference ref1 ref2 ref3
cactus-graphmap ./js graph.txt graph-base.sv.gfa graph-base.paf --outputFasta graph-base.sv.gfa.fa --reference ref1 ref2 ref3
cactus-graphmap-split ./js graph.txt graph-base.sv.gfa graph-base.paf --outDir ./chroms --reference ref1 ref2 ref3
cactus-align ./js ./chroms/chromfile.txt graph-base-chrom-alignments --batch --pangenome --reference ref1 ref2 ref3 --outVG
cactus-graphmap-join ./js --workDir . --vg ./*.vg --hal ./*.hal --outDir . --outName graph --reference ref1 ref2 ref3 --vcfReference ref1 ref2 ref3 --gfa --gbz --vcf --clip 10000 --filter 0
vg construct -p -t 96 -N -r ref1.fasta.gz -v graph.vcf > graph.vg
vg index -p -T -t 96 -b . -x graph.xg -G graph.gbwt graph.vg

Missing haplotypes/samples:

vg gbwt -S -H -L -M graph.gbwt
11 paths with names, 1 samples with names, 1 haplotypes, 11 contigs with names
1
_gbwt_ref

In the above case, the gbwt should include 32 haplotypes/samples, but '_gbwt_ref' is returned as the only sample.

However, when the vg graph indices are created during the final graph construction step in the minigraph-cactus creation pipeline --giraffe clip, this same behavior doesn't occur:

cactus-minigraph ./js graph.txt graph-base.sv.gfa --reference ref1 ref2 ref3
cactus-graphmap ./js graph.txt graph-base.sv.gfa graph-base.paf --outputFasta graph-base.sv.gfa.fa --reference ref1 ref2 ref3
cactus-graphmap-split ./js graph.txt graph-base.sv.gfa graph-base.paf --outDir ./chroms --reference ref1 ref2 ref3
cactus-align ./js ./chroms/chromfile.txt graph-base-chrom-alignments --batch --pangenome --reference ref1 ref2 ref3 --outVG
cactus-graphmap-join ./js --workDir . --vg ./*.vg --hal ./*.hal --outDir . --outName graph --reference ref1 ref2 ref3 --vcfReference ref1 ref2 ref3 --gfa --gbz --vcf --clip 10000 --filter 0 --giraffe clip
vg gbwt -Z graph.gbz -o graph.gbwt

See:

vg gbwt -S -H -L -M graph.gbwt
sagemaker-user@default:~/data$ vg gbwt -S -H -L -M hapgraphv1.0-clip10000.gbwt
52354 paths with names, 32 samples with names, 32 haplotypes, 311 contigs with names
32
line1
line2
...
line32

I'm not sure if this is intended behavior, but I wanted to report it in case others encounter it.

construct gbwt vg minigraph-cactus • 9.2k views

ADD COMMENT • link updated 10 weeks ago by Jouni Sirén ▴ 770 • written 10 weeks ago by cassiwatt ▴ 10

score 2 · Answer 1 · 2025-08-22

If you want to build a graph from a reference and a VCF file, we recommend using vg autoindex. The best practices depend on the data you have and what you are trying to achieve, and it's easy to miss a critical step if you are not familiar with the vg data model.

In your case, you missed option -a in vg construct, which includes the variants as paths in the graph. Without those paths, vg cannot generate haplotype paths from a VCF file. (It also looks like you are using an old version of vg, as you are trying to build the GBWT with vg index instead of vg gbwt.)

That said, you should not use a VCF file as an intermediate step when working with pangenome graphs. The graph built from a reference and a set of variants is not unique outside trivial cases (only simple biallelic variants separated by at least one reference base + some technical assumptions). There are no guarantees that you get the same graph back if convert a graph to a VCF file and then build a new graph from that file.

Additionally, even if you get the same graph, there are no guarantees that the haplotypes are preserved. VCF is a variant-centric format, and it does not specify how to determine the haplotype sequences in regions with overlapping variants. Different tools have their own interpretations that often lead to different sequences.

Therefore, if you are using Minigraph–Cactus for building the graph, you should let it build the indexes as well. Or you can take the GBZ graph from Minigraph–Cactus and use vg autoindex for building the indexes.