Question

How to merge GBZ files

4 months ago

lushjia • 0

I’m working with a very large PGGB GFA file, and building a GBZ from it takes an extremely long time. I’m considering splitting the GFA by chromosome and converting each chromosome to GBZ separately.

Is there a way to merge the resulting per-chromosome GBZ files into a single GBZ file afterward? Or are there some other ways to speed up this gfa to gbz process?

vg • 661 views

updated 4 months ago by Jouni Sirén • written 4 months ago by lushjia

score 2 · Answer 1 · 2025-07-10

You can merge GBZ files for individual chromosomes, assuming that no node-to-segment translation was created during GBZ construction. That means the per-chromosome graphs must use integer identifiers for the nodes, the identifiers must not overlap between graphs, and no node may be longer than 1024 bp. (You can increase the limit from 1024 bp, but then many vg tools cannot use the graph.)
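Before building anything, it may be worth checking these preconditions on the per-chromosome GFA files. The following is a minimal sketch, not part of vg: the function name `check_gfa_nodes` and the file names are hypothetical, and it assumes that segment names must be positive integers and that sequences are stored inline on the S-lines. It keeps every node identifier in memory, so it suits graphs that fit comfortably in RAM.

```python
import sys


def check_gfa_nodes(gfa_paths, max_len=1024):
    """Return a list of problems found in the S-lines of the given GFA files.

    Hypothetical helper: flags segment names that are not positive integers,
    identifiers reused across files, and sequences longer than max_len.
    """
    problems = []
    owner = {}  # node id -> first file in which it was seen
    for path in gfa_paths:
        with open(path) as handle:
            for line in handle:
                if not line.startswith("S\t"):
                    continue  # only segment lines matter here
                fields = line.rstrip("\n").split("\t")
                if len(fields) < 3:
                    problems.append(f"{path}: malformed S-line {line.strip()!r}")
                    continue
                name, seq = fields[1], fields[2]
                if not (name.isascii() and name.isdigit()) or int(name) == 0:
                    problems.append(f"{path}: segment name {name!r} is not a positive integer")
                    continue
                node_id = int(name)
                first = owner.setdefault(node_id, path)
                if first != path:
                    problems.append(f"{path}: node {node_id} also appears in {first}")
                if len(seq) > max_len:
                    problems.append(f"{path}: node {node_id} is {len(seq)} bp (limit {max_len})")
    return problems


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python check_gfa_nodes.py chr1.gfa chr2.gfa chr3.gfa
    for problem in check_gfa_nodes(sys.argv[1:]):
        print(problem)
```

An empty result suggests the graphs are candidates for merging; any reported problem means GBZ construction would likely have needed a translation, in which case the approach below does not apply.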

First you extract the GBWT indexes from the per-chromosome graphs:

vg gbwt -o chrN.gbwt -Z chrN.gbz

Then you merge them using the fast algorithm:

vg gbwt -o merged.gbwt --fast chr1.gbwt chr2.gbwt chr3.gbwt ...

Then you create a version of the whole-genome GFA with nodes and edges but no paths and use it to build the merged GBZ:

grep -v "^P" whole-genome.gfa > graph-only.gfa
vg gbwt -x graph-only.gfa -g merged.gbz --gbz-format merged.gbwt
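For many chromosomes, the vg commands above can be generated rather than typed by hand. This is a rough sketch that only builds and prints the command lines, using exactly the flags shown above; the function name `gbz_merge_commands` and the `chrN.gbz` / `chrN.gbwt` naming scheme are assumptions for illustration. The `grep` step that produces `graph-only.gfa` is not included and would still be run separately.

```python
import shlex


def gbz_merge_commands(chromosomes, graph_gfa="graph-only.gfa", merged_prefix="merged"):
    """Build the vg command lines for extracting, merging, and rebuilding.

    Hypothetical helper: assumes each chromosome NAME has a NAME.gbz input
    and that the path-free GFA already exists as graph_gfa.
    """
    commands = []
    gbwts = []
    for chrom in chromosomes:
        gbwt = f"{chrom}.gbwt"
        # Step 1: extract the GBWT index from each per-chromosome GBZ.
        commands.append(["vg", "gbwt", "-o", gbwt, "-Z", f"{chrom}.gbz"])
        gbwts.append(gbwt)
    # Step 2: merge the per-chromosome GBWTs with the fast algorithm.
    commands.append(["vg", "gbwt", "-o", f"{merged_prefix}.gbwt", "--fast", *gbwts])
    # Step 3: combine the merged GBWT with the path-free graph into a GBZ.
    commands.append([
        "vg", "gbwt", "-x", graph_gfa, "-g", f"{merged_prefix}.gbz",
        "--gbz-format", f"{merged_prefix}.gbwt",
    ])
    return commands


if __name__ == "__main__":
    for cmd in gbz_merge_commands(["chr1", "chr2", "chr3"]):
        print(shlex.join(cmd))
```

The printed lines can be reviewed and pasted into a shell, or each list could be passed to `subprocess.run(cmd, check=True)` once the inputs are in place.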

The actual reason for the slow construction is probably the graph, which breaks some of the assumptions GBWT makes. In particular, there may be heavily collapsed regions where some nodes have a large number of neighbors and most haplotypes visit the nodes a large number of times. That issue with PGGB graphs was already mentioned in the paper, as was the solution. It would require a few weeks of data structure work, which I still haven't found the time to do.

Additionally, the same issue that makes the construction slow with PGGB graphs also makes GBZ slow in those regions. The solution is similar, but it requires even more work.