how to merge gbz files
1
0
Entering edit mode
4 months ago
lushjia • 0

I’m working with a very large PGGB GFA file, and building a GBZ from it takes an extremely long time. I’m considering splitting the GFA by chromosome and converting each chromosome to GBZ separately.

Is there a way to merge the resulting per-chromosome GBZ files into a single GBZ file afterward? Or are there some other ways to speed up this gfa to gbz process?

vg • 661 views
ADD COMMENT
2
Entering edit mode
4 months ago
Jouni Sirén ▴ 800

You can merge GBZ files for individual chromosomes, assuming that no node-to-segment translation was created during GBZ construction. That means all per-chromosome graphs must use non-overlapping integer identifiers for the nodes, and no node is longer than 1024 bp. (You can increase the limit from 1024 bp, but then many vg tools cannot use the graph.)

First you extract the GBWT indexes from the per-chromosome graphs:

vg gbwt -o chrN.gbwt -Z chrN.gbz

Then you merge them using the fast algorithm:

vg gbwt -o merged.gbwt --fast chr1.gbwt chr2.gbwt chr3.gbwt ...

Then you create a version of the whole-genome GFA with nodes and edges but no paths and use it to build the merged GBZ:

grep -v "^P" whole-genome.gfa > graph-only.gfa
vg gbwt -x graph-only.gfa -g merged.gbz --gbz-format merged.gbwt

The actual reason for the slow construction is probably the graph, which breaks some of the assumptions GBWT makes. In particular, there may be heavily collapsed regions where some nodes have a large number of neighbors and most haplotypes visit the nodes a large number of times. That issue with PGGB graphs was already mentioned in the paper, as was the solution. It would require a few weeks of data structure work, which I still haven't found the time to do.

Additionally, the same issue that makes the construction slow with PGGB graphs also makes GBZ slow in those regions. The solution is similar, but it requires even more work.

ADD COMMENT

Login before adding your answer.

Traffic: 3143 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6