Question

How can I parallelize vg autoindex when building a graph reference from VCFs?


8 weeks ago

Ricky ▴ 10

Hi, I'm running vg autoindex --workflow giraffe to build a graph reference from a VCF of my variants, so that I can later align reads to it with giraffe. I tested it with a few variants on one chromosome, and it took a very long time and used a lot of memory. Is it possible to parallelize construction of the full graph reference by chromosome and merge the results into one reference at the end?

I looked through vg combine --help, but it seems to support combining only .vg files, not the .gbz, .dist, and .min files that vg autoindex produces.

What's the best way to scale up creating a graph reference from a VCF using the tool? Would combine still work, or is there another subtool to help here?

Thanks!



score 2 · Accepted Answer · 2024-02-29


8 weeks ago

Jouni Sirén ▴ 360

vg autoindex tries to parallelize everything within the specified thread and (approximate) memory bounds. Some parts will be parallelized by chromosome, while other parts are inherently sequential in the current implementation.
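As a rough illustration of setting those thread and memory bounds, here is a minimal Python sketch that assembles a vg autoindex command line. The file names, the prefix, and the values 32 and "200G" are hypothetical placeholders, and the helper build_autoindex_cmd is made up for this example. The long option names are the ones I believe vg autoindex accepts; check vg autoindex --help for your version before relying on them.

```python
import shlex


def build_autoindex_cmd(ref_fasta, vcf, prefix, threads, target_mem):
    """Assemble a `vg autoindex` argv for the Giraffe workflow (illustrative helper)."""
    return [
        "vg", "autoindex",
        "--workflow", "giraffe",
        "--ref-fasta", ref_fasta,
        "--vcf", vcf,
        "--prefix", prefix,
        "--threads", str(threads),    # upper bound on threads used
        "--target-mem", target_mem,   # approximate memory target, e.g. "200G"
    ]


# Hypothetical inputs; substitute your own files and resource limits.
cmd = build_autoindex_cmd("reference.fa", "variants.vcf.gz", "graph_ref", 32, "200G")

# Print the command instead of running it, so it can be inspected first.
print(shlex.join(cmd))

# To actually run it (requires vg on PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

The memory target is approximate, so on a shared cluster it is probably safer to request somewhat more memory from the scheduler than the value passed to vg.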

How much time and memory is vg autoindex using, and for what? I think building the indexes for a 1000GP graph (~100 million variants, ~5000 haplotypes) should take a day on a server with 32 cores and 256 GB memory. Starting from an HPRC GFA (~100 million nodes, 90 haplotypes), a laptop with 12 cores and 96 GB memory should need a couple of hours.
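One way to get overall numbers for that question is to wrap the run and record wall time and peak memory. The sketch below uses only the Python standard library on a Unix-like system; run_and_measure is a hypothetical helper, and the command it runs here is a small stand-in so the example works without vg installed. It reports totals only, not which indexing stage used the resources.

```python
import resource
import subprocess
import sys
import time


def run_and_measure(cmd):
    """Run cmd; return (wall seconds, peak RSS of the largest waited-for child)."""
    start = time.monotonic()
    subprocess.run(cmd, check=True)
    elapsed = time.monotonic() - start
    # ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return elapsed, peak


# Stand-in command (allocates ~10 MB); replace with your vg autoindex argv.
elapsed, peak = run_and_measure([sys.executable, "-c", "x = bytearray(10_000_000)"])

unit = "bytes" if sys.platform == "darwin" else "KiB"
print(f"wall time: {elapsed:.2f} s, peak RSS: {peak} {unit}")
```

On Linux, running the command under /usr/bin/time -v gives similar information without any Python.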



Thanks, I didn't realize autoindex was already running in parallel. I increased the memory for my task and it was able to complete.

8 weeks ago by Ricky ▴ 10