How to parallelize vg autoindex when creating graph references from VCFs?
7 weeks ago
Ricky ▴ 10

Hi, I'm running vg autoindex --workflow giraffe to create a graph reference from a VCF of variants, so that I can later align reads to it with giraffe. I tested it with a few variants on one chromosome, and it took a very long time and a lot of memory. Is it possible to parallelize construction of a full graph reference by chromosome and merge the results into one reference at the end?

I tried looking through vg combine --help but it seems to only support combining .vg files rather than the .gbz, .dist, and .min files produced using the vg autoindex command.

What's the best way to scale up creating a graph reference from a VCF using the tool? Would combine still work, or is there another subtool to help here?


6 weeks ago
Jouni Sirén ▴ 360

vg autoindex tries to parallelize everything within the specified thread and (approximate) memory bounds. Some parts will be parallelized by chromosome, while other parts are inherently sequential in the current implementation.
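For illustration, the thread and memory bounds mentioned above can be passed on the command line. This is a minimal sketch, not a definitive recipe: the file names, prefix, thread count, and memory target are hypothetical placeholders, and the short options (-r, -v, -p, -t, -M) are written as I understand them from the vg autoindex help text, so check them against vg autoindex --help for your version.

```shell
#!/bin/sh
# Hypothetical inputs and limits; replace with your own values.
REF=reference.fa        # reference FASTA (assumed name)
VCF=variants.vcf.gz     # variant calls (assumed name)
PREFIX=graph_ref        # prefix for the output index files (assumed name)
THREADS=16              # upper bound on threads
TARGET_MEM=64G          # approximate memory target, not a hard limit

# Assemble the command so it can be inspected before running.
CMD="vg autoindex --workflow giraffe -r $REF -v $VCF -p $PREFIX -t $THREADS -M $TARGET_MEM"
echo "$CMD"

# Only run when vg is installed and the input files are present.
if command -v vg >/dev/null 2>&1 && [ -f "$REF" ] && [ -f "$VCF" ]; then
    $CMD
fi
```

Since the memory target is approximate, the job itself may need more headroom than the value given here.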

How much time and memory is vg autoindex using, and for what? I think building the indexes for a 1000GP graph (~100 million variants, ~5000 haplotypes) should take a day on a server with 32 cores and 256 GB memory. Starting from an HPRC GFA (~100 million nodes, 90 haplotypes), a laptop with 12 cores and 96 GB memory should need a couple of hours.
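One way to get those numbers is to run the command under GNU time, which reports wall-clock time and peak resident memory. The sketch below assumes GNU time is installed at /usr/bin/time (it falls back to running the command unmeasured otherwise). The measure helper, the log file name, and the input file names are hypothetical.

```shell
#!/bin/sh
# measure LOGFILE COMMAND [ARGS...]
# Runs COMMAND under GNU time and writes the resource report to LOGFILE.
# Falls back to running COMMAND directly if GNU time -v is not available.
measure() {
    log=$1
    shift
    if [ -x /usr/bin/time ] && /usr/bin/time -v true >/dev/null 2>&1; then
        /usr/bin/time -v -o "$log" "$@"
    else
        "$@"
    fi
}

LOG=autoindex.time.log   # hypothetical log file name

# Only run when vg is installed and the (assumed) input files exist.
if command -v vg >/dev/null 2>&1 && [ -f reference.fa ] && [ -f variants.vcf.gz ]; then
    measure "$LOG" vg autoindex --workflow giraffe \
        -r reference.fa -v variants.vcf.gz -p graph_ref
    # Show wall-clock time and peak memory if a report was written.
    if [ -f "$LOG" ]; then
        grep -E 'Elapsed|Maximum resident set size' "$LOG"
    fi
fi
```

This reports totals for the whole run; attributing them to individual indexing steps would still need the messages vg autoindex prints as it progresses.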


Thanks, I didn't realize autoindex was already running in parallel. I increased the memory for my task and it was able to complete.

