Question

Speed up vg call?

1

16 months ago

colindaven 6.3k

Dear all,

Has anyone found a good way to speed up vg call, by chunking or some other method?

My current chunking gives only a very modest speedup of 0-20%, so it may not be the right approach.

Specifically, I have aligned with vg giraffe and used the following code to chunk the resulting GAM file into sets of 100000 reads.

Commands from a Nextflow script:

vg chunk -t $task.cpus --gam-split-size $params.gam_split_size -a $gam

vg pack -t $task.cpus -x $xg -g $chunked_gam -Q5 -o ${prefix}.aln.pack
vg call -t $task.cpus $xg -k ${prefix}.aln.pack --min-support $params.min_support -s $sample_name > ${prefix}.vcf

I've read lots of docs but am not sure which are up to date.

There is a complicated vg_toil script here, but it dates from 2020 and I don't know whether it is still current, so I'm a bit wary.

vg_toil_script

Thanks

vg_team vg • 1.4k views

ADD COMMENT • link updated 8 months ago by Maxine ▴ 40 • written 16 months ago by colindaven 6.3k

0

vg call ran for 3 days and ultimately failed when it hit the time limit (I submitted the job through Slurm, which enforces a maximum duration). Is this normal?

The command I used is:

vg call graph.gbz -k ${pack_file} -r ${snarls_file} -t 32 > ZYZ288A.giraffe.vcf

The log was silent and the output vcf was empty.

The sizes of the input files are:

-rw-r----- 1 maxine91 maxine91 4.6G Jul 21 01:05 div.12bufo.giraffe.gbz
-rw-r----- 1 maxine91 maxine91 9.5M Jul 25 02:01 div.12bufo.giraffe.gbz.snarls
-rw-r----- 1 maxine91 maxine91 3.8G Jul 25 00:56 ZYZ288A.gbz.pack
-rw-r----- 1 maxine91 maxine91  88G Jul 22 02:05 ZYZ288A.giraffe.mapped.gam

ADD REPLY • link 8 months ago by Maxine ▴ 40

1

When running vg call on .gbz input, you can often see a major speedup by adding -z to limit it to haplotypes present in the .gbz. For example, this makes it 100s of times faster for the HPRC graphs.
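
A minimal sketch of this suggestion, assuming a GBZ graph plus the pack and snarls files used in the command earlier in the thread. The function name and file names are placeholders, not part of vg; the only point is that -z is passed as a bare switch alongside the .gbz graph.

```shell
# Hypothetical wrapper: call variants on a GBZ graph, restricted to the
# haplotypes stored in that GBZ (-z). The VCF is written to stdout.
call_gbz_haplotypes() {
    # $1: GBZ graph, $2: pack file, $3: snarls file, $4: thread count
    vg call "$1" -z -k "$2" -r "$3" -t "$4"
}
```

It could be used as `call_gbz_haplotypes graph.gbz sample.pack graph.snarls 32 > sample.vcf`.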

ADD REPLY • link 8 months ago by glenn.hickey ▴ 520

0

Thanks, I'll try it. But I also have a vg call process that takes xg, xg.pack, and xg.snarls as input. It has also run for 3 days without writing anything to its VCF. No matter how much time elapses, it seems it will never finish, and I have begun to wonder whether the process is stuck. Is this normal? Is there a way to determine whether a process is stuck?

Update on Aug 2nd:

I currently have two instances of vg call running. I have been checking their memory usage every 12 hours, and it has remained virtually unchanged, which adds to my concern that the processes may be stuck. I look forward to your help. Thanks.

ADD REPLY • link 8 months ago by Maxine ▴ 40

0

Interesting idea. My Nextflow code currently looks like the "current code" below. Can I just change $xg and $gbwt to $gbz, add -z, and get the speedup described here?

Edit: it looks like it worked. Runtime on a 25k Arabidopsis test example dropped from 4m14 to 2m38. Thanks!

#current code
VG_FULL_TRACEBACK=1
vg pack -t $task.cpus -x $xg -g $gam -Q5 -o ${prefix}.aln.pack
vg call -t $task.cpus $xg -C 100 -k ${prefix}.aln.pack --min-support $params.min_support -a -r $snarls -g $gbwt -s $sample_name > ${prefix}.vcf

#suggested code
VG_FULL_TRACEBACK=1
vg pack -t $task.cpus -x $gbz -g $gam -Q5 -o ${prefix}.aln.pack
vg call -t $task.cpus $gbz -C 100 -k ${prefix}.aln.pack --min-support $params.min_support -a -s $sample_name -r $snarls -z $gbz  > ${prefix}.vcf

ADD REPLY • link 8 months ago by colindaven 6.3k

1

-z does not take an argument.

vg call graph.gbz -z should be exactly equivalent to vg call graph.xg -g graph.gbwt (i.e. -g should give the same speedup as -z). If you are seeing different runtimes, I suggest double-checking your output.
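
One way to double-check the output is to compare the variant records of the two VCFs while ignoring header lines, which may differ between runs. The sketch below uses only grep and sort; the function names and file names are hypothetical, and the comparison ignores record order.

```shell
# Succeeds (exit status 0) when both VCFs contain the same non-header
# records, regardless of the order in which they appear.
same_vcf_records() {
    a=$(grep -v '^#' "$1" | sort)
    b=$(grep -v '^#' "$2" | sort)
    [ "$a" = "$b" ]
}

# Print the number of non-header records, to catch an empty call set.
count_vcf_records() {
    grep -vc '^#' "$1" || true
}
```

For example, `same_vcf_records xg_gbwt.vcf gbz.vcf && echo "records match"` compares the two runs, and `count_vcf_records gbz.vcf` confirms that the faster run actually produced calls.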

ADD REPLY • link 8 months ago by glenn.hickey ▴ 520

0

To Glenn:

Could you tell me which species the vg team has run vg call on, and how much time and resources those runs needed? That would help me plan my workflow.

I am also considering constructing graphs and calling variants on a per-chromosome basis. Is this theoretically feasible?

Thanks

ADD REPLY • link 8 months ago by Maxine ▴ 40

score 2 · Answer 1 · 2023-03-06

2

13 months ago

glenn.hickey ▴ 520

For bigger datasets, make sure to pass in your snarls (computed with vg snarls) with -r. Sometimes, that is most of vg call's runtime. You can also try using -C to avoid huge snarls. In general, unless your graph is extremely complex, you should not need to chunk it up before running vg call.
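
As a rough sketch of this advice, the snarls can be computed once per graph and then reused with -r for every sample called on that graph. The function names and file names are placeholders, and the -C value is whatever limit suits the graph (100 is simply the value used elsewhere in this thread).

```shell
# Hypothetical helper: compute snarls once for a graph and save them.
compute_snarls() {
    # $1: graph file, $2: output snarls file
    vg snarls "$1" > "$2"
}

# Hypothetical helper: call variants with precomputed snarls (-r) and a
# -C limit to avoid huge snarls. The VCF is written to stdout.
call_with_snarls() {
    # $1: graph, $2: pack file, $3: snarls file, $4: -C value, $5: threads
    vg call "$1" -k "$2" -r "$3" -C "$4" -t "$5"
}
```

A typical sequence would be `compute_snarls graph.xg graph.snarls` once, followed by `call_with_snarls graph.xg sample.pack graph.snarls 100 8 > sample.vcf` for each sample.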

ADD COMMENT • link 13 months ago by glenn.hickey ▴ 520