Question

Population-level strategy for vg call

1

9 months ago

Maxine ▴ 40

Hello VG Team,

I have been wondering whether there is a more efficient way to perform population-level structural variant (SV) calling with the vg calling pipeline.

As I understand it, the vg calling pipeline consists of the following steps: 1) vg construct to create graph.vg; 2) vg index to generate the xg and GCSA indexes; 3) vg map to produce the GAM alignment file; 4) vg augment to create the augmented graph; 5) indexing and mapping against the augmented graph; 6) finally, vg call.
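For reference, the steps listed above might look roughly like the commands below. This is a minimal dry-run sketch that only prints command lines and runs nothing. The file names are placeholders, vg pack is included because recent vg releases compute read support with it before vg call, and the -A output of vg augment is used here in place of remapping; on a whole genome the GCSA index would normally also need a pruning step that is not shown. Flags can differ between vg versions, so check each subcommand's help first.

```python
import shlex

def classic_pipeline(ref, vcf, reads, sample="sample"):
    """Return the vg commands for the listed steps as shell strings (nothing is executed)."""
    q = shlex.quote
    gam = q(f"{sample}.gam")
    aug_gam = q(f"{sample}.aug.gam")
    pack = q(f"{sample}.pack")
    return [
        # 1) build the graph from a reference FASTA and a VCF
        f"vg construct -r {q(ref)} -v {q(vcf)} > graph.vg",
        # 2) build the xg and GCSA indexes
        "vg index -x graph.xg graph.vg",
        "vg index -g graph.gcsa graph.vg",
        # 3) map reads to the graph
        f"vg map -x graph.xg -g graph.gcsa -f {q(reads)} > {gam}",
        # 4) augment the graph; -A writes the reads re-expressed on the augmented graph
        f"vg augment graph.vg {gam} -A {aug_gam} > graph.aug.vg",
        # 5) index the augmented graph
        "vg index -x graph.aug.xg graph.aug.vg",
        # 6) compute read support, then call variants
        f"vg pack -x graph.aug.xg -g {aug_gam} -o {pack}",
        f"vg call graph.aug.xg -k {pack} > {q(sample + '.vcf')}",
    ]

# Placeholder input names, for illustration only.
for cmd in classic_pipeline("ref.fa", "variants.vcf.gz", "reads.fq"):
    print(cmd)
```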

The problem is that this pipeline takes at least 7.5 days for a single sample. Since I need to process many samples, I am considering randomly selecting one sample from each group, running the full pipeline on it, and then using the combined VCF from vg call to construct a new graph.vg. The remaining samples would then be run against this new graph using only steps 1), 2), 3), and 6).
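To make the proposed plan concrete, here is a small sketch of the sample assignment: one randomly chosen representative per group runs every step, and the other samples run the reduced set. The group and sample names are hypothetical, and the step numbers refer to the list in the question.

```python
import random

FULL = (1, 2, 3, 4, 5, 6)  # representative samples: whole pipeline
REDUCED = (1, 2, 3, 6)     # remaining samples: no augment, no second mapping round

def plan_steps(groups, seed=0):
    """Map each sample to the steps it would run, with one random representative per group."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    plan = {}
    for group in sorted(groups):
        samples = sorted(groups[group])
        representative = rng.choice(samples)
        for sample in samples:
            plan[sample] = FULL if sample == representative else REDUCED
    return plan

# Hypothetical groups and sample names.
groups = {"groupA": ["A1", "A2", "A3"], "groupB": ["B1", "B2"]}
for sample, steps in plan_steps(groups).items():
    print(sample, steps)
```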

I would appreciate your input on whether this plan is appropriate. Are there any other methods that I may not be aware of that can efficiently solve this problem?

Additionally, I am curious about the quality of the new SVs generated in step 4). If the quality is not satisfactory, I am thinking of skipping the vg augment step for all samples.

Thank you for your assistance.

Best regards, Maxine

vg • 1.1k views

0

These days most users are opting for vg giraffe instead of vg map for short read mapping. Its speed is closer to what people expect from a tool like bwa mem. You can also get away without augmenting the graph if your primary interest is structural variants. For small variants, you can get better performance by projecting the graph mappings to a linear reference using vg surject and then using DeepVariant (you can see this analysis in the main HPRC paper if you want a model).
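As a rough illustration of this suggestion, the sketch below builds a vg giraffe mapping command and, optionally, a vg surject command that projects the alignments to BAM for a linear-reference caller such as DeepVariant. It only prints the commands. The index file names assume an index prefix of "graph" and may not match what your vg release writes, and the graph passed to vg surject is a placeholder, so treat both as assumptions.

```python
import shlex

def giraffe_commands(reads, prefix="graph", sample="sample", threads=8, surject_graph=None):
    """Return a vg giraffe command and, if requested, a vg surject command (dry run only)."""
    q = shlex.quote
    gam = q(f"{sample}.gam")
    cmds = [
        # Map short reads with giraffe using the GBZ, minimizer, and distance indexes.
        f"vg giraffe -Z {q(prefix + '.giraffe.gbz')} -m {q(prefix + '.min')} "
        f"-d {q(prefix + '.dist')} -f {q(reads)} -t {threads} > {gam}"
    ]
    if surject_graph is not None:
        # Project graph alignments onto the linear reference paths and write BAM.
        cmds.append(
            f"vg surject -x {q(surject_graph)} -b -t {threads} {gam} > {q(sample + '.bam')}"
        )
    return cmds

# Placeholder file names, for illustration only.
for cmd in giraffe_commands("reads.fq", surject_graph="graph.xg"):
    print(cmd)
```

For SV genotyping the GAM can be used directly, so the surject step would only matter for the small-variant route described above.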

Reply • 9 months ago by Jordan M Eizenga ▴ 460

0

I am very interested in using giraffe, but I have run into two problems: the VCF I used to construct the graph is unphased, and, more importantly, vg autoindex --workflow giraffe consistently fails with out-of-memory errors. I therefore have the following questions:

Regarding the unphased VCF, I am curious about the performance of vg giraffe compared to vg map, especially in terms of variant calling quality. Can vg giraffe deliver similar performance despite using an unphased VCF?
As for the out-of-memory problem during vg autoindex --workflow giraffe, is there a way to build the giraffe indexes for a whole genome, similar to the guidelines in the GitHub Wiki page "Working with a whole genome variation graph"?

Reply • 9 months ago by Maxine ▴ 40

1

I'm not sure we've experimented enough with vg giraffe on unphased inputs to give a firm yes or no. My guess is that it depends on the variant density. If your graph frequently has several variants within the span of a read length, you'll probably hurt for the lack of phasing. If not, it might be okay.
Are you working in a constrained memory environment? Neither vg giraffe nor vg autoindex is particularly light on memory use, but I would be surprised if vg autoindex used much more memory than vg giraffe. If you're interested in pursuing a manual pipeline, you can make a graph with vg construct -a and then index it as a GBWT with vg gbwt. There are some suggestions for how to do that in this guide.
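The two routes mentioned here could be sketched as follows; the code only prints commands. The first function builds a vg autoindex call with an optional memory target (-M) and temporary directory (-T), which are options I believe exist in recent releases, so confirm them with vg autoindex --help. The second builds the manual vg construct -a and vg gbwt pair. File names are placeholders, and the remaining giraffe indexes (GBZ, minimizer, distance) are not shown.

```python
import shlex

def autoindex_command(ref, vcf, prefix="graph", threads=8, target_mem=None, tmp_dir=None):
    """Build a vg autoindex command for the giraffe workflow (dry run only)."""
    q = shlex.quote
    parts = [
        "vg autoindex --workflow giraffe",
        f"-r {q(ref)}", f"-v {q(vcf)}", f"-p {q(prefix)}", f"-t {threads}",
    ]
    if target_mem:
        parts.append(f"-M {q(target_mem)}")  # approximate memory target, e.g. "64G"
    if tmp_dir:
        parts.append(f"-T {q(tmp_dir)}")     # where temporary files are written
    return " ".join(parts)

def manual_gbwt_commands(ref, vcf, prefix="graph"):
    """Build the manual route: a graph with alt paths (-a), then a GBWT from the VCF."""
    q = shlex.quote
    graph = q(f"{prefix}.vg")
    return [
        f"vg construct -a -r {q(ref)} -v {q(vcf)} > {graph}",
        f"vg gbwt -x {graph} -o {q(prefix + '.gbwt')} -v {q(vcf)}",
    ]

# Placeholder inputs, for illustration only.
print(autoindex_command("ref.fa", "variants.vcf.gz", target_mem="64G", tmp_dir="/scratch/tmp"))
for cmd in manual_gbwt_commands("ref.fa", "variants.vcf.gz"):
    print(cmd)
```

Whether a GBWT built from an unphased VCF gives giraffe useful haplotype information is exactly the open question in the first point above, so this sketch does not settle it.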

Reply • 9 months ago by Jordan M Eizenga ▴ 460

score 2 · Accepted Answer · 2023-07-24

2

9 months ago

glenn.hickey ▴ 520

vg augment was designed for small variants and has not been tested on SVs. So your only feasible option is to genotype variants already in the graph using vg call on the original (unaugmented) graph. There is currently no way to call novel SVs (i.e., variants not in the graph) with vg.
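For a population, genotyping on the original graph could be sketched as one vg pack and one vg call per sample against the same shared graph. The sketch below only prints commands. Sample names, file names, and the mapping-quality cutoff are placeholders; -v restricts genotyping to the variants in a VCF (which, as far as I know, requires a graph built with vg construct -a), and -a instead reports every snarl so that per-sample VCFs cover the same sites.

```python
import shlex

def genotype_commands(samples, graph="graph.xg", vcf=None, min_mapq=5):
    """samples maps sample name -> GAM path; returns per-sample pack and call commands."""
    q = shlex.quote
    cmds = []
    for name in sorted(samples):
        pack = q(f"{name}.pack")
        # Summarize read support on the shared, unaugmented graph.
        cmds.append(f"vg pack -x {q(graph)} -g {q(samples[name])} -Q {min_mapq} -o {pack}")
        # Genotype variation already present in the graph.
        call = f"vg call {q(graph)} -k {pack} -s {q(name)}"
        call += f" -v {q(vcf)}" if vcf else " -a"
        cmds.append(f"{call} > {q(name + '.vcf')}")
    return cmds

# Hypothetical samples and GAM files.
for cmd in genotype_commands({"S1": "S1.gam", "S2": "S2.gam"}):
    print(cmd)
```

Because every sample is genotyped against the same graph, the graph and its indexes only need to be built once, and the per-sample VCFs can then be merged with a tool such as bcftools merge.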