Hello VG Team,
I have been wondering whether there is a more efficient approach to population-level structural variant (SV) calling with the VG calling pipeline.
Based on my understanding, the VG calling pipeline consists of the following steps: 1) vg construct to create graph.vg; 2) vg index to generate the xg and gcsa indexes; 3) vg map to produce the GAM mapping file; 4) vg augment to create the augmented graph; 5) indexing and mapping against the augmented graph; 6) finally, running vg call.
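For concreteness, the commands I am running look roughly like this (file names are placeholders, and depending on the input VCF some extra options such as -S/-I for symbolic SV alleles may be needed):

    # 1) build the graph from the reference and the known-SV VCF (-a keeps alt-allele paths)
    vg construct -r ref.fa -v known_svs.vcf.gz -a > graph.vg
    # 2) build the xg and gcsa indexes (complex graphs may need vg prune before the gcsa step)
    vg index -x graph.xg graph.vg
    vg index -g graph.gcsa -k 16 graph.vg
    # 3) map the sample's short reads
    vg map -x graph.xg -g graph.gcsa -f reads_1.fq.gz -f reads_2.fq.gz > sample.gam
    # 4) augment the graph with read support to pick up novel variation
    vg augment graph.vg sample.gam -A aug.gam > aug.vg
    # 5) re-index (and, in my current pipeline, re-map) against the augmented graph
    vg index -x aug.xg aug.vg
    # 6) compute read support and call variants
    vg pack -x aug.xg -g aug.gam -o aug.pack
    vg call aug.xg -k aug.pack > sample.vcf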
The issue is that this pipeline takes at least 7.5 days for just one sample. Since I need to process many samples, I am considering randomly selecting one sample from each group, running the full pipeline on those, and then using the combined VCF from vg call to build a new graph.vg. The remaining samples could then use this newly generated graph and only execute steps 1), 2), 3), and 6).
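Concretely, the plan would look something like this (using bcftools merge to combine the pilot VCFs is just my assumption, and the per-sample VCFs would need to be bgzipped and indexed first):

    # combine the VCFs produced by vg call on the pilot samples
    bcftools merge pilot_sampleA.vcf.gz pilot_sampleB.vcf.gz -O z -o combined.vcf.gz
    # build the new graph once from the combined VCF
    vg construct -r ref.fa -v combined.vcf.gz -a > combined.vg
    # then run steps 2), 3) and 6) from the sketch above for each remaining sample, with no augmentation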
I would appreciate your input on whether this plan is appropriate. Are there other approaches I may not be aware of that would solve this problem more efficiently?
Additionally, I am curious about the quality of the new SVs introduced by the augmentation in step 4). If the quality is not satisfactory, I am thinking of skipping the vg augment step for all the samples.
Thank you for your assistance.
Best regards, Maxine
These days most users are opting for vg giraffe instead of vg map for short-read mapping. Its speed is closer to what people expect from a tool like bwa mem. You can also get away without augmenting the graph if your primary interest is structural variants. For small variants, you can get better performance by projecting the graph mappings to a linear reference using vg surject and then using DeepVariant (you can see this analysis in the main HPRC paper if you want a model).

I am highly interested in using giraffe, but I have run into two challenges: the VCF I used to construct the graph is unphased, and, more importantly, the vg autoindex --workflow giraffe process consistently fails due to out-of-memory issues. Therefore, I have two questions: can vg giraffe still be used effectively when the input VCF is unphased, and is there a way around the memory requirements of the indexing step?
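For context, the invocation that runs out of memory looks roughly like this (paths, prefix, and thread count are placeholders):

    vg autoindex --workflow giraffe -r ref.fa -v variants.vcf.gz -p giraffe_index -t 16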
I haven't used vg giraffe on unphased inputs enough to give a firm yes or no. My guess is that it depends on the variant density. If your graph frequently has several variants within the span of a read length, you'll probably hurt for the lack of phasing. If not, it might be okay.

Neither vg giraffe nor vg autoindex is particularly light on memory use, but I would be surprised if vg autoindex used much more memory than vg giraffe. If you're interested in pursuing a manual pipeline, you can make a graph with vg construct -a and then index it as a GBWT with vg gbwt. There are some suggestions for how to do that in this guide.
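To make that route a bit more concrete, here is a rough sketch under the assumptions above (file names are placeholders, and exact option names can vary between vg releases, so check each subcommand's --help; newer releases bundle the graph and haplotype index into a single .gbz):

    # build a graph that keeps alt-allele paths, then index it
    vg construct -r ref.fa -v variants.vcf.gz -a > graph.vg
    vg index -x graph.xg graph.vg
    # build a GBWT from the VCF (-v marks the positional arguments as VCF input)
    vg gbwt -v -x graph.xg -o graph.gbwt variants.vcf.gz
    # build the giraffe minimizer (.min) and distance (.dist) indexes as described in the guide,
    # then map; this flag style follows older releases (newer ones take a single -Z graph.gbz)
    vg giraffe -x graph.xg -H graph.gbwt -m graph.min -d graph.dist \
        -f reads_1.fq.gz -f reads_2.fq.gz > sample.gam
    # for small variants: project onto the linear reference and run DeepVariant on the BAM
    vg surject -x graph.xg -b sample.gam > sample.bam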