Hi everyone,
I am aiming to identify and genotype structural variants in a cattle cohort by mapping short reads against a Minigraph-Cactus pangenome graph. To do so, I am following the personalized pangenome approach, as I do not want to exclude low frequency variants/nodes and using the full graph is not computationally feasible. Specifically, I run the pipeline vg haplotypes -> vg giraffe -> vg filter -> vg pack -> vg snarls -> vg call. As a next step, I would like to merge the resulting per-individual VCF files into a monolithic file for follow-up analyses.
With the "traditional" approach, where variant calling for each individual was done using the same set of common snarls (i.e., those covered by a certain number of haplotypes), comparing individuals was more or less trivial through vg call's -a flag, which would instruct it to genotype all snarls for each individual, including reference calls.
However, because I use the haplotype-sampled graphs and associated snarls, the sets of snarls each individual is genotyped for are not identical. Therefore, there will be a large number of variant sites present in one individual's VCF but not in the other's and vice versa, e.g., when a snarl only appears in one of the haplotype-sampled graphs or when one individual has the reference allele there. When merging such VCFs, these sites will erroneously be populated with missing entries ("."; true call might be reference) for one sample or the other given that reference call information is not present in those VCFs. Even if I use the -a flag, I would still be presented with this issue because of the non-identical set of snarls.
I could probably check for each individual which genomic positions have decent sequencing depth and then replace those positions' missing entries with reference calls, but I was wondering whether there is a more straightforward way. Do you have any best practice recommendations of how to address the issue?
Best,
Tobias
Dear Jouni,
Thanks so much for the quick reply!
Unfortunately, I had already tried your suggestion and
vg callgot stuck. That is, I aligned against the subgraph, and ranvg packandvg snarlsusing the full graph. I then submitted the commandvg call $gbz -a -k $pack -t $threads -r $snarls -s $sample --progress.I get the following output:
After this, there is no progress anymore. The algorithm gets stuck some time after the 32,400,005th top-level snarl (i reran multiple times but it always gets stuck there). In total, the graph contains 35,972,027 such snarls (the full graph contains 49 relatively diverse assemblies).
Would you have any ideas how to troubleshoot this?
Best and thanks again, Tobias
How long did you wait after the
vg callcommand got stuck? And does that also happen if the reads were aligned to the original graph rather than the personalized graph?Minigraph–Cactus often creates snarls that are megabases or even tens of megabases in size. Processing them can take a long time, and large snarls also cause issues with haplotype sampling and pipelines that express the output relative to a linear reference.
We've had some success in breaking large snarls by clipping edges that cross long distances according to reference coordinates. The approach is still experimental. For example, the following commands should clip edges jumping more than 1 Mbp:
You first need to convert the graph from GBZ to a mutable format for
vg clip, and the mutable graph may need a lot of memory. Once you have the output graph, you first need to convert it to GFA before you can build a GBZ graph.And then the output graph is the one you should use for haplotype sampling (and everything else), which means you need to rebuild the
.haplfile (and therefore the.riand.distfiles).Edit: I noticed that you used the term "full graph". By Minigraph–Cactus terminology, the full graph is the one that contains unaligned regions (such as centromeres). Such graphs are not suitable for most vg tools. The graph you should be using as the "original graph" is the default / clipped graph, where unaligned regions longer than a threshold (e.g. 10 kbp) have been clipped.
Thanks for the suggestion! And please excuse the ambiguous terminology. When I was referring to the "full" graph, I was actually speaking about the clipped one output by minigraph-cactus, i.e., "full" as opposed to personalized. That was misleading I agree.
When I run
vg callfor a single sample of depth ~15x, it gets stuck at the position shown above after ~30 min (so, it is quite fast). I had it continue for 2 days after that but there was no progress. I was not able to succesfully align to the original graph (without filtering it) as the algorithm also gets stuck after like 1-2 days of runtime.Anyway, I think the issue is solved now. I first tried your suggestion but the problem persisted (
vg clip -D 1000000removed exactly 25 edges but apparently not the causal one). What worked in the end was usingvg callwith the option-C 100000, i.e., skipping snarls that are larger than 100,000 bp.vg call -C 1000000(excluding snarls of size 1,000,000 bp) did not work by the way. So, the problem seems to be caused by large snarls that are somewhere between 100 kbp and 1 Mbp in size.Best,
Tobias
Please use
Add Replywhen responding to existing comments.SUBMIT ANSWERshould be used only for new answers to original question.If the solution solved your question please consider accepting it (green checkmark) to provide closure to this thread.