Best practice to merge per sample variant calls from personalized pangenomes
3 months ago
Tobias ▴ 40

Hi everyone,

I am aiming to identify and genotype structural variants in a cattle cohort by mapping short reads against a Minigraph-Cactus pangenome graph. To do so, I am following the personalized pangenome approach, as I do not want to exclude low-frequency variants/nodes, and using the full graph is not computationally feasible. Specifically, I run the pipeline vg haplotypes -> vg giraffe -> vg filter -> vg pack -> vg snarls -> vg call. As a next step, I would like to merge the resulting per-individual VCF files into a monolithic file for follow-up analyses.
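
For reference, the per-sample commands look roughly like this (paths, the k-mer counting step, and the specific thresholds are simplified placeholders rather than my exact settings):

# count read k-mers for haplotype sampling (KFF output)
kmc -k29 -m64 -okff -t$threads @${sample}.reads.txt ${sample} $tmpdir
# sample a personalized graph for this individual
vg haplotypes -t $threads -i graph.hapl -k ${sample}.kff -g ${sample}.gbz graph.gbz
# map, filter, compute coverage, find snarls, and call
vg giraffe -t $threads -Z ${sample}.gbz -f ${sample}_R1.fq.gz -f ${sample}_R2.fq.gz > ${sample}.gam
vg filter -t $threads -r 0.90 -fu -m 1 -q 15 -D 999 -x ${sample}.gbz ${sample}.gam > ${sample}.filtered.gam
vg pack -t $threads -x ${sample}.gbz -g ${sample}.filtered.gam -Q 5 -o ${sample}.pack
vg snarls -t $threads ${sample}.gbz > ${sample}.snarls
vg call ${sample}.gbz -t $threads -a -k ${sample}.pack -r ${sample}.snarls -s ${sample} > ${sample}.vcf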

With the "traditional" approach, where variant calling for each individual was done using the same set of common snarls (i.e., those covered by a certain number of haplotypes), comparing individuals was more or less trivial through vg call's -a flag, which would instruct it to genotype all snarls for each individual, including reference calls.

However, because I use the haplotype-sampled graphs and their associated snarls, the sets of snarls each individual is genotyped for are not identical. There will therefore be a large number of variant sites present in one individual's VCF but not in another's, and vice versa, e.g., when a snarl only appears in one of the haplotype-sampled graphs or when one individual carries the reference allele there. When merging such VCFs, these sites end up populated with missing entries (".") for one sample or the other, even though the true call might well be the reference allele, simply because reference-call information is not present in those VCFs. Even with the -a flag, I would still face this issue, because the sets of snarls are not identical.

I could probably check for each individual which genomic positions have decent sequencing depth and then replace the missing entries at those positions with reference calls, but I was wondering whether there is a more straightforward way. Do you have any best-practice recommendations for how to address this issue?
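
To illustrate, the naive merge I have in mind would be something along these lines (bcftools syntax; the --missing-to-ref variant is exactly what I am wary of, since it would also turn genuinely uncovered sites into reference calls):

# compress and index the per-sample VCFs
for v in *.vcf; do bgzip -f $v && tabix -p vcf $v.gz; done
# naive merge: sites absent from a sample's VCF become ./.
bcftools merge -m all *.vcf.gz -Oz -o cohort.vcf.gz
# alternative: assume every missing genotype is 0/0, which is wrong
# wherever a sample simply has no coverage at that site
bcftools merge -m all --missing-to-ref *.vcf.gz -Oz -o cohort.missing-as-ref.vcf.gz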

Best,

Tobias

graph vcf haplotype-sampling vg personalized-pangenome
3 months ago
Jouni Sirén ▴ 800

I don't know if there are best practices for that kind of work yet. However, the haplotype-sampled graph is a subgraph of the original graph. Any alignments in it are also valid in the original graph. That means you should be able to use the sampled graph just for vg giraffe and the original graph for all subsequent commands.

But there could be issues that are not immediately obvious. I haven't tried this myself, and I'm not aware of anyone else doing it either.
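
In terms of commands, I would expect something roughly like this to work (an untested sketch, with sampled.gbz being the personalized graph and graph.gbz the original one):

# align to the personalized graph
vg giraffe -t $THREADS -Z sampled.gbz -f reads_1.fq.gz -f reads_2.fq.gz > sample.gam
# the sampled graph is a subgraph with the same node IDs, so the alignments
# can be interpreted against the original graph in the remaining steps
vg pack -t $THREADS -x graph.gbz -g sample.gam -Q 5 -o sample.pack
vg snarls -t $THREADS graph.gbz > graph.snarls
vg call graph.gbz -t $THREADS -a -k sample.pack -r graph.snarls -s $SAMPLE > sample.vcf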


Dear Jouni,

Thanks so much for the quick reply!

Unfortunately, I had already tried your suggestion and vg call got stuck. That is, I aligned against the subgraph, and ran vg pack and vg snarls using the full graph. I then submitted the command vg call $gbz -a -k $pack -t $threads -r $snarls -s $sample --progress.

I get the following output:

[vg call]: Loading [...]/Eur1.gbz
[vg call]: Loaded graph 
[vg call]: GBZ input detected 
[vg call]: You can restrict the search to GBZ haplotypes, often to the benefict of speed and accuracy, with the -z option 
[vg call]: Applying overlays if necessary (ie input not in XG format) 
[vg call]: Applied overlays 
[vg call]: Loading snarls from [...]/SAMEA6272096.Eur1.fullGraph.snarls 
[vg call]: Loaded snarls
[vg call]: Loading pack file [...]/SAMEA6272096.Eur1.fullGraph.pack 
[vg call]: Loaded pack file 
[vg call]: Computing coverage statistics 
[vg call]: Computed coverage statistics 
[vg call]: Calling top-level snarls 
[vg call]: Processed 100004 top-level snarls
[vg call]: Processed 200003 top-level snarls
[...]
[vg call]: Processed 32300005 top-level snarls
[vg call]: Processed 32400005 top-level snarls

After this, there is no progress anymore. The run gets stuck some time after the 32,400,005th top-level snarl (I reran it multiple times and it always gets stuck there). In total, the graph contains 35,972,027 such snarls (the full graph contains 49 relatively diverse assemblies).

Would you have any ideas on how to troubleshoot this?

Best and thanks again, Tobias


How long did you wait after the vg call command got stuck? And does that also happen if the reads were aligned to the original graph rather than the personalized graph?

Minigraph–Cactus often creates snarls that are megabases or even tens of megabases in size. Processing them can take a long time, and large snarls also cause issues with haplotype sampling and pipelines that express the output relative to a linear reference.

We've had some success in breaking large snarls by clipping edges that cross long distances according to reference coordinates. The approach is still experimental. For example, the following commands should clip edges jumping more than 1 Mbp:

# convert the GBZ to a mutable graph format for editing
vg convert -t $THREADS input.gbz > input.vg
# clip edges that jump more than 1 Mbp along the reference path(s)
vg clip -D 1000000 -c 1000 -m 1000 -P $REFERENCE -t $THREADS input.vg > output.vg
# export to GFA and rebuild the GBZ from it
vg convert -t $THREADS --gfa-out output.vg > output.gfa
vg gbwt --num-jobs $THREADS -g output.gbz --gbz-format -G output.gfa

You first need to convert the graph from GBZ to a mutable format for vg clip, and the mutable graph may need a lot of memory. Once you have the output graph, you then need to convert it to GFA before you can build a GBZ graph from it.

And then the output graph is the one you should use for haplotype sampling (and everything else), which means you need to rebuild the .hapl file (and therefore the .ri and .dist files).
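
For example, roughly along these lines (a sketch; exact options may need adjusting, and the index files are assumed to share the graph's base name):

# distance index and r-index for the clipped graph
vg index -t $THREADS -j output.dist output.gbz
vg gbwt -p --num-threads $THREADS -r output.ri -Z output.gbz
# haplotype information for sampling (uses output.dist and output.ri)
vg haplotypes -t $THREADS -H output.hapl output.gbz
# then sample per-sample graphs from the new .hapl as before
vg haplotypes -t $THREADS -i output.hapl -k sample.kff -g sample.sampled.gbz output.gbz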

Edit: I noticed that you used the term "full graph". In Minigraph–Cactus terminology, the full graph is the one that contains unaligned regions (such as centromeres). Such graphs are not suitable for most vg tools. The graph you should be using as the "original graph" is the default / clipped graph, where unaligned regions longer than a threshold (e.g. 10 kbp) have been clipped.


Thanks for the suggestion! And please excuse the ambiguous terminology. When I referred to the "full" graph, I actually meant the clipped one output by Minigraph-Cactus, i.e., "full" as opposed to personalized. That was misleading, I agree.

When I run vg call for a single sample of ~15x depth, it gets stuck at the position shown above after ~30 min (so it is quite fast up to that point). I let it continue for 2 days after that, but there was no progress. I was also not able to successfully align to the original graph (without filtering it), as that run likewise got stuck after about 1-2 days of runtime.

Anyway, I think the issue is solved now. I first tried your suggestion, but the problem persisted (vg clip -D 1000000 removed exactly 25 edges, but apparently not the causal one). What worked in the end was running vg call with the option -C 100000, i.e., skipping snarls larger than 100,000 bp. vg call -C 1000000 (skipping snarls larger than 1,000,000 bp) did not work, by the way. So the problem seems to be caused by large snarls somewhere between 100 kbp and 1 Mbp in size.
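
For completeness, the call that finally ran through looks roughly like this (same inputs as before, only -C added):

# -C 100000: skip snarls larger than 100 kbp
vg call $gbz -a -k $pack -r $snarls -s $sample -t $threads -C 100000 --progress > ${sample}.vcf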

Best,

Tobias


Please use Add Reply when responding to existing comments. SUBMIT ANSWER should be used only for new answers to the original question.

If the solution solved your problem, please consider accepting it (green checkmark) to provide closure to this thread.

