vg augment/call - Excessively high memory usage for some chromosomes
14 months ago
Andrew ▴ 20

I'm trying to do whole-genome calling with vg, using modified versions of the sample scripts provided on the GitHub wiki pages. I initially tried calling the whole genome without chunking and ran into memory issues even when providing 256 GB of memory - somewhat understandable, considering it's trying to process the whole thing in one go. However, when I chunked my graph with vg chunk and attempted to process the chunks separately, they took wildly different amounts of memory: chr2 and chr3 needed relatively little (10-12 GB), but chr1 ran out of memory when provided up to 160 GB. The same happened with about half of the chromosome chunks - repeated out-of-memory crashes even as I raised the memory allocation (confirmed for chr1, 10, 12, 13, 14, 15, 17, and 20 at 160 GB each before I cancelled the rest of the attempts). I'm testing with 256 GB just to see what happens, but it doesn't seem like this should take so many resources, especially after the graph has been divided up.
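For reference, each chunk goes through the standard augment/pack/call chain from the wiki. A minimal sketch of one chromosome's worth of work (file names and options are illustrative, not my exact script):

# pull out one chromosome plus its alignments (GAM must be sorted and indexed)
vg chunk -x graph.xg -a mapped.sorted.gam -p chr1 -c 10 -b chunk_chr1 -t 8
# embed the read differences into the graph
vg augment chunk_chr1.vg chunk_chr1.gam -A chr1.aug.gam > chr1.aug.vg
# compute read support, then call
vg pack -x chr1.aug.vg -g chr1.aug.gam -Q 5 -o chr1.pack -t 8
vg call chr1.aug.vg -k chr1.pack -t 8 > chr1.vcf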

Here's what it looks like when I check the failed job with seff:

[asherrar@cedar1 scratch]$ seff 60828429_13
Job ID: 60850336
Array Job ID: 60828429_13
Cluster: cedar
User/Group: asherrar/asherrar
State: OUT_OF_MEMORY (exit code 0)
Nodes: 1
Cores per node: 32
CPU Utilized: 2-14:35:33
CPU Efficiency: 36.61% of 7-02:59:12 core-walltime
Job Wall-clock time: 05:20:36
Memory Utilized: 157.35 GB
Memory Efficiency: 98.34% of 160.00 GB

I'm not sure why this is happening - whether it's just a product of my input data's complexity or something else entirely - so any suggestions on what I could do differently would be appreciated. I'm using a graph genome I built from the T2T-CHM13 v2.0 reference and the Simons Genome Diversity Project variant calls against it (a 52 GB vcf.gz dataset), and calling a set of PacBio (I think) long reads against that graph (a 27 GB fasta). The pipeline I used is available here

Correction: the bulk of the issue is in the vg augment step - I'm not sure how to optimize it, and every other part of the vg variant calling pipeline depends on it.
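The only relevant knobs I've spotted are the coverage and quality filters that the long-read calling examples use to keep the augmented graph from blowing up - something along these lines (thresholds illustrative, taken from the wiki's long-read example as I remember it):

# drop edits below 4x coverage, ignore low-quality bases and mappings
vg augment chr1.vg chr1.gam -m 4 -q 5 -Q 5 -A chr1.aug.gam > chr1.aug.vg

but I don't know whether that's enough here.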

vg-toolkit vg-augment vg vg-call vgteam

I had problems with vg call and chunking speed as well - see my earlier thread: Speed up vg call ?

I didn't analyze memory usage, but the chunking process is nowhere near as efficient as reference-based variant calling on a chunked BAM file.

It seems there is a great need for efficient (and multisample) read mapping and SNP and SV calling tools for graph genomes, and I think we will see a lot of progress in the next few years.

Why did you use vg map as opposed to GraphAligner, out of interest?


Yeah, I expect it's not a particularly efficient process, but this is a bit absurd.

Speaking of other tools, I did see a paper on vg SNP-Aware, but it seemed to be running on a much older version of vg - and if memory serves, one from before several formatting changes, which might make my newer graphs and intermediate files incompatible with it. The main reason I'm trying to do everything inside vg is consistency, and because it seemed like a major tool worth learning for pangenomics. If there are better and faster alternatives, I'm all ears - this is far from my expertise and I'd like to learn. I'll definitely take a peek at GraphAligner and see if it works for my needs.


Do you happen to know of any alternative graph variant callers? I'm having trouble finding any that aren't built on old versions of vg (like SNP-Aware).


No, I don't think there are many SNP callers available yet - I've seen people make custom callers using the vg pack file. Another future mapper for short reads might be dragmap (from the DRAGEN hardware platform), but the software version does not map against pangenomes yet.

I've had poor results (slow calling, high RAM usage) even on chromosomes much shorter than human chr1 (75 Mb at most), but I've suspended that work for now.

I think I'll concentrate on building a pangenome from SVs only, rather than full PGGB pangenomes, and on mapping and genotyping (not calling) against it.


I'm currently playing around with long reads, and consequently SV calling as well, if you have any recommendations there. The big goal of what I'm working on right now is looking at the benefits of T2T references and long-read calling for solving "stubborn" genetic disorders, so I'm tinkering with whatever approaches I can get my hands on.


GraphAligner seems good for mapping to graphs, as I mentioned. For SV calling there's odgi - see odgi pav and their nice tutorial for finding presence/absence variations (though it generates BED/TSV, not VCF).

Also note the distinction between calling (novel) SVs and genotyping (known) SVs - something that wasn't clear to me for quite a while.

It seems you can use short reads mapped to a pangenome to genotype known SVs (not call new ones). Tools: vg call (maybe), graphtyper (maybe), paragraph (also maybe). To my knowledge this won't currently work for long reads mapped to a pangenome (lack of callers?) - or maybe only with odgi.
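If you try the vg route, my understanding of the genotyping workflow is roughly the following (a sketch only - file names are illustrative, and the graph has to be constructed from the known-SV VCF with alt paths):

# build the graph with embedded alt paths from the SV VCF
vg construct -r ref.fa -v known_svs.vcf.gz -a > graph.vg
vg index -x graph.xg graph.vg
# ... map short reads to get mapped.gam ...
# compute read support, then genotype the known variants
vg pack -x graph.xg -g mapped.gam -Q 5 -o sample.pack
vg call graph.xg -k sample.pack -v known_svs.vcf.gz > genotypes.vcf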

There's still a massive need for pangenome-capable mappers and variant callers.


Thank you so much! I do recall seeing odgi come up, but I never looked into it - seems like it might be more in line with what I need.

And yeah, I'm definitely looking for calling over genotyping. The big issue with the disorders my lab studies is that we haven't been able to find anything causative so far. We're looking for novel variants that could potentially be explanatory, and since short reads and hg38 have failed on that front, I figured that a more complete reference and longer reads might help (though short reads seem to perform pretty well on T2T-CHM13 from what I've seen, despite the low-complexity regions being present). And with the rise of graph genomics and pangenome references, it made sense to tinker with those as well - even if they're still in their infancy.


I added these tools to a new repo here: https://github.com/colindaven/awesome-pangenomes/blob/main/README.md

Yes, that sounds very challenging. If you have long-read datasets deep enough to assemble your patients' genomes, then the pangenome approach using PGGB and later odgi might be a good way to go. PGGB will give you SNPs out of the box, and while mapping to the resulting pangenome is challenging (maybe only practical with long reads), you can use odgi pav to find candidate SVs. Then potentially verify those on the linear long-read contig assemblies. Tough!
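As a starting point, the PGGB step might look something like this (parameters are illustrative and need tuning to your assemblies):

# all haplotype assemblies in one bgzipped, indexed FASTA
pggb -i assemblies.fa.gz -o pggb_out -t 32 -p 90 -s 5000 -n 10

where -p is the percent identity for mapping, -s the segment length, and -n the number of haplotypes.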


Update: at the very least, chr1 is still failing with 256 GB RAM - this is just excessive. I thought it might be vg working in RAM because it can't find enough scratch space, but as far as I can tell there's no option to assign vg augment or vg map a temp directory, unlike earlier steps in the chain.
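One thing I'm trying, for what it's worth: vg seems to pick up the standard TMPDIR environment variable for its scratch files even where there's no explicit flag, e.g.

export TMPDIR=/scratch/$USER/vg_tmp
mkdir -p "$TMPDIR"

though I can't say yet whether that changes augment's memory behavior at all.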

13 months ago
glenn.hickey ▴ 520

vg augment was developed for short reads and, to my knowledge, never really tested on long reads. It seems to be creating a graph that is too complex for the downstream tools.

vg map's long-read support isn't great either, and I would echo all the recommendations to use GraphAligner instead (at least until vg giraffe gets a long-read mapping mode).

So if you want to call novel variants using vg and long reads, your only choice right now is to map them with GraphAligner, project them back onto your reference (vg surject), and then use a linear-reference-based variant caller on the result.
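A rough sketch of that pipeline (options are representative, not a tested recipe - swap in whichever linear caller suits your variant types):

# map long reads to the graph, writing GAM
GraphAligner -g graph.gfa -f reads.fasta -a mapped.gam -x vg -t 16
# project the alignments onto the reference paths as BAM
vg surject -x graph.xg -t 16 -b mapped.gam > mapped.bam
samtools sort -@ 16 -o mapped.sorted.bam mapped.bam
samtools index mapped.sorted.bam
# e.g. sniffles for SVs on the linear alignments
sniffles --input mapped.sorted.bam --vcf svs.vcf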


Thanks for the suggestion, I'm trying that out right now - though vg surject is complaining about GraphAligner's larger alignments, as well as about the graph not having "reference-sense paths" (which I assume has to do with how the graph was built, but I don't know what I should be doing differently).

warning:[vg::get_sequence_dictionary] No reference-sense paths available in the graph; falling back to generic paths.
warning[vg::Surjector]: Refusing to perform very large alignment against 104048 bp strand split subgraph for read ac826a06-ff7c-4246-880a-9005ebf15f6c; suppressing further warnings.

The output BAM I got was significantly smaller than the corresponding BAM from mapping with minimap2 (12 GB versus 45 GB), so I'm worried that a lot of reads are being dropped. Any recommendations on what I can do to combat this?
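For reference, this is how I've been checking which paths surject is targeting, on the assumption that the warning means it's treating every generic path as a potential reference (options per my reading of the surject docs; the grep filter is made up - my non-reference paths happen to share a sample prefix):

# list the path names embedded in the graph
vg paths -x graph.xg -L
# restrict surjection to the actual reference paths
vg paths -x graph.xg -L | grep -v '^SGDP' > ref_paths.txt
vg surject -x graph.xg -F ref_paths.txt -t 16 -b mapped.gam > mapped.bam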


Yeah, I'm looking through the settings and documentation and I'm still not sure what I need to adjust to keep this from throwing out three-quarters of my data - any assistance would be much appreciated.
