Question

Java -Xmx option does not limit memory usage in the ImputePipelinePlugin?

18 months ago

twrl8 • 0

Hello!

I am currently trying to run the pathfinding step in the PHG pipeline (v1.2) by using the -ImputePipelinePlugin -imputeTarget pathToVCF options, but might be running into memory issues.

I used the command in this way:

singularity exec -B /netscratch:/netscratch phg_1_2.simg /tassel-5-standalone/run_pipeline.pl -Xmx150G -debug -configParameters /PHG/pathfinding_config.txt -ImputePipelinePlugin -imputeTarget pathToVCF -endPlugin

I might misunderstand this, but should the -Xmx option not limit the amount of memory the job can use? After about a day of running, my job now seems to use 320GB of memory, and I am worried this might increase even more and potentially reach the memory limit of the machine I'm using or cause other people's jobs running on it to crash.

Is there a way to estimate roughly how much memory this job will use in the end?
E.g. by looking at the size of the pangenome fasta (78GB). I am currently running imputation for one sample (paired-end reads, 2 gzipped fastq files of 30GB each, ~430 million reads). In the config file I chose numThreads=70 (but did not include any Xmx parameter there).

Does someone have prior experience with this?
Many thanks in advance!!

PHG phg • 1.4k views

ADD COMMENT • link 18 months ago by twrl8 • 0

Based on the program name, run_pipeline.pl appears to be a Perl script. The option you are referring to is for Java. Is that Perl script calling some Java code? Otherwise including that option does nothing for the Perl code, unless it is required for Singularity (not a user myself).

ADD REPLY • link 18 months ago by GenoMax 146k

Does the amount of memory keep increasing, or does it start high and remain stable? Do you have a log file you can post? That may give us information on what is allocated by/for Singularity vs what is allocated for the PHG java code.
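
If it helps to answer that, here is a minimal sketch for sampling the resident memory of a single process. It assumes a Linux host with /proc available; the helper name rss_kb is made up for this example. Running it against the java PID and the minimap2 PID separately should show which of the two is growing.

```shell
# Resident set size (kB) of one process, read from /proc (Linux only).
rss_kb() {
  awk '/^VmRSS:/ {print $2}' "/proc/$1/status"
}

# Example: sample this shell once. For the real job, substitute the PID of
# the java or minimap2 process (e.g. from `pgrep -f minimap2`) and repeat
# the call every few minutes to see whether the value keeps growing.
echo "$(date +%T) pid=$$ rss_kb=$(rss_kb $$)"
```

A value that climbs for minimap2 while the java process stays flat would point to the aligner rather than the JVM heap.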

ADD REPLY • link 18 months ago by lcj34 ▴ 420

When it started it relatively quickly went up to 200GB, then steadily increased to now 327GB. So yes, it still seems to be increasing.

Do you mean the console output? It has already started minimap2, so that might be what is using so much memory?

(Apologies, the output is very long, but it just continues increasing the number of Processed alignments. Up to 1946000000 so far.)

ADD REPLY • link 18 months ago by twrl8 • 0

Can you trim some of the log output on GitHub gist? We get the idea of what is happening.

minimap is using -t 126 (126 threads) so that may also have something to do with memory usage.

ADD REPLY • link 18 months ago by GenoMax 146k

Done. Many apologies!
And thank you for checking.

Yes, there seems to be something up with that.

ADD REPLY • link 18 months ago by twrl8 • 0

score 0 · Answer 1 · 2023-03-22

18 months ago

zrm22 ▴ 40

The -Xmx flag will limit the JVM heap space for the Java process called within run_pipeline.pl. The issue here is that the ImputePipelinePlugin needs to execute minimap2, which runs as a separate system process outside the JVM, so the heap limit does not apply to it. To my understanding, minimap2 will use all available RAM if it needs to.

So I think you have a few options:

Lower the number of threads. This should make minimap2 use less RAM, but you will still need to load the index file, which gets fairly large.
Limit the memory allocated to the Singularity container (see the Singularity documentation). This should force anything run within the container to stay within the limit you request. However, once it hits that cap, the job will likely stop.
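
As a rough sketch of the second option: it assumes a Singularity 3.x installation that supports the --apply-cgroups flag (check `singularity exec --help`; depending on the installation this may require root) and that the cgroups file takes a [memory] limit in bytes, as in the example in the Singularity documentation. The helper names and the 200 GiB figure are illustrative only, not recommendations.

```shell
# Convert GiB to bytes; the cgroups memory limit is expressed in bytes.
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Print a cgroups config that caps container memory at the given GiB.
make_cgroups_toml() {
  printf '[memory]\n    limit = %s\n' "$(gib_to_bytes "$1")"
}

# Show the config for a hypothetical 200 GiB cap.
make_cgroups_toml 200

# Assumed usage (not run here):
#   make_cgroups_toml 200 > cgroups.toml
#   singularity exec --apply-cgroups cgroups.toml -B /netscratch:/netscratch phg_1_2.simg ...
```

The cap has to cover both the JVM heap set with -Xmx and whatever minimap2 needs, so setting it below the -Xmx value would likely just make the job fail earlier.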

Just a note: we are currently investigating and implementing a new version of the Fastq -> ReadMapping file step, which is likely what you are running into here. This new version completely bypasses minimap2 and uses k-mers to figure out the read mappings. Initial testing is very promising: RAM usage is far lower (10-20GB), speed is very good (1-2 minutes for a 2-3x WGS paired-end fastq pair), and the results are close enough to what minimap2 provides that the path finding is nearly identical. Hopefully this will be included in the next version of the PHG.

ADD COMMENT • link 18 months ago by zrm22 ▴ 40

Ahh thank you!

I thought the -Xmx parameter would be passed to the downstream commands called by the plugin. Then I think I definitely need to limit Singularity, since I can't take up all the memory on this machine.

Regarding the thread number: as GenoMax pointed out, minimap2 uses 126 threads. This is 2 fewer than my machine has, which would fit well with the documentation saying:

The number of threads that will be used to impute individual paths is numThreads - 2 because 2 threads are reserved for other operations.

However, in the config file I set numThreads=70, so could there be something going wrong, or something I didn't set, that prevents this parameter from being passed to minimap2?

That update does sound very enticing! I need to do this for a lot more samples, and the previous test I started ran for over a week before crashing due to memory (since someone else was also using the machine heavily), so increasing the speed while reducing memory requirements sounds fantastic.
I apologise, this is probably unfair to ask since these things simply take their time, but is there a rough idea when that next update would be published?

ADD REPLY • link 18 months ago by twrl8 • 0

It looks like the ImputePipelinePlugin does not use the numThreads option for the minimap2 run, but rather uses the number of threads on the machine minus 2 (numThreadsOnMachine - 2), as mentioned in the documentation. My intuition is that you can likely lower the number of CPUs that Singularity has access to, and that might do what you need. I will add a ticket for us to add a parameter for the minimap2 runs. It would definitely be nice to allow the user to change this easily.
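
A minimal sketch of that idea, under two assumptions that are not confirmed here: that the plugin derives the thread count from the CPUs visible to the process, and that pinning the job with taskset is acceptable on your machine. The helper name minimap2_threads and the 0-31 core range are made up for illustration.

```shell
# Thread count minimap2 would get if the plugin uses "visible CPUs - 2".
minimap2_threads() {
  echo $(( $1 - 2 ))
}

# nproc reports the CPUs available to the current process, so it reflects
# any CPU affinity set with taskset.
visible_cpus=$(nproc)
echo "visible CPUs: ${visible_cpus}, expected minimap2 threads: $(minimap2_threads "${visible_cpus}")"

# Assumed usage (not run here): pin the whole job to 32 cores so that only
# 32 CPUs are visible to everything started inside the container.
#   taskset -c 0-31 singularity exec -B /netscratch:/netscratch phg_1_2.simg ...
```

On the 128-CPU machine described above this arithmetic gives the 126 threads seen in the minimap2 command line. Whether the pinned run actually ends up with fewer threads is worth checking in the log before relying on it.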

As for the timeline for the next update, my goal is to get this module released in the coming months. If the algorithm has been fully tested and works, we should have it out by the end of summer at the latest. I think we can likely have it ready for other people to test in April, but more pressing things may come up, which would delay it.

ADD REPLY • link 18 months ago by zrm22 ▴ 40

Thank you very much! I will try using the singularity options.

Thank you for that as well. I will keep an eye on Docker Hub for newer versions, since this could really help me. I think Docker Hub shows the code added, though is there anywhere to see which functionalities it brings?

ADD REPLY • link 18 months ago by twrl8 • 0