Question: How much RAM / how many CPUs should I allocate for Mutect2?
jrleary (Lineberger Comprehensive Cancer Center) wrote, 5 months ago:

I'm running Mutect2 on some WES data. The .bam file is 4.7G, and I'm comparing it against the hg38 reference genome. I allocated 8 CPUs and 90G of memory using SLURM, but progress has been very slow. If I wanted the job to complete for a single sample within ~24 hours, what sort of CPU and memory allocation should I be using?

If parts of this pipeline are single-threaded, there is not much you can do to speed things up.

— genomax, 5 months ago

According to this post on the GATK forums, Mutect2 does not support multithreading. With 100G of RAM, it took 4.35 hours to process the first chromosome. This is with GATK v4.1.2, which supposedly has "significant speed improvements."
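For reference, memory for GATK tools is typically capped through the JVM heap via the gatk wrapper's --java-options flag. A minimal sketch of that kind of invocation, with hypothetical file names and the heap set a little below the cluster allocation to leave headroom:

    # Cap the JVM heap so Mutect2 stays inside the SLURM memory request.
    # Reference, BAM, and output names here are placeholders.
    gatk --java-options "-Xmx90g" Mutect2 \
        -R hg38.fasta \
        -I tumor.bam \
        -O sample.unfiltered.vcf.gz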

— jrleary, 5 months ago
steve (United States) wrote, 5 months ago:

MuTect2 has historically not run multi-threaded; in fact, you are discouraged from enabling multithreading options with it (some GATK engine multithread options were still accessible but did not work). Yes, it is excruciatingly slow to run. For memory, I only ever used 12GB for targeted exome sequencing samples, so you might start around there and increase as needed. Since it runs single-threaded, you should not need more than 1 CPU allocated.
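A minimal SLURM header matching that advice might look like the following sketch (job name and wall-time limit are hypothetical):

    #!/bin/bash
    #SBATCH --job-name=mutect2      # hypothetical job name
    #SBATCH --cpus-per-task=1       # Mutect2 runs single-threaded
    #SBATCH --mem=12G               # starting point for exome samples; raise if needed
    #SBATCH --time=24:00:00         # hypothetical wall-time limit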

The best way to speed up MuTect2 is to run multiple instances of it at once, using the --intervals option to supply a .bed file of genomic regions for each instance to analyze. In this way, you can break a target list of ~10,000 regions into 100-region chunks to be run in parallel (example here, script here). This gives a massive speed increase, but you will likely want some kind of pipeline orchestration framework to manage it, since it produces a huge number of cluster jobs and then a huge number of resulting .vcf files that must be processed and merged afterwards. That is the technique I used in my workflow here: https://github.com/NYU-Molecular-Pathology/NGS580-nf/blob/3ba2f970c3fbee56080ba60727f7bf43cb1be3b2/main.nf#L2301-L2359
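As a rough sketch of that chunking pattern (not the linked scripts themselves; file names are illustrative, GNU split is assumed, and on a real cluster each chunk would be submitted as its own job rather than looped over sequentially):

    # Split the target BED file into 100-region chunks (requires GNU split).
    split -l 100 -d --additional-suffix=.bed targets.bed chunk_

    # One Mutect2 instance per chunk; a plain loop here for brevity,
    # where each iteration would normally be a separate cluster job.
    for bed in chunk_*.bed; do
        gatk Mutect2 \
            -R hg38.fasta \
            -I tumor.bam \
            --intervals "$bed" \
            -O "${bed%.bed}.vcf.gz"
    done

    # Merge the per-chunk VCFs back into a single file.
    gatk MergeVcfs \
        $(for v in chunk_*.vcf.gz; do printf -- '-I %s ' "$v"; done) \
        -O merged.vcf.gz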
