Question

How much RAM / how many CPUs should I allocate for Mutect2?

0

3.9 years ago

jrleary ▴ 210

I'm running Mutect2 on some WES data. The .bam file is 4.7G, and I'm calling variants against the hg38 reference genome. I allocated 8 CPUs and 90G of memory using Slurm, but progress has been very slow. If I want the job to complete for a single sample within ~24 hours, what sort of CPU and memory allocation should I be using?

WES • 3.1k views

updated 3.9 years ago by steve ★ 3.5k • written 3.9 years ago by jrleary ▴ 210

2

If there are parts of this pipeline that are single-threaded, there is not much you can do to speed them up by allocating more CPUs.

3.9 years ago by GenoMax 141k

0

According to this post on the GATK forums, Mutect2 does not support multithreading. With 100G of RAM, it took 4.35 hours to process the first chromosome. This was with GATK v4.1.2, which supposedly has "significant speed improvements."

3.9 years ago by jrleary ▴ 210

score 0 · Answer 1 · 2020-05-30

MuTect2 has historically not run multi-threaded; in fact, you are discouraged from enabling multithreading options with it (some GATK engine multithreading options were still accessible but did not work). Yes, it is excruciatingly slow to run. For memory, I only ever used 12GB for targeted exome sequencing samples, so you might start around there and increase as needed. Since it's running single-threaded, you should not need more than 1 CPU allocated.
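To make that allocation concrete, here is a minimal Python sketch that builds a Slurm submission for one single-threaded Mutect2 run with 1 CPU and 12G of memory. The file names, the 24-hour time limit (taken from the question), and the 2G gap between the Slurm memory request and the Java heap are illustrative assumptions, not tested recommendations; adjust them for your own cluster.

```python
import shlex


def mutect2_sbatch_args(bam, reference, out_vcf, mem_gb=12, cpus=1, heap_margin_gb=2):
    """Build an sbatch argument list for a single Mutect2 run.

    heap_margin_gb is an assumed safety margin so the Java heap (-Xmx)
    stays below the memory requested from Slurm.
    """
    heap_gb = max(1, mem_gb - heap_margin_gb)
    gatk_cmd = [
        "gatk", "--java-options", f"-Xmx{heap_gb}g",
        "Mutect2",
        "-R", reference,
        "-I", bam,
        "-O", out_vcf,
    ]
    return [
        "sbatch",
        f"--cpus-per-task={cpus}",   # Mutect2 runs single-threaded
        f"--mem={mem_gb}G",          # starting point; increase as needed
        "--time=24:00:00",           # assumed limit, from the 24 hour target
        f"--wrap={shlex.join(gatk_cmd)}",
    ]


if __name__ == "__main__":
    # Hypothetical file names, printed rather than submitted.
    print(shlex.join(mutect2_sbatch_args("sample.bam", "hg38.fa", "sample.vcf.gz")))
```

The script only prints the command; passing the list to `subprocess.run` would actually submit it.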

The best way to speed up MuTect2 is to run multiple instances of it at once, using the --intervals option to give each instance a .bed file of genomic regions to analyze. In this way, you can break up a target list of ~10,000 regions into 100-region chunks to be run in parallel (example here, script here). This will give you a massive speed increase, but you will likely want to use some kind of pipeline orchestration framework to manage it, since it results in a huge number of cluster jobs and then a huge number of .vcf files that need to be processed and merged afterwards. That is the technique that I used in my workflow here: https://github.com/NYU-Molecular-Pathology/NGS580-nf/blob/3ba2f970c3fbee56080ba60727f7bf43cb1be3b2/main.nf#L2301-L2359
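As a rough illustration of the chunking idea (not the code from the linked workflow), the Python sketch below splits a target .bed file into chunks of 100 regions and builds one Mutect2 command per chunk. The function names, chunk file naming, and output layout are all hypothetical; `-L` is the short form of `--intervals`.

```python
from pathlib import Path


def read_bed_regions(bed_path):
    """Return the region lines of a BED file, skipping blanks and header lines."""
    regions = []
    with open(bed_path) as handle:
        for line in handle:
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            regions.append(line.rstrip("\n"))
    return regions


def write_bed_chunks(regions, out_dir, chunk_size=100):
    """Write consecutive groups of chunk_size regions to numbered BED files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    for start in range(0, len(regions), chunk_size):
        path = out_dir / f"chunk_{start // chunk_size:04d}.bed"
        path.write_text("\n".join(regions[start:start + chunk_size]) + "\n")
        chunk_paths.append(path)
    return chunk_paths


def mutect2_command(bam, reference, chunk_bed, out_dir):
    """Build the Mutect2 command for one chunk; one command per cluster job."""
    out_vcf = Path(out_dir) / (Path(chunk_bed).stem + ".vcf.gz")
    return [
        "gatk", "Mutect2",
        "-R", str(reference),
        "-I", str(bam),
        "-L", str(chunk_bed),   # restrict this instance to the chunk's regions
        "-O", str(out_vcf),
    ]
```

Each command would then be submitted as its own cluster job, and the per-chunk .vcf files still have to be merged afterwards (GATK's MergeVcfs is one option), which is the bookkeeping a pipeline framework handles for you.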