I'm running Mutect2 on some WES data. The .bam file is 4.7 GB, and I'm comparing it against the hg38 reference genome. I allocated 8 CPUs and 90 GB of memory using SLURM, but progress has been very slow. If I want the job to complete for a single sample within ~24 hours, what sort of CPU and memory allocation should I be using?
MuTect2 has historically not run multi-threaded; in fact, you are discouraged from enabling multithreading options with it (some GATK engine multithreading options were still accessible but did not work). Yes, it is excruciatingly slow to run. For memory, I only ever needed 12 GB for targeted exome sequencing samples, so you might start around there and increase as needed. Since it runs single-threaded, you should not need more than 1 CPU allocated.
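A minimal SLURM submission sketch along those lines (the reference, BAM, and output paths are placeholders for your own files; keep the Java heap a bit below the SLURM memory request so the JVM has headroom):

```shell
#!/bin/bash
#SBATCH --cpus-per-task=1      # Mutect2 runs single-threaded
#SBATCH --mem=14G              # a little above the 12G Java heap
#SBATCH --time=24:00:00

# Placeholder paths -- substitute your own reference and tumor BAM
gatk --java-options "-Xmx12g" Mutect2 \
    -R hg38.fa \
    -I tumor.bam \
    -O sample.unfiltered.vcf.gz
```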
The best way to speed up MuTect2 is instead to run multiple instances of it at once, using the
--intervals option to supply a .bed file of genomic regions for each instance to analyze. In this way, you can break a target list of ~10,000 regions into 100-region chunks to be run in parallel (example here, script here). This gives a massive speed increase, but you will likely want some kind of pipeline orchestration framework to manage it, since it results in a huge number of cluster jobs, and then a huge number of resulting .vcf files that need to be processed and merged afterwards. That is the technique I used in my workflow here: https://github.com/NYU-Molecular-Pathology/NGS580-nf/blob/3ba2f970c3fbee56080ba60727f7bf43cb1be3b2/main.nf#L2301-L2359
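A rough sketch of the chunking step, using `split` from coreutils (the toy BED, file names, and reference path are stand-ins; in practice you would submit each generated command with sbatch rather than echo it, and merge the per-chunk VCFs afterwards, e.g. with GATK MergeVcfs or bcftools concat):

```shell
# Toy 250-region BED as a stand-in for a real ~10,000-region target list
for i in $(seq 1 250); do
  printf 'chr1\t%d\t%d\n' $((i*1000)) $((i*1000+500))
done > targets.bed

# Split into 100-region chunk files: chunk_aa, chunk_ab, chunk_ac, ...
split -l 100 targets.bed chunk_

# One Mutect2 job per chunk; each command would be wrapped in an
# sbatch script on a real cluster (paths here are placeholders)
for bed in chunk_*; do
  echo gatk Mutect2 -R hg38.fa -I tumor.bam -L "$bed" -O "${bed}.vcf.gz"
done
```

Each chunk then runs in its own single-CPU, ~12 GB job, so wall-clock time drops roughly in proportion to the number of chunks your cluster can run concurrently.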