Bogdan • 8.1 years ago
Dear all,
we have just set up a cluster with 4 nodes (128 GB RAM and 32 CPUs per node). Could you let me know what the optimal configuration (RAM/CPU) would be for running GATK/MuTect on a node, or any other variant-calling software (such as Strelka, VarScan, SomaticSniper)? Many thanks,
-- bogdan
I can't give you any hard numbers, but I think you will be memory-bound before you are CPU-bound. I have 64 GB of RAM and could run 4 instances of GATK before maxing out (different components of the pipeline use more or less memory, but 4 seemed a good number for essentially every step). However, that is with the default configuration, running every step naively in parallel. There is plenty of room for optimization by changing GATK/Picard parameters and by using more sophisticated pipelining so that high-memory jobs run alongside low-memory jobs.
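As a rough capacity check, the sizing logic above (memory-bound before CPU-bound) can be sketched as a small helper. The 16 GB-per-instance figure is an assumption back-derived from "4 instances in 64 GB", not a documented GATK requirement; measure your own steps before trusting it.

```python
# Sketch: how many concurrent variant-calling jobs fit on one node,
# assuming the workload is memory-bound (per-job RAM dominates).
# Per-job numbers below are illustrative, not GATK-documented values.

def max_concurrent_jobs(node_ram_gb, node_cpus, job_ram_gb, job_threads=1):
    """Return how many jobs fit, taking the tighter of the RAM and CPU limits."""
    by_ram = node_ram_gb // job_ram_gb
    by_cpu = node_cpus // job_threads
    return min(by_ram, by_cpu)

# My machine: 64 GB RAM, 4 instances -> roughly 16 GB per instance (assumed).
assumed_job_ram_gb = 16

# The cluster in the question: 128 GB RAM and 32 CPUs per node.
print(max_concurrent_jobs(128, 32, assumed_job_ram_gb))  # -> 8, RAM-bound
```

Under these assumptions a 128 GB node is still RAM-bound at 8 jobs, with 24 CPUs idle, which is why tuning memory per job matters more than CPU count here.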
Also, do not neglect the amount of disk space you'll need! Other Biostars users have commented that GATK uses a ton of space in temp files, but this can be overcome by diverging from the best practices and piping tools together. Furthermore, recent versions of the HaplotypeCaller include some element of BQSR and indel realignment built in, so those separate steps can perhaps be skipped without much difference to the final SNP calls. I haven't done either of those things myself, but the upshot is that you will probably start with a pipeline that runs 8-9 jobs in parallel per node, and you will be able to tune that number upward to maximize your resources as you learn more about your data and how these tools work.
Thanks a lot, John, for sharing your experience with GATK!