BaseRecalibrator takes forever to run. Any suggestions?
0
0
Entering edit mode
15 months ago
khorms ▴ 210

Hello, I am trying to run BaseRecalibrator tool from GATK package and it takes forever (more than 4 days per one bam file). The command I'm using is:

gatk BaseRecalibrator -I NG-01_1_S1_dedup_bwa.bam -R /rumi/shams/genomes/hg38/hg38.fa --known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --known-sites 1000G_phase1.snps.high_confidence.hg38.vcf.gz --known-sites Homo_sapiens_assembly38.dbsnp138.vcf -O NG-01_1_S1_dedup_bwa_BSQR.table


(I run it through Conda installation of GATK (link), which shouldn't matter)
I've googled a lot about it; it looks like there were a lot of discussions on this subject on GATK forums but for some reason the GATK forum webpages are not available anymore.
As far as I know BaseRecalibrator is not parallelizable unless I run it with Spark. However, the Spark version of the program (BaseRecalibratorSpark) is in beta version so I am cautious about using it.
The bam files I run it on are rather large (~40G each); I run 10 commands in parallel on a server with 88 cores and 400G RAM; the processes have been running for 4 days each and they are still not done. However, it looks like generally BaseRecalibrator can run in ~5 hours per exome (for example, @Nicolas Rosewick's comments in this post)
Any recommendations on how can I speed it up?

gatk whole exome • 595 views
0
Entering edit mode

Is it still running? Use top to check. Maybe it got killed and the node is running in idle mode without actually terminating the submitted job. Maybe there is a massive I/O bottlebeck?

0
Entering edit mode

yes, they are certainly running, I am checking every now and then haha
How would you detect an I/O bottleneck? I think this could be the case potentially