Hello, I am trying to run the BaseRecalibrator tool from the GATK package, and it takes forever (more than 4 days per BAM file). The command I'm using is:
gatk BaseRecalibrator -I NG-01_1_S1_dedup_bwa.bam -R /rumi/shams/genomes/hg38/hg38.fa --known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --known-sites 1000G_phase1.snps.high_confidence.hg38.vcf.gz --known-sites Homo_sapiens_assembly38.dbsnp138.vcf -O NG-01_1_S1_dedup_bwa_BSQR.table
(I run it through a Conda installation of GATK (link), which shouldn't matter.)
I've searched a lot about this. There seem to have been many discussions of the subject on the GATK forums, but for some reason those forum pages are no longer available.
As far as I know, BaseRecalibrator cannot be parallelized unless I run it with Spark. However, the Spark version of the program (BaseRecalibratorSpark) is still in beta, so I am cautious about using it.
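A possible non-Spark workaround would be to scatter the run by interval: BaseRecalibrator accepts -L to restrict it to a region, and GATK includes GatherBQSRReports to merge the resulting per-interval tables. Below is a minimal, untested sketch of that idea. It assumes the contigs in hg38.fa are named chr1..chr22, chrX and chrY; the per-contig table names, the DRY_RUN switch and the helper functions are made up for illustration. By default it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Untested sketch: scatter BaseRecalibrator by contig, then merge the tables.
set -euo pipefail

BAM="NG-01_1_S1_dedup_bwa.bam"
REF="/rumi/shams/genomes/hg38/hg38.fa"
PREFIX="NG-01_1_S1_dedup_bwa_BSQR"   # same prefix as the output table in my command
DRY_RUN="${DRY_RUN:-1}"              # hypothetical switch: 1 = print only, 0 = run

KNOWN_SITES=(
  --known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
  --known-sites 1000G_phase1.snps.high_confidence.hg38.vcf.gz
  --known-sites Homo_sapiens_assembly38.dbsnp138.vcf
)

# Assumption: contigs are named chr1..chr22, chrX, chrY in the reference.
CONTIGS=()
for c in $(seq 1 22) X Y; do CONTIGS+=("chr${c}"); done

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

# One BaseRecalibrator job restricted to a single contig via -L.
recal_contig() {
  local contig="$1"
  run gatk BaseRecalibrator -I "$BAM" -R "$REF" "${KNOWN_SITES[@]}" \
    -L "$contig" -O "${PREFIX}.${contig}.table"
}

# Merge the per-contig tables into one recalibration table.
gather_tables() {
  local inputs=()
  local contig
  for contig in "${CONTIGS[@]}"; do
    inputs+=(-I "${PREFIX}.${contig}.table")
  done
  run gatk GatherBQSRReports "${inputs[@]}" -O "${PREFIX}.table"
}

# Scatter: one background job per contig, then wait for all of them.
# Note: a bare `wait` does not report failures of individual jobs.
for contig in "${CONTIGS[@]}"; do
  recal_contig "$contig" &
done
wait
gather_tables
```

Running it with DRY_RUN=0 would launch 24 jobs per BAM file at once, so with 10 samples the number of simultaneous jobs would need to be capped to fit the 88 cores. Reads that are not assigned to any of the listed contigs are skipped with this kind of split.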
The BAM files I run it on are rather large (~40 GB each). I run 10 commands in parallel on a server with 88 cores and 400 GB of RAM; each process has been running for 4 days and is still not done. However, BaseRecalibrator generally seems to finish in ~5 hours per exome (see, for example, @Nicolas Rosewick's comments in this post).
Any recommendations on how I can speed it up?