I'm trying (and failing) to multi-thread HaplotypeCaller in GATK 4. I read in a few places online that multi-threading in GATK 4 has become trickier, maybe even infeasible, but everything I found on this seems to be more than a year old. Is there a newer solution to that problem?
PS: I've read in a few places about Spark, but I still have no idea what it is or how to use it.
Here's what I have at this point:
java -Xmx16g -XX:ParallelGCThreads=1 -jar gatk-package-184.108.40.206-local.jar HaplotypeCaller -R myfasta.fasta -I mybam.bam -O mygvcf.g.vcf --emit-ref-confidence GVCF --min-dangling-branch-length 1 --min-pruning 1 --num_cpu_threads_per_data_thread 2
A USER ERROR has occurred: num_cpu_threads_per_data_thread is not a recognized option
I note that the official documentation for the Spark implementation of HaplotypeCaller still says the following:
Does anybody happen to know if the Broad folks are simply being extra cautious, or should this warning be taken at face value?
Honestly, I don't know if they are just overly cautious or if there are still problems with HaplotypeCallerSpark.
raf.marcondes, in view of the above warning, you should stick to a "poor man's" parallelism using
Keep in mind that you can ask Broad personnel yourself on their forum: https://gatkforums.broadinstitute.org/gatk/categories/ask-the-team
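One common way to get this kind of "poor man's" parallelism (an assumption here, not necessarily the exact method meant above) is to split the job by contig with -L and run one HaplotypeCaller process per contig. The sketch below only prints the commands. The contig names chr1 to chr3 and the -Xmx4g heap size are hypothetical placeholders; the file names and the remaining options come from the question.

```shell
#!/bin/sh
# Sketch only: print one HaplotypeCaller command per contig so that the
# commands can be run side by side. Nothing is executed here.
set -eu

REF=myfasta.fasta            # file names taken from the question
BAM=mybam.bam
CONTIGS="chr1 chr2 chr3"     # assumption: replace with the contigs in your reference

# Print the command line for a single contig ($1).
# -L restricts calling to that interval, so each process handles one contig.
hc_cmd() {
    echo "./gatk --java-options -Xmx4g HaplotypeCaller" \
         "-R $REF -I $BAM -L $1" \
         "-O mygvcf.$1.g.vcf --emit-ref-confidence GVCF"
}

for c in $CONTIGS; do
    hc_cmd "$c"
done
```

The printed lines can then be handed to whatever job runner is available (a cluster scheduler, GNU parallel, or plain background jobs), and the per-contig gVCFs have to be merged into one file afterwards.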
Thank you SO MUCH for your helpful answer! I'm unsure what you mean by "running GATK4 using the bundled script" though. Do I need to re-install GATK in a different way to do that?
I don't think you need to reinstall. These are the contents of my GATK folder:
The first entry, named simply gatk, is a Python wrapper script that should be used instead of the jar file:
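As a rough sketch, a call through the wrapper might look like the following. It reuses the file names and options from the question, passes the JVM settings through --java-options, and replaces the unrecognized GATK 3 threading flag with --native-pair-hmm-threads, which only multi-threads the PairHMM step rather than the whole tool. The GATK_DIR variable is a hypothetical name for the folder holding the wrapper, and the script only prints the command if the wrapper is not found there.

```shell
#!/bin/sh
# Sketch only: run HaplotypeCaller through the bundled gatk wrapper
# instead of calling java -jar on the jar file directly.
set -eu

# Assumption: GATK_DIR is the folder that contains the gatk wrapper script.
GATK_DIR="${GATK_DIR:-.}"

# Build the argument list once. JVM flags go through --java-options;
# --native-pair-hmm-threads sets the thread count for the PairHMM step only.
set -- \
    --java-options "-Xmx16g -XX:ParallelGCThreads=1" \
    HaplotypeCaller \
    -R myfasta.fasta \
    -I mybam.bam \
    -O mygvcf.g.vcf \
    --emit-ref-confidence GVCF \
    --min-dangling-branch-length 1 \
    --min-pruning 1 \
    --native-pair-hmm-threads 2

if [ -f "$GATK_DIR/gatk" ] && [ -x "$GATK_DIR/gatk" ]; then
    "$GATK_DIR/gatk" "$@"
else
    # Wrapper not found: show what would have been run.
    echo "gatk wrapper not found in $GATK_DIR; would run: gatk $*"
fi
```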