Question: Parallelization in GATK 4
10 months ago by
raf.marcondes30 wrote:

Hi all,

I'm trying (and failing) to multi-thread HaplotypeCaller in GATK 4. I've read in a few places online that multi-threading in GATK 4 has become trickier, maybe even infeasible, but everything I've found on this seems to be more than a year old. Is there a newer solution to this problem?

PS: I've read in a few places about Spark, but I still have no idea what it is or how to use it.

Here's what I have at this point:

   java -Xmx16g -XX:ParallelGCThreads=1 -jar gatk-package-4.1.3.0-local.jar HaplotypeCaller -R myfasta.fasta -I mybam.bam -O mygvcf.g.vcf --emit-ref-confidence GVCF --min-dangling-branch-length 1 --min-pruning 1 --num_cpu_threads_per_data_thread 2

A USER ERROR has occurred: num_cpu_threads_per_data_thread is not a recognized option
10 months ago by
h.mon30k
Brazil
h.mon30k wrote:

To start with, you should not run GATK4 with java -jar gatk-package-4.1.3.0-local.jar; the recommended and supported way to run GATK4 is the bundled script:

gatk --java-options "-Xmx16g -XX:ParallelGCThreads=1" [...]

In GATK4, multithreading is implemented using Spark; see "Document how multi-threading support works in GATK4". As you noted, the documentation is scattered and scarce, e.g. "(How to) Run Spark-enabled GATK tools on a local multi-core machine".

Based on this Spark GATK4 page, you can try:

 gatk --java-options "-Xmx16g -XX:ParallelGCThreads=1" HaplotypeCallerSpark \
    --spark-master local[2] \
    -R myfasta.fasta -I mybam.bam -O mygvcf.g.vcf \
    --emit-ref-confidence GVCF --min-dangling-branch-length 1 --min-pruning 1

Edit: another common way to parallelize HaplotypeCaller is to use the -L option to restrict each run to one chromosome and process several chromosomes simultaneously; see "Intervals and interval lists".
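
To make the -L approach concrete, here is a minimal sketch, not a tested pipeline. It starts one HaplotypeCaller process per contig and lets xargs keep a fixed number running at once. The GATK and JOBS variables, the per-contig output names, and the 8 GB heap per job (so that two jobs fit in the original 16 GB) are all assumptions for illustration; the contig names must match those in your reference index.

```shell
#!/usr/bin/env bash
# Sketch only: run one HaplotypeCaller job per contig, a few at a time.
# Assumptions: GATK and JOBS are illustrative shell variables (not GATK
# settings), the output naming is hypothetical, and 8g per job is a guess.

GATK=${GATK:-gatk}   # launcher to call; set GATK=echo for a dry run
JOBS=${JOBS:-2}      # how many contigs to process at the same time

scatter_hc() {
  # Each argument is one contig name as it appears in the reference index.
  # xargs substitutes the contig name for every {} and runs up to JOBS jobs.
  printf '%s\n' "$@" |
    xargs -P "$JOBS" -I {} "$GATK" --java-options "-Xmx8g -XX:ParallelGCThreads=1" \
      HaplotypeCaller -R myfasta.fasta -I mybam.bam -L {} \
      -O mygvcf.{}.g.vcf --emit-ref-confidence GVCF \
      --min-dangling-branch-length 1 --min-pruning 1
}
```

A call would look like `scatter_hc chr1 chr2 chr3` (hypothetical contig names). Running it first with `GATK=echo` prints the command lines without executing them. The per-contig GVCFs still have to be combined afterwards, which this sketch does not cover.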


I note that the official documentation for the Spark implementation of HaplotypeCaller still says the following:

"This tool DOES NOT match the output of HaplotypeCaller. It is still under development and should not be used for production work. For evaluation only. Use the non-spark HaplotypeCaller if you care about the results."

Does anybody happen to know if the Broad folks are simply being extra cautious, or should this warning be taken at face value?

written 10 months ago by Dave Carlson

Honestly, I don't know if they are just overly cautious or if there are still problems with HaplotypeCallerSpark.

raf.marcondes, in view of the above warning, you should stick to "poor man's" parallelism using -L.

written 10 months ago by h.mon

Keep in mind that you can ask Broad personnel yourself on their forum: https://gatkforums.broadinstitute.org/gatk/categories/ask-the-team

written 10 months ago by steve

Thank you SO MUCH for your helpful answer! I'm unsure what you mean by "running GATK4 using the bundled script" though. Do I need to re-install GATK in a different way to do that?

written 10 months ago by raf.marcondes

I don't think you need to reinstall. These are the contents of my GATK folder:

ls -lh ~/bin/GATK-4.1.4.0/
total 407M
-rwxr-xr-x 1 hmon hmon  20K Oct  8 15:34 gatk
-rw-r--r-- 1 hmon hmon 851K Oct  8 15:34 gatk-completion.sh
-rw-r--r-- 1 hmon hmon  964 Oct  8 15:34 gatkcondaenv.yml
-rw-r--r-- 1 hmon hmon 3.6K Oct  8 15:34 GATKConfig.EXAMPLE.properties
drwxr-xr-x 2 hmon hmon  68K Oct  8 15:34 gatkdoc
-rw-r--r-- 1 hmon hmon 271M Oct  8 15:34 gatk-package-4.1.4.0-local.jar
-rw-r--r-- 1 hmon hmon 135M Oct  8 15:34 gatk-package-4.1.4.0-spark.jar
-rw-r--r-- 1 hmon hmon 113K Oct  8 15:34 gatkPythonPackageArchive.zip
-rw-r--r-- 1 hmon hmon  38K Oct  8 15:34 README.md
drwxr-xr-x 5 hmon hmon 4.0K Oct  8 15:34 scripts
  

The first entry, named simply gatk, is a Python wrapper script that should be used instead of the jar file:

head -n 17 ~/bin/GATK-4.1.4.0/gatk
#!/usr/bin/env python
#
# Launcher script for GATK tools. Delegates to java -jar, spark-submit, or gcloud as appropriate,
# and sets many important Spark and htsjdk properties before launch.
#
# If running a non-Spark tool, or a Spark tool in local mode, will search for GATK executables
# as follows:
#     -If the GATK_LOCAL_JAR environment variable is set, uses that jar
#     -Otherwise if the GATK_RUN_SCRIPT created by "gradle installDist" exists, uses that
#     -Otherwise uses the newest local jar in the same directory as the script or the BIN_PATH
#      (in that order of precedence)
#
# If running a Spark tool, searches for GATK executables as follows:
#     -If the GATK_SPARK_JAR environment variable is set, uses that jar
#     -Otherwise uses the newest Spark jar in the same directory as the script or the BIN_PATH
#      (in that order of precedence)
#
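
In practice this means no reinstall is needed, only a way to reach the wrapper. A minimal sketch, assuming the same install directory as in the listing above (GATK_HOME is a hypothetical helper variable, not something GATK reads): put the folder on your PATH, and optionally pin the jar through the GATK_LOCAL_JAR variable named in the script header.

```shell
# Sketch only: make the bundled gatk wrapper callable from anywhere.
# GATK_HOME is a hypothetical helper variable; adjust the path to your install.
GATK_HOME="$HOME/bin/GATK-4.1.4.0"
export PATH="$GATK_HOME:$PATH"

# Optional: pin the local jar explicitly. Per the header above, the wrapper
# otherwise picks the newest local jar next to the script.
export GATK_LOCAL_JAR="$GATK_HOME/gatk-package-4.1.4.0-local.jar"

# After this, commands start with the wrapper instead of java -jar, e.g.:
#   gatk --java-options "-Xmx16g -XX:ParallelGCThreads=1" HaplotypeCaller ...
```

Adding the two export lines to your shell startup file makes the setting permanent.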
  
modified 10 months ago • written 10 months ago by h.mon
9 months ago by
raf.marcondes30 wrote:

Just to follow up, I figured this out. Here's how to make HaplotypeCallerSpark work, using 2 cores:

gatk --java-options  "-Xmx16g -XX:ParallelGCThreads=1" HaplotypeCallerSpark --spark-master local[2] -R myfasta.fasta -I mybam.bam -O mygvcf.g.vcf --emit-ref-confidence GVCF --min-dangling-branch-length 1 --min-pruning 1