Question: Parallelization in GATK 4
4 weeks ago, raf.marcondes wrote:

Hi all,

I'm trying (and failing) to multi-thread HaplotypeCaller in GATK 4. I've read in a few places online that multi-threading in GATK 4 has become trickier, maybe even unfeasible, but everything I found on the subject seems to be more than a year old. Is there a new solution to that problem?

PS: I've read in a few places about Spark, but I still have no idea what it is or how to use it.

Here's what I have at this point:

   java -Xmx16g -XX:ParallelGCThreads=1 -jar gatk-package-4.1.3.0-local.jar HaplotypeCaller -R myfasta.fasta -I mybam.bam -O mygvcf.g.vcf --emit-ref-confidence GVCF --min-dangling-branch-length 1 --min-pruning 1 --num_cpu_threads_per_data_thread 2

A USER ERROR has occurred: num_cpu_threads_per_data_thread is not a recognized option
4 weeks ago, h.mon (Brazil) wrote:

To start, you should not be using java -jar gatk-package-4.1.3.0-local.jar with GATK4; the recommended and supported way to run GATK4 is through the bundled script:

gatk --java-options "-Xmx16g -XX:ParallelGCThreads=1" [...]

In GATK4, multithreading is implemented using Spark; see "Document how multi-threading support works in GATK4". As you noted, documentation is scattered and scarce, e.g. "(How to) Run Spark-enabled GATK tools on a local multi-core machine".

Based on this Spark GATK4 page, you can try:

 gatk --java-options "-Xmx16g -XX:ParallelGCThreads=1" --spark-master local[2] \
    HaplotypeCaller -R myfasta.fasta -I mybam.bam -O mygvcf.g.vcf \
    --emit-ref-confidence GVCF --min-dangling-branch-length 1 --min-pruning 1

Edit: another common method for parallelizing HaplotypeCaller is to use the -L option to restrict each run to one chromosome and to process several chromosomes simultaneously; see "Intervals and interval lists".
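
The -L approach can be sketched as a small shell script. This is only an illustration under stated assumptions: the contig names (chr1, chr2, chr3), the per-job heap size (-Xmx4g), the output naming scheme, and the GATK_BIN override are placeholders that do not come from this thread; the input file names are the ones used in the question.

```shell
#!/usr/bin/env bash
# Sketch: scatter HaplotypeCaller over contigs with -L, one process per contig.
# GATK_BIN is a hypothetical override (e.g. GATK_BIN=echo for a dry run);
# by default the bundled gatk wrapper script is used.
GATK_BIN="${GATK_BIN:-gatk}"

call_contig() {
    # Run HaplotypeCaller restricted to a single contig.
    # -Xmx4g is an assumed per-job heap size; each job gets its own JVM.
    local contig="$1"
    "$GATK_BIN" --java-options "-Xmx4g -XX:ParallelGCThreads=1" HaplotypeCaller \
        -R myfasta.fasta -I mybam.bam -L "$contig" \
        -O "mygvcf.${contig}.g.vcf" --emit-ref-confidence GVCF
}

call_all() {
    # Start one background job per contig given as an argument,
    # then block until every job has finished.
    local contig
    for contig in "$@"; do
        call_contig "$contig" &
    done
    wait
}

# Example (placeholder contig names):
#   call_all chr1 chr2 chr3
```

Calling call_all with three contigs starts three HaplotypeCaller processes at once, so the number of contigs per batch should not exceed the cores and memory available. The per-contig GVCFs still have to be combined afterwards, for example with a tool such as MergeVcfs.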


I note that the official documentation for the spark implementation of HaplotypeCaller still says the following:

This tool DOES NOT match the output of HaplotypeCaller. It is still under development and should not be used for production work. For evaluation only. Use the non-spark HaplotypeCaller if you care about the results.

Does anybody happen to know if the Broad folks are simply being extra cautious, or should this warning be taken at face value?

written 4 weeks ago by Dave Carlson

Honestly, I don't know if they are just overly cautious or if there are still problems with HaplotypeCallerSpark.

raf.marcondes, in view of the above warning, you should stick to "poor man's" parallelism using -L.

written 4 weeks ago by h.mon

Keep in mind that you can ask Broad personnel yourself on their forum: https://gatkforums.broadinstitute.org/gatk/categories/ask-the-team

written 4 weeks ago by steve

Thank you SO MUCH for your helpful answer! I'm unsure what you mean by "running GATK4 using the bundled script" though. Do I need to re-install GATK in a different way to do that?

written 4 weeks ago by raf.marcondes

I don't think you need to reinstall. These are the contents of my GATK folder:

ls -lh ~/bin/GATK-4.1.4.0/
total 407M
-rwxr-xr-x 1 hmon hmon  20K Oct  8 15:34 gatk
-rw-r--r-- 1 hmon hmon 851K Oct  8 15:34 gatk-completion.sh
-rw-r--r-- 1 hmon hmon  964 Oct  8 15:34 gatkcondaenv.yml
-rw-r--r-- 1 hmon hmon 3.6K Oct  8 15:34 GATKConfig.EXAMPLE.properties
drwxr-xr-x 2 hmon hmon  68K Oct  8 15:34 gatkdoc
-rw-r--r-- 1 hmon hmon 271M Oct  8 15:34 gatk-package-4.1.4.0-local.jar
-rw-r--r-- 1 hmon hmon 135M Oct  8 15:34 gatk-package-4.1.4.0-spark.jar
-rw-r--r-- 1 hmon hmon 113K Oct  8 15:34 gatkPythonPackageArchive.zip
-rw-r--r-- 1 hmon hmon  38K Oct  8 15:34 README.md
drwxr-xr-x 5 hmon hmon 4.0K Oct  8 15:34 scripts
  

The first entry, named simply gatk, is a Python wrapper script that should be used instead of calling the jar file directly:

head -n 17 ~/bin/GATK-4.1.4.0/gatk
#!/usr/bin/env python
#
# Launcher script for GATK tools. Delegates to java -jar, spark-submit, or gcloud as appropriate,
# and sets many important Spark and htsjdk properties before launch.
#
# If running a non-Spark tool, or a Spark tool in local mode, will search for GATK executables
# as follows:
#     -If the GATK_LOCAL_JAR environment variable is set, uses that jar
#     -Otherwise if the GATK_RUN_SCRIPT created by "gradle installDist" exists, uses that
#     -Otherwise uses the newest local jar in the same directory as the script or the BIN_PATH
#      (in that order of precedence)
#
# If running a Spark tool, searches for GATK executables as follows:
#     -If the GATK_SPARK_JAR environment variable is set, uses that jar
#     -Otherwise uses the newest Spark jar in the same directory as the script or the BIN_PATH
#      (in that order of precedence)
#
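
As a rough sketch of what using the wrapper could look like in practice, assuming GATK was unpacked to ~/bin/GATK-4.1.4.0 as in the listing above (adjust the path to your own install), the directory can be added to PATH, and GATK_LOCAL_JAR can optionally be set to pin the jar, following the search order described in the script header:

```shell
# Assumed install location, taken from the directory listing in this reply.
GATK_DIR="$HOME/bin/GATK-4.1.4.0"

# Make the gatk wrapper script callable from anywhere.
export PATH="$GATK_DIR:$PATH"

# Optional: tell the wrapper exactly which local jar to use
# (it otherwise picks the newest local jar next to the script).
export GATK_LOCAL_JAR="$GATK_DIR/gatk-package-4.1.4.0-local.jar"

# If the wrapper is found, list the available tools as a quick check.
if command -v gatk >/dev/null 2>&1; then
    gatk --list
fi
```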
  
written 4 weeks ago by h.mon

28 days ago, raf.marcondes wrote:

Just to follow up, I figured this out. Here's how to make HaplotypeCallerSpark work, using 2 cores:

gatk --java-options "-Xmx16g -XX:ParallelGCThreads=1" HaplotypeCallerSpark --spark-master local[2] -R myfasta.fasta -I mybam.bam -O mygvcf.g.vcf --emit-ref-confidence GVCF --min-dangling-branch-length 1 --min-pruning 1