Question: How to use HaplotypeCallerSpark
nanoide wrote (11 months ago):

Hi all. I wanted to try the HaplotypeCaller Spark implementation in GATK4. I'm aware it's in beta and not yet fully recommended, but we want to try it.

So I wanted to ask about its usage; I looked for documentation but I'm not clear on some errors I'm getting. I also wanted to ask about the --strict option: it's supposed to give results similar to the non-Spark HaplotypeCaller, but at lower speed. Does anyone know whether the run is still faster even then?

So with HaplotypeCallerSpark I'm using Java 1.8 and the following command:

gatk HaplotypeCallerSpark --java-options "-Xmx4g" -R ref.fa -I XX.bam -O XX.vcf.gz -ERC GVCF --native-pair-hmm-threads 8 --spark-master local[8] --conf 'spark.executor.cores=8'

First of all, in Stage 1 I'm getting too many threads. These are the relevant lines in the log, I think:

INFO DAGScheduler: Submitting 60 missing tasks from ShuffleMapStage 1 (MapPartitionsRDD[22] at mapToPair at (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
INFO TaskSchedulerImpl: Adding task set 1.0 with 60 tasks
INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, executor driver, partition 0, PROCESS_LOCAL, 9311 bytes)

Then it starts to open threads, up to 60, with lines like:

INFO Executor: Running task 0.0 in stage 1.0 (TID 1)

And some of them fail with:

java.lang.ArrayIndexOutOfBoundsException: -1
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$
    at java.util.concurrent.ThreadPoolExecutor.runWorker(
    at java.util.concurrent.ThreadPoolExecutor$
ERROR Executor: Exception in task 35.0 in stage 1.0 (TID 36)

So then Stage 1 is cancelled. I think this may be related to the use of so many threads, because I'm on a queuing system. Does anyone know how to control this?
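One way to approach this, sketched below, is to stop hard-coding the thread counts and instead derive them from whatever the scheduler actually grants. This is only an illustration under assumptions: the `SLURM_CPUS_PER_TASK` variable is SLURM-specific (other queuing systems expose a different variable), the fallback of 4 is arbitrary, and `spark.default.parallelism` is a standard Spark property that influences how many shuffle partitions (and hence tasks) get created.

```shell
# Sketch: size Spark's local thread pool to the cores the queuing system
# actually allocated, rather than a fixed local[8].
# SLURM_CPUS_PER_TASK is scheduler-provided (SLURM); the ":-4" fallback
# and all file names here are illustrative.
CORES="${SLURM_CPUS_PER_TASK:-4}"

gatk HaplotypeCallerSpark --java-options "-Xmx4g" \
    -R ref.fa -I XX.bam -O XX.vcf.gz -ERC GVCF \
    --native-pair-hmm-threads "$CORES" \
    --spark-master "local[$CORES]" \
    --conf "spark.executor.cores=$CORES" \
    --conf "spark.default.parallelism=$CORES"
```

With `local[$CORES]`, Spark runs at most that many tasks concurrently even if a stage still contains 60 tasks; they are simply queued rather than run all at once.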


lakhujanivijay wrote (10 months ago):

HaplotypeCallerSpark is in beta and behaves unexpectedly. My experience with it is that:

  • no matter how big or small the dataset is, it sometimes fails and sometimes works!
  • changing the value of the --native-pair-hmm-threads parameter does not help. I tried it with 10 (optimum), 4 (default), and 54 (hyperthreading):
    • sometimes it fails with 54 and works with 10
    • sometimes it fails with both 54 and 10 and works with the default 4
    • sometimes it fails with the default 4

I am also looking for a resolution.



Powered by Biostar version 2.3.0