Question

How to use HaplotypeCallerSpark

4.7 years ago

nanoide ▴ 120

Hi all. I wanted to try the HaplotypeCaller Spark implementation in GATK4. I'm aware it's in beta and not fully recommended yet, but we want to try it.

I wanted to ask about its usage: I looked through the documentation, but I'm not clear on some errors I'm getting. I also wanted to ask about the --strict option. It's supposed to give results similar to the non-Spark HaplotypeCaller, but at lower speed. Does anyone know whether, even with --strict, it still runs faster than the non-Spark version?

With HaplotypeCallerSpark I'm using Java 1.8 and this command line:

gatk HaplotypeCallerSpark --java-options "-Xmx4g" -R ref.fa -I XX.bam -O XX.vcf.gz -ERC GVCF --native-pair-hmm-threads 8 --spark-master local[8] --conf 'spark.executor.cores=8'

First of all, in Stage 1 I'm getting too many threads. I think these are the relevant lines in the log:

INFO DAGScheduler: Submitting 60 missing tasks from ShuffleMapStage 1 (MapPartitionsRDD[22] at mapToPair at SparkSharder.java:247) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
INFO TaskSchedulerImpl: Adding task set 1.0 with 60 tasks
INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, executor driver, partition 0, PROCESS_LOCAL, 9311 bytes)

Then it starts opening threads, up to 60, with lines like:

INFO Executor: Running task 0.0 in stage 1.0 (TID 1)

And some of them fail with:

java.lang.ArrayIndexOutOfBoundsException: -1
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
ERROR Executor: Exception in task 35.0 in stage 1.0 (TID 36)
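
For anyone comparing runs, here is a rough sketch of how the log lines above could be summarized: how many tasks each stage was given, and which task indices raised an exception. The function name `summarize` and the two regular expressions are my own, written against the exact log lines quoted in this post; other Spark versions may word these messages differently.

```python
import re

# Patterns written against the log lines quoted in this post (an assumption:
# other Spark versions may format these messages differently).
TASK_SET = re.compile(r"Adding task set (\d+)\.\d+ with (\d+) tasks")
FAILED = re.compile(r"ERROR Executor: Exception in task (\d+)\.\d+ in stage (\d+)\.\d+")


def summarize(lines):
    """Return ({stage: task_count}, {stage: set_of_failed_task_indices})."""
    tasks_per_stage = {}
    failed = {}
    for line in lines:
        match = TASK_SET.search(line)
        if match:
            tasks_per_stage[int(match.group(1))] = int(match.group(2))
            continue
        match = FAILED.search(line)
        if match:
            stage = int(match.group(2))
            failed.setdefault(stage, set()).add(int(match.group(1)))
    return tasks_per_stage, failed


# Lines copied from the log excerpt in this post.
sample = [
    "INFO TaskSchedulerImpl: Adding task set 1.0 with 60 tasks",
    "INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, executor driver, partition 0, PROCESS_LOCAL, 9311 bytes)",
    "ERROR Executor: Exception in task 35.0 in stage 1.0 (TID 36)",
]

if __name__ == "__main__":
    print(summarize(sample))
```

Run over a full log, this only counts tasks and failures; it says nothing about how many tasks were running at the same moment.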

So then Stage 1 is cancelled. I think this may be related to the use of so many threads, because I'm on a queuing system. Does anyone know how to control this?
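
On controlling the thread count: as far as I understand Spark's local mode, `--spark-master local[N]` runs at most N tasks at a time, so the 60 in the log would be the number of tasks (one per partition) rather than 60 simultaneous threads. A minimal sketch, assuming the job runs under a scheduler that exports its core allocation in an environment variable, is to derive N from that allocation and reuse it for every thread-related flag. The helper names `allocated_cores` and `build_command` are hypothetical, and the variable names checked (`SLURM_CPUS_PER_TASK`, `NSLOTS`, `PBS_NP`) are assumptions about the queuing system; the GATK flags are exactly the ones from the command above. Whether this avoids the ArrayIndexOutOfBoundsException is not something I can confirm.

```python
import os
import shlex


def allocated_cores(env=os.environ, default=1):
    """Return the core count granted by the scheduler, or `default`.

    The variable names are assumptions (Slurm, SGE, Torque); adjust them
    to whatever your queuing system actually exports.
    """
    for name in ("SLURM_CPUS_PER_TASK", "NSLOTS", "PBS_NP"):
        value = env.get(name, "")
        if value.isdigit() and int(value) > 0:
            return int(value)
    return default


def build_command(cores, reference="ref.fa", bam="XX.bam", out="XX.vcf.gz"):
    """Build the command from this post with one consistent core count."""
    cores = max(1, int(cores))
    return [
        "gatk", "HaplotypeCallerSpark",
        "--java-options", "-Xmx4g",
        "-R", reference, "-I", bam, "-O", out,
        "-ERC", "GVCF",
        "--native-pair-hmm-threads", str(cores),
        "--spark-master", f"local[{cores}]",
        "--conf", f"spark.executor.cores={cores}",
    ]


if __name__ == "__main__":
    # Print a shell-quoted command line instead of running it, so it can be
    # inspected first (shlex.join needs Python 3.8+).
    print(shlex.join(build_command(allocated_cores())))
```

Printing the command rather than executing it keeps the sketch safe to try on a login node; the quoting also protects `local[8]` from being treated as a glob pattern by the shell.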

Thanks

snp GATK • 2.2k views

updated 4.6 years ago by lakhujanivijay 5.8k • written 4.7 years ago by nanoide ▴ 120

score 1 · Answer 1 · 2019-09-27

HaplotypeCallerSpark is in beta and behaves unpredictably. My experience with it:

- It does not matter how big or small the dataset is: sometimes it fails and sometimes it works.
- Changing the value of the --native-pair-hmm-threads parameter does not help. I tried 10 (optimum), 4 (default) and 54 (hyperthreading):
  - sometimes it fails with 54 and works with 10
  - sometimes it fails with both 54 and 10 and works with the default 4
  - sometimes it fails with the default 4

I am also looking for a resolution.