I would like to know how to use Spark within GATK for multi-threaded analysis. Unfortunately, the Broad Institute's documentation for its cluster Spark tutorial is still in progress. I am using HaplotypeCaller, which has been working fine, but I now have some pooled-seq samples that take much longer, so I would like to spread the workload. This is an example of my usage:
gatk HaplotypeCaller -I my_pooled_sample.bam -I another_pooled_sample.bam -L a_chromosome -R ref_genome.fna -O my_out_file.g.vcf -ploidy 10 -- --spark-master local
I took the Spark arguments above from this example, but it didn't work. I checked the help info and got this:
> gatk forwards commands to GATK and adds some sugar for submitting spark jobs
> --spark-runner <target> controls how spark tools are run
> valid targets are:
> LOCAL: run using the in-memory spark runner
> SPARK: run using spark-submit on an existing cluster
> --spark-master must be specified
> --spark-submit-command may be specified to control the Spark submit command
> arguments to spark-submit may optionally be specified after --
> GCS: run using Google cloud dataproc
> commands after the -- will be passed to dataproc
> --cluster <your-cluster> must be specified after the --
> spark properties and some common spark-submit parameters will be translated
> to dataproc equivalents
I then tried using:
That also didn't work. I would appreciate some guidance. Many thanks.