I've been browsing forums trying to figure out how to run HaplotypeCaller from GATK more efficiently. I want to keep each submitted job small, because I've had jobs time out on my cluster (SLURM), and even the ones that finish take forever. I had been running:
"$gatk" HaplotypeCaller -R "$reference" -I "$bam" -O "$out"."$gvcf".g.vcf -ERC GVCF
with the appropriate variables filled in. I want to break up the BAM files with the -L option, splitting them into manageable chunks by scaffold. The header of a BAM file looks like this:
[mgdesaix@xxxx]$ samtools view Plate1.18N00490_RG.bam -H
@HD VN:1.6 SO:coordinate
@SQ SN:scaffold1|size5275185 LN:5275358
@SQ SN:scaffold2|size3399639 LN:3399639
@SQ SN:scaffold3|size3342599 LN:3342599
@SQ SN:scaffold4|size3742848 LN:3742848
etc., ending with
@SQ SN:scaffold45765|size8352 LN:8352
@SQ SN:scaffold47060|size9013 LN:9013
@SQ SN:scaffold47992|size8514 LN:8514
followed by the read-group lines (ID, LB, etc.).
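For context, here's a rough sketch of how I'm imagining grouping scaffolds into interval files. The header lines are inlined below just for illustration (in practice I'd pipe from samtools view -H), and the chunk size of 2 and the chunk_ file names are placeholders, not anything GATK requires:

```shell
# Pull the full scaffold names (the SN: field, pipe character included)
# out of the @SQ header lines and split them into interval files, one
# scaffold per line. The header is inlined here; normally it would come
# from `samtools view -H Plate1.18N00490_RG.bam`.
header='@SQ SN:scaffold1|size5275185 LN:5275358
@SQ SN:scaffold2|size3399639 LN:3399639
@SQ SN:scaffold3|size3342599 LN:3342599
@SQ SN:scaffold4|size3742848 LN:3742848'

printf '%s\n' "$header" \
  | awk '/^@SQ/ { for (i = 1; i <= NF; i++)
                    if ($i ~ /^SN:/) { sub(/^SN:/, "", $i); print $i } }' \
  | split -l 2 - chunk_

# GATK treats a -L argument ending in .list as a file of intervals
# (one per line), so give each chunk that extension.
for f in chunk_*; do mv "$f" "$f.list"; done
```

My understanding is that each resulting chunk_*.list file could then be passed directly to -L.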
To use -L, would I specify

"$gatk" HaplotypeCaller -R "$reference" -I "$bam" -O "$out"."$gvcf".g.vcf -ERC GVCF -L scaffold1:10000

to produce a GVCF covering only the first 10,000 scaffolds? And if so, once I've generated multiple GVCFs per individual, split up by scaffold, does GATK seamlessly merge them by ID when I later combine everything, or is there an extra step needed to make sure each individual's chunks are recombined? (Individuals here are identified by a unique ID.)
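To make the question concrete, this is the per-chunk loop I have in mind. The commands are only echoed into a file here rather than executed, since gatk, ref.fa, and sample.bam are placeholder names, and the chunk_*.list files are hypothetical interval lists (one scaffold name per line):

```shell
# Build one HaplotypeCaller command line per interval file. The commands
# are written to cmds.txt instead of being run, because the tool paths
# and data files named here are placeholders.
gatk=gatk
reference=ref.fa
bam=sample.bam
out=sample

printf 'chunk_aa.list\nchunk_ab.list\n' > chunks.txt   # hypothetical interval files

while read -r chunk; do
  name=${chunk%.list}   # e.g. chunk_aa
  echo "$gatk HaplotypeCaller -R $reference -I $bam -O $out.$name.g.vcf -ERC GVCF -L $chunk"
done < chunks.txt > cmds.txt

# In practice, each line of cmds.txt would become its own SLURM job,
# e.g. one array task per chunk.
```

My tentative understanding (please correct me if this is wrong) is that once all chunks for one individual finish, the per-chunk GVCFs still have to be concatenated explicitly, e.g. with Picard MergeVcfs or GatherVcfs, before moving on to CombineGVCFs or GenomicsDBImport across individuals.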
If anyone has input on this, or can point me to an example, that would be great! So far I haven't been able to figure it out from my reading of the GATK documentation and browsing elsewhere.