On a quest to optimize gatk pipeline, I met scatter gather principle, so I did following,
pids= for chr in chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY do gatk --java-options "-Xmx5g" HaplotypeCaller -R $hg38 -I dedup.bqsr.bam --dbsnp assembly38.dbsnp138.vcf.gz -O $chr.vcf --native-pair-hmm-threads 10 -L $chr & pids+=" $!" done;
Similar logic for BaseRecalibrator and ApplyBQSR Merging is like,
gatk --java-options "-Xmx16g" MergeVcfs \ I=$out1/variantCaller/HaplotypeCaller/chr1.vcf \ .. I=$out1/variantCaller/HaplotypeCaller/chr22.vcf \ I=$out1/variantCaller/HaplotypeCaller/chrX.vcf \ I=$out1/variantCaller/HaplotypeCaller/chrY.vcf \ O=$out1/variantCaller/HaplotypeCaller/output_variants_16-chunk_merged.vcf.gz
However, the output after merging them is not same as when running as entire genome together.
Also for ApplyBQSR, the summation of bam file size of individual chromosome is not equal to bam file size when passed without breaking.
- Is this approach legitimate? if yes, what can I do to match the output.
- If it wont match ever, how can I make bioinformatics team to understand it is correct approach.
- If the above approach for scatter gather is wrong, what can I do next?