Question

Scatter Gather principle by chromosome on Gatk

0

Entering edit mode

17 months ago

Sushant • 0

Hi all,

On a quest to optimize gatk pipeline, I met scatter gather principle, so I did following,

 pids=
 for chr in chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13  chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY
  do
    gatk --java-options "-Xmx5g" HaplotypeCaller -R $hg38 -I dedup.bqsr.bam --dbsnp assembly38.dbsnp138.vcf.gz -O $chr.vcf --native-pair-hmm-threads 10 -L $chr &
    pids+=" $!"
  done;

Similar logic for BaseRecalibrator and ApplyBQSR Merging is like,

gatk --java-options "-Xmx16g" MergeVcfs \
  I=$out1/variantCaller/HaplotypeCaller/chr1.vcf \
  ..
  I=$out1/variantCaller/HaplotypeCaller/chr22.vcf \
  I=$out1/variantCaller/HaplotypeCaller/chrX.vcf \
  I=$out1/variantCaller/HaplotypeCaller/chrY.vcf \
  O=$out1/variantCaller/HaplotypeCaller/output_variants_16-chunk_merged.vcf.gz

However, the output after merging them is not same as when running as entire genome together.

Also for ApplyBQSR, the summation of bam file size of individual chromosome is not equal to bam file size when passed without breaking.

Questions-

Is this approach legitimate? if yes, what can I do to match the output.
If it wont match ever, how can I make bioinformatics team to understand it is correct approach.
If the above approach for scatter gather is wrong, what can I do next?

haplotypecaller gatk bqsr • 442 views

ADD COMMENT • link updated 17 months ago by GenoMax 141k • written 17 months ago by Sushant • 0