Scatter-gather principle by chromosome in GATK
17 months ago
Sushant • 0

Hi all,

While trying to optimize a GATK pipeline, I came across the scatter-gather principle, so I did the following:

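 pids=
 for chr in chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY
 do
     # Scatter: one HaplotypeCaller job per chromosome, run in the background
     gatk --java-options "-Xmx5g" HaplotypeCaller -R "$hg38" -I dedup.bqsr.bam --dbsnp assembly38.dbsnp138.vcf.gz -O "$chr.vcf" --native-pair-hmm-threads 10 -L "$chr" &
     pids+=" $!"
 done
 # Block until every background job has finished before the merge (gather) step
 wait $pids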

I used similar logic for BaseRecalibrator and ApplyBQSR. The merge step looks like this:

gatk --java-options "-Xmx16g" MergeVcfs \
  I=$out1/variantCaller/HaplotypeCaller/chr1.vcf \
  ..
  I=$out1/variantCaller/HaplotypeCaller/chr22.vcf \
  I=$out1/variantCaller/HaplotypeCaller/chrX.vcf \
  I=$out1/variantCaller/HaplotypeCaller/chrY.vcf \
  O=$out1/variantCaller/HaplotypeCaller/output_variants_16-chunk_merged.vcf.gz

However, the merged output is not the same as the output from running the entire genome in a single job.
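To tell apart differences in the variant records from differences in the file headers, here is a minimal sketch, assuming both outputs are plain or gzip-compressed VCFs. The helper names `vcf_records` and `same_records` are made up for illustration and are not part of GATK. The idea is to drop the `#` header lines, which may legitimately differ between a whole-genome run and a merged run, and compare only the records by checksum.

```shell
# Hypothetical helper: print only the variant records of a VCF,
# dropping every header line (lines starting with '#').
vcf_records() {
    # gzip -cdf decompresses gzip input and passes plain text through unchanged
    gzip -cdf -- "$1" | grep -v '^#'
}

# Hypothetical helper: succeed (exit 0) if both VCFs contain the same
# records in the same order, judged by checksum and byte count.
same_records() {
    sum_a=$(vcf_records "$1" | cksum)
    sum_b=$(vcf_records "$2" | cksum)
    [ "$sum_a" = "$sum_b" ]
}
```

With a hypothetical whole-genome file name, usage would look like `same_records whole_genome.vcf.gz output_variants_16-chunk_merged.vcf.gz && echo "records match"`. If the check fails, running `diff` on the two `vcf_records` outputs shows which records differ.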

Also, for ApplyBQSR, the sum of the per-chromosome BAM file sizes does not equal the size of the BAM file produced when the whole genome is processed in one run.

Questions:

  1. Is this approach legitimate? If yes, what can I do to make the outputs match?
  2. If the outputs will never match, how can I convince the bioinformatics team that this is a correct approach?
  3. If this approach to scatter-gather is wrong, what should I try next?
haplotypecaller gatk bqsr