Hi all,
I am trying to find SNP/Indel in a dataset using both "GATK" and "bcftools call". I find two big differences for generating VCF files with these tools and I appreciate your help and experience:
1- "GATK" result has higher "DP" value than "bcftools", any clue? e.g.
GATK.vcf ==> NC_000 747 . A G 4054 . AC=1;AF=1.00;AN=1;DP=113;FS=0.000;MLEAC=1;MLEAF=1.00;MQ=59.98;MQ0=0;QD=28.10;SOR=0.769 GT:AD:DP:GQ:PL 1:0,108:108:99:4084,0
bcftools.vcf ==> NC_000913.3 747 . A G 228 . DP=76;VDB=0.987864;SGB=-0.693147;MQSB=1;MQ0F=0;AC=2;AN=2;DP4=0,0,20,56;MQ=59 GT:PL 1/1:255,229,0
2- Why does GATK take much longer time(30-60 mins difference) compared to bcftools? is it normal or someoothing wrong with my usage?
Pipelines I used:
GATK pipeline ==> java -jar ./GATK3/GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I sample.bam -stand_call_conf 30 -stand_emit_conf 10 -o GATK.vcf
bcftools pipeline ==> samtools mpileup -uf reference.fasta sample.bam | ./bcftools/bcftools call -m -v -o bcftools.vcf
Thanks in advance
Yes, Haplotypecaller is quite slow. You could try to use a target file (in case you're not doing genome sequencing, depending on project). use -L targets.bed -ip 50 (ip: intervalpadding).
Well I am working on a bacterial genome that is why dbsnp doesn't work for it. And I read that GATK is not good to be used for bacterial genome. But still, it is strange to me that GATK show more Read Depth (DP) compared to the other tool Bcftools.
I agree with you it is weird that they have different DPs for the same SNP did you use BWA MEM for both alignments or different aligners?
Yes, I used BWA MEM for both alignment. do you feel something wrong? I also updated the post to add the whole two lines of VCF files in case it is more informative. One more things that seems strange, for Diploid the GT has two value looks like this (0/1, 0/0). But for Haploid the GT has one value. So my data is Bacteria which is Haploid and true for GATK which is one value while in Bcftools it is two values! that might tells us something too.