Why GATK and bcftools SNP calling different?
6
5
Entering edit mode
5.4 years ago
fernardo ▴ 160

Hi all,

I am trying to find SNP/Indel in a dataset using both "GATK" and "bcftools call". I find two big differences for generating VCF files with these tools and I appreciate your help and experience:

1- "GATK" result has higher "DP" value than "bcftools", any clue? e.g.

GATK.vcf ==> NC_000 747 . A G 4054 . AC=1;AF=1.00;AN=1;DP=113;FS=0.000;MLEAC=1;MLEAF=1.00;MQ=59.98;MQ0=0;QD=28.10;SOR=0.769 GT:AD:DP:GQ:PL 1:0,108:108:99:4084,0

bcftools.vcf ==> NC_000913.3 747 . A G 228 . DP=76;VDB=0.987864;SGB=-0.693147;MQSB=1;MQ0F=0;AC=2;AN=2;DP4=0,0,20,56;MQ=59 GT:PL 1/1:255,229,0

2- Why does GATK take much longer time(30-60 mins difference) compared to bcftools? is it normal or someoothing wrong with my usage?

Pipelines I used:

 GATK pipeline ==> java -jar ./GATK3/GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I sample.bam -stand_call_conf 30 -stand_emit_conf 10 -o GATK.vcf

bcftools pipeline ==> samtools mpileup -uf reference.fasta sample.bam | ./bcftools/bcftools call -m -v -o bcftools.vcf


SNP next-gen GATK bcftools variant calling • 5.9k views
0
Entering edit mode

Yes, Haplotypecaller is quite slow. You could try to use a target file (in case you're not doing genome sequencing, depending on project). use -L targets.bed -ip 50 (ip: intervalpadding).

0
Entering edit mode

Well I am working on a bacterial genome that is why dbsnp doesn't work for it. And I read that GATK is not good to be used for bacterial genome. But still, it is strange to me that GATK show more Read Depth (DP) compared to the other tool Bcftools.

0
Entering edit mode

I agree with you it is weird that they have different DPs for the same SNP did you use BWA MEM for both alignments or different aligners?

0
Entering edit mode

Yes, I used BWA MEM for both alignment. do you feel something wrong? I also updated the post to add the whole two lines of VCF files in case it is more informative. One more things that seems strange, for Diploid the GT has two value looks like this (0/1, 0/0). But for Haploid the GT has one value. So my data is Bacteria which is Haploid and true for GATK which is one value while in Bcftools it is two values! that might tells us something too.

4
Entering edit mode
5.4 years ago
kirannbishwa01 ★ 1.3k

HaplotypeCaller from GATK is a haplotype based caller. It takes the aligned *.bam file and then realigns it again while calling the polymorphisms simulataneously at the loci; by default it calls maximum of 6 alternate alleles at any locus and output 2 best ones.

Bcftools isn't a haplotype based caller, it takes the aligned bam file and output polymporhisms without doing any realignment. If you happen to use UnifiedGenotyper from GATK its way faster since its not a haplotype based caller. For the samples I have worked HaplotypeCaller took almost 3 days and UnifiedGenotyper would take on average 30-50 minutes to complete the jobs.

But, polymorphisms data from HaplotypeCaller are considered to be of greater confidence.

Thanks,

1
Entering edit mode

Thanks. Yes HaplotypeCaller takes much longer but based on GATK documentation, this one is a newer version and better to be used. And I think it is also used for both Diploid and non-Diploid organism(data). Which one do you personally use, GATK or Bcftools?

2
Entering edit mode
5.4 years ago
kirannbishwa01 ★ 1.3k

GATK report DP values twice: one under INFO which generally is the filered DP values and one for the FORMAT which generally is the unfiltered DP values for that allele.

You have different DP due the different statisticial methods employed by different variant callers. Also they have different filtering parameters they might have by default. If you ran the variant calling using default parameters, please check what are default filtering parameter between GATK and bcftools to understand your DP values more clearly.

0
Entering edit mode

Thanks again. The answer makes sense. As I also asked above, if one needs to use DP value for a type of research, as we discussed there are differences between DP value of the two tools, so which one to be used the research? e.g. if for a SNP from GATK DP is 125 but from bcftools is 97. Just in case if you have an idea.

0
Entering edit mode

2
Entering edit mode
5.3 years ago
fernardo ▴ 160

Solved!! After researching around this case, shortly, I came up with the solution as follows:

Doing "samtools tview" you can see three types of reads:

1- "." : positive strands.

2- "," : negative strands

GATK ==> sums all three of them including Orphan reads Bcftools ==> sums only 1 and 2. Doesn't count for Orphan reads.

A questions comes up which I will be asking in a different post: - Which one is recommended, removing orphan reads or not?

Note: to remove Orphan reads, samtools and bamtools can be used.

bamtools filter -isProperPair true -in File.bam -out File_no_orphan.bam

0
Entering edit mode

Well, some searching around paid off. My suggestions has always been GATK - due to its good capabilities to detect greater amount of true positives. I don't think you will need to remove orphan reads - the reason being that the other mate pair may not have mapped due to large indels and it can actually happen when you align datasets from other population than the one reference was generated from.

0
Entering edit mode

Yes, right. But I also saw suggestions around that it is recommended to remain only proper pair reads and remove orphans. I have to also think about this to make a decision :)

1
Entering edit mode
5.4 years ago
skbrimer ▴ 650

I think I know why, so I was wondering in my first comment if the aligners where the same. I am going to assume that they are the same aligner and if you did your workflow like this it would explain the difference.

    Clean Reads
|
V
|       |
V       V
1. Samtools   2. GATK
1. bcftools   2.GATK Haplotype


This is because the GATK pipeline realigns its Indels, so you could have less error in those regions and more reads would be remapped back to them, whereas the first mapping is all you work with in the samtools pipeline.

0
Entering edit mode

Thanks. Yes I used the same aligner. In experiment, both results were the same, is it also fine? so it is possible sometimes to have the same DP value and sometimes not? One more important question, based on your experience, if one needs to use DP value for a type of project, as we discussed there are differences between DP value of the two tools, so which one to be used? just in case if you have an idea.

1
Entering edit mode
5.4 years ago
kirannbishwa01 ★ 1.3k

For more true positive data GATK is my preference, actually I have almost always used GATK for calling variants.

Regarding the DP value you will need to do some research by yourself. You will need to check the global coverage, local coverage and the coverage (DP value) you want to use as a cutoff to pull variants from that particular loci.

Hope that helps,

1
Entering edit mode
5.4 years ago
chen ★ 2.1k

Different tools give much difference for low-frequency mutations.