Question: Why GATK and bcftools SNP calling different?
3
gravatar for fernardo
3.1 years ago by
fernardo 120
Italy
fernardo 120 wrote:

Hi all,

I am trying to find SNP/Indel in a dataset using both "GATK" and "bcftools call". I find two big differences for generating VCF files with these tools and I appreciate your help and experience:

1- "GATK" result has higher "DP" value than "bcftools", any clue? e.g.

GATK.vcf ==> NC_000 747 . A G 4054 . AC=1;AF=1.00;AN=1;DP=113;FS=0.000;MLEAC=1;MLEAF=1.00;MQ=59.98;MQ0=0;QD=28.10;SOR=0.769 GT:AD:DP:GQ:PL 1:0,108:108:99:4084,0

bcftools.vcf ==> NC_000913.3 747 . A G 228 . DP=76;VDB=0.987864;SGB=-0.693147;MQSB=1;MQ0F=0;AC=2;AN=2;DP4=0,0,20,56;MQ=59 GT:PL 1/1:255,229,0

2- Why does GATK take much longer time(30-60 mins difference) compared to bcftools? is it normal or someoothing wrong with my usage?

Pipelines I used:

 GATK pipeline ==> java -jar ./GATK3/GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I sample.bam -stand_call_conf 30 -stand_emit_conf 10 -o GATK.vcf 

bcftools pipeline ==> samtools mpileup -uf reference.fasta sample.bam | ./bcftools/bcftools call -m -v -o bcftools.vcf

Thanks in advance

ADD COMMENTlink modified 3.1 years ago • written 3.1 years ago by fernardo 120

Yes, Haplotypecaller is quite slow. You could try to use a target file (in case you're not doing genome sequencing, depending on project). use -L targets.bed -ip 50 (ip: intervalpadding).

ADD REPLYlink written 3.1 years ago by WouterDeCoster38k

Well I am working on a bacterial genome that is why dbsnp doesn't work for it. And I read that GATK is not good to be used for bacterial genome. But still, it is strange to me that GATK show more Read Depth (DP) compared to the other tool Bcftools.

ADD REPLYlink written 3.1 years ago by fernardo 120

I agree with you it is weird that they have different DPs for the same SNP did you use BWA MEM for both alignments or different aligners?

ADD REPLYlink written 3.1 years ago by skbrimer530

Yes, I used BWA MEM for both alignment. do you feel something wrong? I also updated the post to add the whole two lines of VCF files in case it is more informative. One more things that seems strange, for Diploid the GT has two value looks like this (0/1, 0/0). But for Haploid the GT has one value. So my data is Bacteria which is Haploid and true for GATK which is one value while in Bcftools it is two values! that might tells us something too.

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by fernardo 120
3
gravatar for kirannbishwa01
3.1 years ago by
United States
kirannbishwa01980 wrote:

HaplotypeCaller from GATK is a haplotype based caller. It takes the aligned *.bam file and then realigns it again while calling the polymorphisms simulataneously at the loci; by default it calls maximum of 6 alternate alleles at any locus and output 2 best ones.

Bcftools isn't a haplotype based caller, it takes the aligned bam file and output polymporhisms without doing any realignment. If you happen to use UnifiedGenotyper from GATK its way faster since its not a haplotype based caller. For the samples I have worked HaplotypeCaller took almost 3 days and UnifiedGenotyper would take on average 30-50 minutes to complete the jobs.

But, polymorphisms data from HaplotypeCaller are considered to be of greater confidence.

Thanks,

ADD COMMENTlink written 3.1 years ago by kirannbishwa01980
1

Thanks. Yes HaplotypeCaller takes much longer but based on GATK documentation, this one is a newer version and better to be used. And I think it is also used for both Diploid and non-Diploid organism(data). Which one do you personally use, GATK or Bcftools?

ADD REPLYlink written 3.1 years ago by fernardo 120
2
gravatar for kirannbishwa01
3.1 years ago by
United States
kirannbishwa01980 wrote:

GATK report DP values twice: one under INFO which generally is the filered DP values and one for the FORMAT which generally is the unfiltered DP values for that allele.

You have different DP due the different statisticial methods employed by different variant callers. Also they have different filtering parameters they might have by default. If you ran the variant calling using default parameters, please check what are default filtering parameter between GATK and bcftools to understand your DP values more clearly.

ADD COMMENTlink written 3.1 years ago by kirannbishwa01980

Thanks again. The answer makes sense. As I also asked above, if one needs to use DP value for a type of research, as we discussed there are differences between DP value of the two tools, so which one to be used the research? e.g. if for a SNP from GATK DP is 125 but from bcftools is 97. Just in case if you have an idea.

ADD REPLYlink written 3.1 years ago by fernardo 120

Click the "Add reply" button to reply to the comment, instead of making a new answer each time.

ADD REPLYlink written 3.1 years ago by andrew.j.skelton735.6k
2
gravatar for fernardo
3.1 years ago by
fernardo 120
Italy
fernardo 120 wrote:

Solved!! After researching around this case, shortly, I came up with the solution as follows:

Doing "samtools tview" you can see three types of reads:

1- "." : positive strands.

2- "," : negative strands

3- "underlined reads" : Orphan reads (means the other mate-read not aligned).

GATK ==> sums all three of them including Orphan reads Bcftools ==> sums only 1 and 2. Doesn't count for Orphan reads.

A questions comes up which I will be asking in a different post: - Which one is recommended, removing orphan reads or not?

Note: to remove Orphan reads, samtools and bamtools can be used.

bamtools filter -isProperPair true -in File.bam -out File_no_orphan.bam

ADD COMMENTlink written 3.1 years ago by fernardo 120

Well, some searching around paid off. My suggestions has always been GATK - due to its good capabilities to detect greater amount of true positives. I don't think you will need to remove orphan reads - the reason being that the other mate pair may not have mapped due to large indels and it can actually happen when you align datasets from other population than the one reference was generated from.

ADD REPLYlink written 3.1 years ago by kirannbishwa01980

Yes, right. But I also saw suggestions around that it is recommended to remain only proper pair reads and remove orphans. I have to also think about this to make a decision :)

ADD REPLYlink written 3.1 years ago by fernardo 120
1
gravatar for skbrimer
3.1 years ago by
skbrimer530
United States
skbrimer530 wrote:

I think I know why, so I was wondering in my first comment if the aligners where the same. I am going to assume that they are the same aligner and if you did your workflow like this it would explain the difference.

    Clean Reads
         |
         V
     Map Reads
     |       |
     V       V
1. Samtools   2. GATK
1. bcftools   2.GATK Haplotype

This is because the GATK pipeline realigns its Indels, so you could have less error in those regions and more reads would be remapped back to them, whereas the first mapping is all you work with in the samtools pipeline.

ADD COMMENTlink written 3.1 years ago by skbrimer530

Thanks. Yes I used the same aligner. In experiment, both results were the same, is it also fine? so it is possible sometimes to have the same DP value and sometimes not? One more important question, based on your experience, if one needs to use DP value for a type of project, as we discussed there are differences between DP value of the two tools, so which one to be used? just in case if you have an idea.

ADD REPLYlink written 3.1 years ago by fernardo 120
1
gravatar for kirannbishwa01
3.1 years ago by
United States
kirannbishwa01980 wrote:

For more true positive data GATK is my preference, actually I have almost always used GATK for calling variants.

Regarding the DP value you will need to do some research by yourself. You will need to check the global coverage, local coverage and the coverage (DP value) you want to use as a cutoff to pull variants from that particular loci.

Hope that helps,

ADD COMMENTlink written 3.1 years ago by kirannbishwa01980
1
gravatar for chen
3.1 years ago by
chen1.8k
OpenGene
chen1.8k wrote:

Different tools give much difference for low-frequency mutations.

ADD COMMENTlink written 3.1 years ago by chen1.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1600 users visited in the last hour