Question: FreeBayes vs GATK's HaplotypeCaller
gravatar for Hatem Elshazly
4.5 years ago by
Hatem Elshazly60 wrote:

Hello Everyone,

I was trying freebayes on a small bam file and wanted to compare the quality of its output -vcf1- against the quality of the output of gatk halplotypecaller -vcf2- but I found there is a huge difference between both of them, for instance vcf1 was 4.4M and vcf2 was 140M.

When I tried to visualize both files I found that vcf2 has much more variants discovered than vc1.

Does anyone know why there is such difference? Does it have anything to do with gatk using known sites like dbsnp and freebayes don't?

I want to know because I was wondering if you can use FreeBayes instead of GATK in calling variants without affecting quality.

FreeBayes commandline:

freebayes -f human_g1k_v37.fasta out.bam > ~/out.vcf

GATK commandline:

java -jar gatk3.jar -T HaplotypeCaller -R human_g1k_v37.fasta -D dbsnp_vcf.vcf -o out.vcf -pairHMM VECTOR_LOGLESS_CACHING --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -A DepthPerAlleleBySample -stand_call_conf 30 -stand_emit_conf 10


Thanks in advance,




ADD COMMENTlink modified 4.5 years ago by Carlos Borroto1.9k • written 4.5 years ago by Hatem Elshazly60

140M is number of SNPs?

ADD REPLYlink modified 6 months ago by RamRS27k • written 4.5 years ago by geek_y11k
gravatar for Carlos Borroto
4.5 years ago by
Carlos Borroto1.9k
Washington Metropolitan Area
Carlos Borroto1.9k wrote:

I looks like you are assuming every line in vcf2 is a variant. However, you are using --emitRefConfidence GVCF which is asking HaplotypeCaller to make a call on every position, not just variants. If you are following Broad Best Practices, you need to also run GenotypeGVCFs in order to get the actual genotype calls. By the default GenotypeGVCFs will only output variant positions. This is the VCF you should compare to the Freebayes output.

ADD COMMENTlink modified 4.5 years ago • written 4.5 years ago by Carlos Borroto1.9k

Just noticed that, I didn't scroll to the right to see the rest of the command line. As a note Freebayes now has the ability to output GVCF if desired. But yes the two outputs from these commands clearly won't be equivalent.

ADD REPLYlink written 4.5 years ago by DG7.1k

I've just checked the output of GenotypeGVCFs against Freebayes but this time the VCF of GenotypeGVCFs is 981K where as Freebayes is 4.4M.

Is this large gap inherent between them (Freebayes and GenotypeGVCFs) or it will shrink when I tune the options? because I want to know if I can safely replace GATK (HaplotypeCaller and GenotypeGVCFs) with Freebayes.

ADD REPLYlink modified 6 months ago by RamRS27k • written 4.5 years ago by Hatem Elshazly60

FreeBayes is considered to be a quite good variant caller. Which one is better between the two is a matter of debate, and I don't think I've seen any comparisons that led me to believe one is significantly better than the other. Brad Chapman has done quite a number of write-ups around variant calling on various "truth sets" and comparing different results at his blog Blue Collar Bioinformatics.

He has also done a fair amount of work with bcbio (pipeline software) for implementing ensemble variant calling, which tends to outperform using a single variant caller.

You really shouldn't be looking at just the output of the two programs in a haphazard way in order to decide which to do. You need to do some sort of controlled comparison on known data to evaluate whether either of them is missing variants you expect to be called or giving a significant number of false positives. But you also need to be doing that comparison fairly by doing parameter optimization and tuning and doing comparable runs.

ADD REPLYlink modified 6 months ago by RamRS27k • written 4.5 years ago by DG7.1k

Great. Thanks!

ADD REPLYlink written 4.5 years ago by Hatem Elshazly60
gravatar for DG
4.5 years ago by
DG7.1k wrote:

I believe HaplotypeCaller mostly uses the known SNPs in its genotyping step and not so much variant calling per se. From what I recall from other comparisons Freebayes and Haplotypecaller are pretty concordant with one another, but there will always be differences between the output of variant callers. The majority of these differences will be at low quality sites and for edge-cases. The two callers use similar, but different, algorithms and methods, and the default values for lots of those options will be different from one another.

ADD COMMENTlink written 4.5 years ago by DG7.1k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1630 users visited in the last hour