Question

FreeBayes vs GATK's HaplotypeCaller

3

Entering edit mode

9.5 years ago

Hatem Elshazly ▴ 60

Hello Everyone,

I was trying freebayes on a small bam file and wanted to compare the quality of its output -vcf1- against the quality of the output of gatk halplotypecaller -vcf2- but I found there is a huge difference between both of them, for instance vcf1 was 4.4M and vcf2 was 140M.

When I tried to visualize both files I found that vcf2 has much more variants discovered than vc1.

Does anyone know why there is such difference? Does it have anything to do with gatk using known sites like dbsnp and freebayes don't?

I want to know because I was wondering if you can use FreeBayes instead of GATK in calling variants without affecting quality.

FreeBayes commandline:

freebayes -f human_g1k_v37.fasta out.bam > ~/out.vcf

GATK commandline:

java -jar gatk3.jar \
  -T HaplotypeCaller \
  -R human_g1k_v37.fasta \
  -D dbsnp_vcf.vcf \
  -o out.vcf \
  -pairHMM VECTOR_LOGLESS_CACHING \
  --emitRefConfidence GVCF \
  --variant_index_type LINEAR \
  --variant_index_parameter 128000 \
  -A DepthPerAlleleBySample \
  -stand_call_conf 30 \
  -stand_emit_conf 10

Thanks in advance,
Shazly

freebayes vcf gatk quality variant-caller • 12k views

ADD COMMENT • link updated 3.0 years ago by Ram 45k • written 9.5 years ago by Hatem Elshazly ▴ 60

0

Entering edit mode

140M is number of SNPs?

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by GouthamAtla 12k

0

Entering edit mode

I think file size, based on the following sentence.

ADD REPLY • link 4.3 years ago by jena ▴ 330

3

Entering edit mode

9.5 years ago

DG 7.3k

I believe HaplotypeCaller mostly uses the known SNPs in its genotyping step and not so much variant calling per se. From what I recall from other comparisons Freebayes and Haplotypecaller are pretty concordant with one another, but there will always be differences between the output of variant callers. The majority of these differences will be at low quality sites and for edge-cases. The two callers use similar, but different, algorithms and methods, and the default values for lots of those options will be different from one another.

ADD COMMENT • link 9.5 years ago by DG 7.3k

Ram · Accepted Answer · 2016-01-28

8

Entering edit mode

9.5 years ago

Carlos Borroto ★ 2.1k

I looks like you are assuming every line in vcf2 is a variant. However, you are using --emitRefConfidence GVCF which is asking HaplotypeCaller to make a call on every position, not just variants. If you are following Broad Best Practices, you need to also run GenotypeGVCFs in order to get the actual genotype calls. By the default GenotypeGVCFs will only output variant positions. This is the VCF you should compare to the Freebayes output.

ADD COMMENT • link 9.5 years ago by Carlos Borroto ★ 2.1k

1

Entering edit mode

Just noticed that, I didn't scroll to the right to see the rest of the command line. As a note Freebayes now has the ability to output GVCF if desired. But yes the two outputs from these commands clearly won't be equivalent.

ADD REPLY • link 9.5 years ago by DG 7.3k

0

Entering edit mode

I've just checked the output of GenotypeGVCFs against Freebayes but this time the VCF of GenotypeGVCFs is 981K where as Freebayes is 4.4M.

Is this large gap inherent between them (Freebayes and GenotypeGVCFs) or it will shrink when I tune the options? because I want to know if I can safely replace GATK (HaplotypeCaller and GenotypeGVCFs) with Freebayes.

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by Hatem Elshazly ▴ 60

2

Entering edit mode

FreeBayes is considered to be a quite good variant caller. Which one is better between the two is a matter of debate, and I don't think I've seen any comparisons that led me to believe one is significantly better than the other. Brad Chapman has done quite a number of write-ups around variant calling on various "truth sets" and comparing different results at his blog Blue Collar Bioinformatics.

He has also done a fair amount of work with bcbio (pipeline software) for implementing ensemble variant calling, which tends to outperform using a single variant caller.

You really shouldn't be looking at just the output of the two programs in a haphazard way in order to decide which to do. You need to do some sort of controlled comparison on known data to evaluate whether either of them is missing variants you expect to be called or giving a significant number of false positives. But you also need to be doing that comparison fairly by doing parameter optimization and tuning and doing comparable runs.

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.5 years ago by DG 7.3k

0

Entering edit mode

Great. Thanks!

ADD REPLY • link 9.5 years ago by Hatem Elshazly ▴ 60