Question: Variant Calls: Freebayes vs GATK
20 months ago, oars wrote:

I'm calling variants from three input/reference files (dedup.bam, genome.fa, and chr17.cds.bed) and producing two VCF files: (1) gatk.vcf and (2) freebayes.vcf.

GATK returned more variants (by POS) and more dbSNP sites. Scanning the files, you quickly notice that the quality scores differ between the two, with a far greater range in QUAL (both very low and very high scores) in the freebayes.vcf file. What factors contribute to the higher variant count in the GATK VCF file compared to freebayes?

Modified 20 months ago by vdauwera • written 20 months ago by oars

Of interest: the thread "Variant callers reporting different read depth on same alignment"

Written 20 months ago by genomax

Thanks! vcftools has an option called --diff that compares two VCF files:

vcftools --vcf SRR1611183.gatk.vcf --diff SRR1611183.freebayes.vcf --diff-site --out gatk_freebayes.diff

It creates a neat output file with the following contents:

CHROM  POS1     POS2     IN_File  REF1  REF2  ALT1  ALT2
chr17  5036281  5036281  B        G     G     C     C
chr17  5036732  .        1        C     .     A     .
chr17  5036740  5036740  B        C     C     G     G
chr17  5036748  5036748  B        G     G     T     T
chr17  .        5036761  2        .     A     .     C
chr17  .        5036784  2        .     C     .     T

I'm trying to find some consistent themes as to the variant calling discrepancies.
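One way to start looking for themes is to count how many sites fall into each IN_File category. The following is only a minimal sketch: it assumes the table above is whitespace-separated with the category in the fourth column, and that B, 1 and 2 mean "both files", "first file only" (GATK here) and "second file only" (freebayes here). The function name `summarize_diff_sites` is made up for illustration.

```python
import io
from collections import Counter


def summarize_diff_sites(handle):
    """Count rows per IN_File category in a vcftools --diff-site style table."""
    counts = Counter()
    header_seen = False
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if not header_seen:
            header_seen = True  # first non-empty line holds the column names
            continue
        counts[fields[3]] += 1  # fourth column is the IN_File category
    return counts


# The six rows quoted above, used as a small in-memory example.
example = io.StringIO(
    "CHROM POS1 POS2 IN_File REF1 REF2 ALT1 ALT2\n"
    "chr17 5036281 5036281 B G G C C\n"
    "chr17 5036732 . 1 C . A .\n"
    "chr17 5036740 5036740 B C C G G\n"
    "chr17 5036748 5036748 B G G T T\n"
    "chr17 . 5036761 2 . A . C\n"
    "chr17 . 5036784 2 . C . T\n"
)
counts = summarize_diff_sites(example)
for category, n in sorted(counts.items()):
    print(category, n)
```

On the real output you would pass an open file handle instead of the in-memory example. The counts alone do not explain the discrepancies, but they show which direction most of the disagreement goes.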

Written 20 months ago by oars

When examining the last two columns of each record (the FORMAT column and the sample column that follow INFO), you'll notice a difference between GATK and Freebayes:

GT:AD:DP:GQ:PL  1/1:0,65:65:99:2535,196,0

GT:DP:RO:QR:AO:QA:GL    1/1:64:0:0:64:2261:-5,-5,0

Can anyone decipher this information?
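As a rough aid for reading these, the FORMAT keys and the sample values are both colon-separated and line up position by position. The sketch below pairs them up for the two records quoted above. The short descriptions are paraphrased from the kind of ##FORMAT header lines these callers typically write, so treat them as assumptions and check them against the headers of your own files; `decode_sample` is a hypothetical helper name.

```python
# Paraphrased meanings of the keys; the authoritative text is in each VCF header.
FORMAT_DESCRIPTIONS = {
    "GT": "genotype",
    "AD": "allelic depths for the ref and alt alleles",
    "DP": "read depth",
    "GQ": "genotype quality",
    "PL": "Phred-scaled genotype likelihoods",
    "RO": "reference allele observation count",
    "QR": "sum of base qualities of reference observations",
    "AO": "alternate allele observation count",
    "QA": "sum of base qualities of alternate observations",
    "GL": "log10-scaled genotype likelihoods",
}


def decode_sample(format_field, sample_field):
    """Pair each FORMAT key with the value at the same position."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    return dict(zip(keys, values))


gatk = decode_sample("GT:AD:DP:GQ:PL", "1/1:0,65:65:99:2535,196,0")
fb = decode_sample("GT:DP:RO:QR:AO:QA:GL", "1/1:64:0:0:64:2261:-5,-5,0")

for caller, sample in (("GATK", gatk), ("freebayes", fb)):
    print(caller)
    for key, value in sample.items():
        print(f"  {key}={value}  ({FORMAT_DESCRIPTIONS.get(key, 'see header')})")
```

Read this way, both callers agree on a homozygous-alternate genotype (1/1) with no reference-supporting reads: GATK reports AD=0,65 and freebayes reports RO=0, AO=64.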

Written 20 months ago by oars


Have a look at the header of your VCF files. All these entries should be described in the ##FORMAT lines.

fin swimmer

Written 20 months ago by finswimmer

How big is the difference of the number of variants between them?

One reason can be that freebayes describes multiple variants that are close together as one haplotype if they can be assigned to one allele, whereas GATK may report every change separately.

Fin swimmer
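To illustrate the representation point, here is a minimal sketch of how a single multi-base record could be broken into per-base SNPs before comparing site counts. It is not how either caller works internally; the function name `split_mnp` and the coordinates in the example are made up. It only handles REF and ALT of equal length, and it discards the information that the changes sit on the same haplotype.

```python
def split_mnp(chrom, pos, ref, alt):
    """Split an equal-length multi-base substitution into per-base SNPs.

    Records where REF and ALT differ in length (indels, complex events)
    are returned unchanged, since they need proper normalization.
    """
    if len(ref) != len(alt):
        return [(chrom, pos, ref, alt)]
    return [
        (chrom, pos + offset, r, a)
        for offset, (r, a) in enumerate(zip(ref, alt))
        if r != a  # keep only the positions that actually differ
    ]


# Hypothetical record: one haplotype call covering three bases, two of which change.
snps = split_mnp("chr17", 100, "ACG", "TCA")
for record in snps:
    print(record)
```

One record from one caller can therefore correspond to two or more records from the other, which inflates the apparent difference in counts. On real data a dedicated normalization tool would be a safer choice than this sketch.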

Written 20 months ago by finswimmer
20 months ago, vdauwera (Cambridge, MA) wrote:

Keep in mind that the GATK variant callers are designed to be as sensitive as possible and will therefore include many false positives. You need to apply filtering steps after calling to remove them, as described in the GATK Best Practices. It's essentially impossible to answer your question without knowing more about how you did the variant calling in both cases, and what kind of filtering and evaluation you did on the results.

It's also important to understand that QUAL scores are calculated differently by different variant callers, so it's tricky to compare them directly. You'll get more insights from evaluating your results relative to known callsets or truth sets.
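A minimal sketch of that kind of evaluation, assuming you have a truth set as VCF-style records for the same region: treat each call as a (CHROM, POS, REF, ALT) tuple and count overlaps. The helper names and the toy records below are invented for illustration, and exact tuple matching is a simplification, because the same variant can be written in more than one way.

```python
def load_sites(lines):
    """Collect (chrom, pos, ref, alt) tuples from tab-separated VCF records."""
    sites = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):  # one tuple per alternate allele
            sites.add((chrom, int(pos), ref, allele))
    return sites


def compare_to_truth(calls, truth):
    """Return true/false positive counts plus precision and recall."""
    tp = len(calls & truth)
    fp = len(calls - truth)
    fn = len(truth - calls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}


# Toy records, not real data.
truth = load_sites([
    "chr17\t100\t.\tA\tG",
    "chr17\t200\t.\tC\tT",
    "chr17\t300\t.\tG\tA",
])
calls = load_sites([
    "##fileformat=VCFv4.2",
    "chr17\t100\t.\tA\tG",
    "chr17\t200\t.\tC\tT",
    "chr17\t400\t.\tT\tC",
    "chr17\t500\t.\tT\tC",
])
result = compare_to_truth(calls, truth)
print(result)
```

Running the same comparison for each caller's output gives numbers that can be compared directly, which the raw QUAL values cannot.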

Written 20 months ago by vdauwera

Many thanks for your reply! Here are the two commands I used; maybe they will be insightful:

$ GATK HaplotypeCaller -I SRR1611183.dedup.bam -O SRR1611183.gatk.vcf -R genome.fa -L chr17.cds.bed

and for freebayes...

$ freebayes -f genome.fa -m 20 -q 10 -t chr17.cds.bed SRR1611183.dedup.bam > SRR1611183.freebayes.vcf

Written 20 months ago by oars