Variant Calls: Freebayes vs GATK
6.1 years ago
oars ▴ 200

I'm calling variants from three input/reference files (dedup.bam, genome.fa, and chr17.cds.bed) and producing two VCF files: (1) gatk.vcf and (2) freebayes.vcf.

GATK returned more variants (POS) and more dbSNPs. Scanning the files, you quickly notice that the quality scores differ between the two, with a far greater range in QUAL (both very low and very high scores) in the freebayes.vcf file. What factors contribute to the higher variant count in the GATK VCF compared to the freebayes one?

vcf freebayes gatk • 9.2k views

Thanks! vcftools has a feature called --diff

vcftools --vcf SRR1611183.gatk.vcf --diff SRR1611183.freebayes.vcf --diff-site --out gatk_freebayes.diff

It creates a neat output file with the following contents:

CHROM   POS1    POS2    IN_File  REF1 REF2 ALT1 ALT2
chr17   5036281 5036281 B   G   G   C   C
chr17   5036732 .       1   C   .   A   .
chr17   5036740 5036740 B   C   C   G   G
chr17   5036748 5036748 B   G   G   T   T
chr17   .       5036761 2   .   A   .   C
chr17   .       5036784 2   .   C   .   T

I'm trying to find some consistent themes as to the variant calling discrepancies.
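To look for patterns, it may help to tally the sites by the IN_File column first. Below is a minimal Python sketch that does this on the rows pasted above. It assumes that B means the site is in both files, 1 means only the first file (GATK here) and 2 means only the second (freebayes), and that the columns are whitespace-separated; `summarize_diff` is just a name I made up, not part of vcftools.

```python
from collections import Counter

# Rows copied from the vcftools output above; a real run would read the
# file that vcftools wrote instead of this string.
SAMPLE = """\
CHROM   POS1    POS2    IN_File  REF1 REF2 ALT1 ALT2
chr17   5036281 5036281 B   G   G   C   C
chr17   5036732 .       1   C   .   A   .
chr17   5036740 5036740 B   C   C   G   G
chr17   5036748 5036748 B   G   G   T   T
chr17   .       5036761 2   .   A   .   C
chr17   .       5036784 2   .   C   .   T
"""


def summarize_diff(lines):
    """Count sites per IN_File flag and list positions unique to each file."""
    rows = (line.split() for line in lines if line.strip())
    # Upper-case the header so "IN_File" and "IN_FILE" are treated the same.
    header = [name.upper() for name in next(rows)]
    counts = Counter()
    unique = {"1": [], "2": []}
    for fields in rows:
        row = dict(zip(header, fields))
        flag = row["IN_FILE"]
        counts[flag] += 1
        if flag == "1":      # assumed: present only in the first VCF
            unique["1"].append(int(row["POS1"]))
        elif flag == "2":    # assumed: present only in the second VCF
            unique["2"].append(int(row["POS2"]))
    return counts, unique


counts, unique = summarize_diff(SAMPLE.splitlines())
print(dict(counts))
print(unique)
```

The positions in `unique` are the ones worth pulling up in both VCFs (and in the BAM) to see what each caller did there.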


When examining the tail end of each record (the FORMAT and sample columns that follow INFO), you'll notice a difference between GATK and Freebayes:

GATK
GT:AD:DP:GQ:PL  1/1:0,65:65:99:2535,196,0

Freebayes
GT:DP:RO:QR:AO:QA:GL    1/1:64:0:0:64:2261:-5,-5,0

Can anyone decipher this information?


Hello,

Have a look at the header of your VCF files. All of these entries should be described in the ##FORMAT lines.
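If you want to do this programmatically, here is a rough Python sketch that pulls the ID and Description out of the ##FORMAT header lines and pairs them with the values of one sample. The header lines in `HEADER` are written from memory for illustration only, so the wording in your own files may differ; `format_descriptions` and `annotate` are hypothetical helper names.

```python
import re

# Illustrative header lines (assumed wording, check your own VCF header).
HEADER = [
    '##fileformat=VCFv4.2',
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    '##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">',
    '##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">',
    '##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype quality">',
]

FORMAT_RE = re.compile(r'^##FORMAT=<ID=([^,]+),.*Description="([^"]*)"')


def format_descriptions(header_lines):
    """Map each FORMAT ID to the description given in the header."""
    found = {}
    for line in header_lines:
        match = FORMAT_RE.match(line)
        if match:
            found[match.group(1)] = match.group(2)
    return found


def annotate(format_field, sample_field, descriptions):
    """Pair each FORMAT key with the sample value and its description."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    return [
        (key, value, descriptions.get(key, "not described in header"))
        for key, value in zip(keys, values)
    ]


descriptions = format_descriptions(HEADER)
# FORMAT and sample strings taken from the GATK record quoted above.
for key, value, text in annotate(
    "GT:AD:DP:GQ:PL", "1/1:0,65:65:99:2535,196,0", descriptions
):
    print(f"{key}\t{value}\t{text}")
```

PL is deliberately missing from the toy header, so it comes back as "not described in header"; with a real header every key should resolve.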

fin swimmer


How big is the difference of the number of variants between them?

One reason can be that freebayes reports multiple variants that are close together as a single haplotype if they can be assigned to one allele, whereas GATK may report every change separately.
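To compare site counts fairly, such a multi-base record has to be split into single-base changes first. The Python sketch below shows the idea for the simple case where REF and ALT have the same length; the position and alleles are made up, `decompose` is a hypothetical name, and records with indels are passed through unchanged. Dedicated tools such as vcflib's vcfallelicprimitives handle the general case.

```python
def decompose(pos, ref, alt):
    """Split an equal-length multi-base substitution into per-base SNPs.

    Records whose REF and ALT differ in length (indels, complex events)
    are returned unchanged, since splitting them needs a real alignment.
    """
    if len(ref) != len(alt):
        return [(pos, ref, alt)]
    return [
        (pos + offset, ref_base, alt_base)
        for offset, (ref_base, alt_base) in enumerate(zip(ref, alt))
        if ref_base != alt_base
    ]


# Made-up example: one haplotype-style record covering three bases,
# of which the first and third differ from the reference.
print(decompose(100, "GCA", "TCG"))
```

So one record in this style can correspond to two or more records from a caller that reports each base change on its own line.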

Fin swimmer

6.1 years ago
vdauwera ★ 1.2k

Keep in mind that the GATK variant callers are designed to be as sensitive as possible and will therefore include many false positives, so you need to apply some filtering steps after calling to remove those false positives, as described in the GATK Best Practices. It's essentially impossible to answer your question without knowing more about how you did the variant calling in both cases, and what kind of filtering and evaluation you did on the results.

It's also important to understand that QUAL scores are calculated differently by different variant callers, so it's tricky to compare them directly. You'll get more insights from evaluating your results relative to known callsets or truth sets.
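As a rough illustration of what evaluating against a truth set means, here is a small Python sketch that keys each record on (CHROM, POS, REF, ALT) and computes precision and recall by exact match. The records are toy data invented for the example, `site_keys` and `precision_recall` are hypothetical names, and exact matching ignores differences in variant representation, so treat it as a first look rather than a proper benchmark.

```python
def site_keys(vcf_lines):
    """Collect (CHROM, POS, REF, ALT) keys from tab-separated VCF records."""
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        for allele in alt.split(","):  # one key per ALT allele
            keys.add((chrom, int(pos), ref, allele))
    return keys


def precision_recall(calls, truth):
    """Exact-match precision and recall of a call set against a truth set."""
    true_positives = len(calls & truth)
    precision = true_positives / len(calls) if calls else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Toy records, invented for illustration only.
CALLS = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t100\t.\tA\tG",
    "chr1\t200\t.\tC\tT",
    "chr1\t300\t.\tG\tA",
    "chr1\t400\t.\tT\tC",
]
TRUTH = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t100\t.\tA\tG",
    "chr1\t200\t.\tC\tT",
    "chr1\t500\t.\tG\tC",
]

precision, recall = precision_recall(site_keys(CALLS), site_keys(TRUTH))
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Running the same comparison for each caller against the same truth set gives numbers that can be compared directly, which the raw QUAL values cannot.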


Many thanks for your reply! Here are the two commands I ran, in case they are helpful:

$ gatk HaplotypeCaller -I SRR1611183.dedup.bam -O SRR1611183.gatk.vcf -R genome.fa -L chr17.cds.bed

and for freebayes...

$ freebayes -f genome.fa -m 20 -q 10 -t chr17.cds.bed SRR1611183.dedup.bam > SRR1611183.freebayes.vcf