Question: Variant Calls: Freebayes vs GATK
12 months ago by
oars150
oars150 wrote:

I'm calling variants from three input files (dedup.bam, genome.fa, and chr17.cds.bed) and producing two VCF files: (1) gatk.vcf and (2) freebayes.vcf.

GATK returned more variants (distinct POS values) and more dbSNP entries. Scanning the files, you quickly notice that the quality scores differ, with a far greater range in QUAL (both very low and very high scores) in the freebayes.vcf file. What factors contribute to the higher variant count in the GATK VCF compared to the freebayes VCF?

freebayes gatk .vcf • 1.5k views
modified 12 months ago by vdauwera890 • written 12 months ago by oars150

Of interest: Variant callers reporting different read depth on same alignment

written 12 months ago by genomax64k

Thanks! vcftools has a feature called --diff

vcftools --vcf SRR1611183.gatk.vcf --diff SRR1611183.freebayes.vcf --diff-site --out gatk_freebayes.diff

It creates a neat output file with the following contents:

CHROM   POS1    POS2    IN_File  REF1 REF2 ALT1 ALT2
chr17   5036281 5036281 B   G   G   C   C
chr17   5036732 .       1   C   .   A   .
chr17   5036740 5036740 B   C   C   G   G
chr17   5036748 5036748 B   G   G   T   T
chr17   .       5036761 2   .   A   .   C
chr17   .       5036784 2   .   C   .   T

I'm trying to find some consistent themes as to the variant calling discrepancies.
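As a first step toward spotting themes, the IN_File column of the table above can be tallied. The following is only a minimal sketch: it assumes the output is whitespace- or tab-delimited with a header row like the one shown, and it reads B as "site in both files", 1 as "only in the first file (GATK here)", and 2 as "only in the second file (freebayes here)", which is what the sample rows suggest. The function name tally_diff_sites is made up for illustration.

```python
from collections import Counter

def tally_diff_sites(lines):
    """Count rows per IN_FILE category in a vcftools --diff-site style table."""
    # Split on any whitespace so both tab- and space-aligned tables work.
    rows = (line.split() for line in lines if line.strip())
    header = [name.upper() for name in next(rows)]
    col = header.index("IN_FILE")
    return Counter(row[col] for row in rows)

# The six rows quoted in the post, used here as sample input.
sample = """\
CHROM   POS1    POS2    IN_File  REF1 REF2 ALT1 ALT2
chr17   5036281 5036281 B   G   G   C   C
chr17   5036732 .       1   C   .   A   .
chr17   5036740 5036740 B   C   C   G   G
chr17   5036748 5036748 B   G   G   T   T
chr17   .       5036761 2   .   A   .   C
chr17   .       5036784 2   .   C   .   T
"""

counts = tally_diff_sites(sample.splitlines())
print(counts)
```

To run it on the real file, pass an open file handle instead of sample.splitlines(). The sites flagged 1 or 2 are the ones worth inspecting in the original VCFs.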

written 12 months ago by oars150

When examining the last two columns of each record (FORMAT and the sample column), you'll notice a difference between GATK and Freebayes:

GATK
GT:AD:DP:GQ:PL  1/1:0,65:65:99:2535,196,0

Freebayes
GT:DP:RO:QR:AO:QA:GL    1/1:64:0:0:64:2261:-5,-5,0

Can anyone decipher this information?
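One way to read these strings is to pair each FORMAT key with the value at the same position in the sample column. The sketch below does that for the two records quoted above. It also converts the freebayes GL field (log10-scaled genotype likelihoods) to the Phred scale used by PL, using PL = -10 * GL relative to the best genotype, so the two callers can be looked at on the same scale. The key meanings in the comments are the usual ones (AD = allelic depths, RO/AO = reference/alternate observation counts, QR/QA = summed base qualities of those observations); the ##FORMAT lines in each file's own header remain the authoritative description. The helper names are made up for illustration.

```python
def decode_sample(format_field, sample_field):
    """Pair each FORMAT key with the matching value from the sample column."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))

def gl_to_pl(gl_string):
    """Convert log10 genotype likelihoods (GL) to Phred-scaled values (PL).

    Values are normalised so the most likely genotype gets 0.
    """
    gls = [float(x) for x in gl_string.split(",")]
    best = max(gls)
    return [round(-10 * (g - best)) for g in gls]

# The two sample records quoted in the post.
gatk = decode_sample("GT:AD:DP:GQ:PL", "1/1:0,65:65:99:2535,196,0")
fb = decode_sample("GT:DP:RO:QR:AO:QA:GL", "1/1:64:0:0:64:2261:-5,-5,0")

print(gatk)                  # GT, AD (ref,alt depths), DP, GQ, PL
print(fb)                    # GT, DP, RO, QR, AO, QA, GL
print(gl_to_pl(fb["GL"]))    # freebayes likelihoods on the PL scale
```

Both callers agree on a homozygous-alternate genotype with no reference-supporting reads; they differ in which supporting fields they report and in how the likelihoods are scaled.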

written 12 months ago by oars150

Hello,

Have a look at the header of your VCF files. All of these entries should be described in the ##FORMAT lines.
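If you would rather not scroll through the header by hand, something like the following can pull the descriptions out. This is only a sketch: it assumes the usual ##FORMAT=<ID=...,Number=...,Type=...,Description="..."> layout, and the two header lines in the example are illustrative rather than copied from either of your files.

```python
import re

# Capture the ID and the quoted Description of a ##FORMAT header line.
FORMAT_LINE = re.compile(r'^##FORMAT=<ID=([^,]+),.*Description="([^"]*)"')

def format_descriptions(lines):
    """Map each FORMAT key to its description, reading only the header."""
    found = {}
    for line in lines:
        if not line.startswith("##"):
            break  # reached the #CHROM line or the first record
        match = FORMAT_LINE.match(line)
        if match:
            found[match.group(1)] = match.group(2)
    return found

# Illustrative header lines (not taken from the files in this thread).
example_header = [
    "##fileformat=VCFv4.2",
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    '##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE",
]

descriptions = format_descriptions(example_header)
for key, text in descriptions.items():
    print(key, "=", text)
```

For a real file, pass an open file handle (for example open("SRR1611183.gatk.vcf")) instead of the example list.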

fin swimmer

written 12 months ago by finswimmer11k

How big is the difference of the number of variants between them?

One reason can be that freebayes describes multiple variants that are close together as a single haplotype record if they can be assigned to one allele, whereas GATK may report every change separately.
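To illustrate the effect on the counts: a single multi-nucleotide record can correspond to several single-base records. The sketch below splits an equal-length REF/ALT pair into its per-base differences so that positions from the two callers can be compared like for like. The record used is hypothetical, the function name is made up, and indels or complex alleles are deliberately left untouched. Dedicated tools (for example vcflib's vcfallelicprimitives or bcftools norm --atomize) are the more robust route for real data.

```python
def split_mnp(pos, ref, alt):
    """Split an equal-length REF/ALT pair into single-base substitutions.

    Records whose REF and ALT differ in length (indels, complex events)
    are returned unchanged, since they cannot be split base by base.
    """
    if len(ref) != len(alt):
        return [(pos, ref, alt)]
    return [
        (pos + offset, r, a)
        for offset, (r, a) in enumerate(zip(ref, alt))
        if r != a
    ]

# Hypothetical haplotype-style record: one line, two underlying SNPs.
snps = split_mnp(100, "GCA", "TCG")
print(snps)

# An insertion is passed through as-is.
print(split_mnp(200, "A", "AT"))
```

In this toy case one record becomes two sites, so a caller that reports haplotypes would show a lower record count for the same underlying changes.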

Fin swimmer

written 12 months ago by finswimmer11k
12 months ago by
vdauwera890
Cambridge, MA
vdauwera890 wrote:

Keep in mind that the GATK variant callers are designed to be as sensitive as possible and will therefore include many false positives, so you need to apply some filtering steps after calling to remove those false positives, as described in the GATK Best Practices. It's essentially impossible to answer your question without knowing more about how you did the variant calling in both cases, and what kind of filtering and evaluation you did on the results.

It's also important to understand that QUAL scores are calculated differently by different variant callers, so it's tricky to compare them directly. You'll get more insights from evaluating your results relative to known callsets or truth sets.
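As a rough illustration of what evaluating against a truth set means, the sketch below reduces each VCF record to a (CHROM, POS, REF, ALT) key and reports the precision and recall of a callset relative to a known set. Everything in it is an assumption made for illustration: the function names are invented, the records are toy data rather than real calls, and matching on exact keys ignores representation differences that purpose-built comparison tools handle.

```python
def site_keys(lines):
    """Collect (chrom, pos, ref, alt) keys from tab-delimited VCF records."""
    keys = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blanks
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):  # one key per alternate allele
            keys.add((chrom, int(pos), ref, allele))
    return keys

def precision_recall(calls, truth):
    """Return (precision, recall) of a callset against a truth set."""
    true_positives = len(calls & truth)
    precision = true_positives / len(calls) if calls else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

# Toy records, not real calls.
calls_vcf = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr17\t100\t.\tG\tC",
    "chr17\t200\t.\tC\tA,T",
    "chr17\t300\t.\tA\tG",
]
truth_vcf = [
    "chr17\t100\t.\tG\tC",
    "chr17\t200\t.\tC\tT",
    "chr17\t400\t.\tT\tC",
]

precision, recall = precision_recall(site_keys(calls_vcf), site_keys(truth_vcf))
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Running the same comparison for the GATK and freebayes callsets against one truth set gives numbers that can be compared directly, which QUAL values from different callers cannot.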

written 12 months ago by vdauwera890

Many thanks for your reply! Here are the two calling commands; maybe they will be insightful:

$ gatk HaplotypeCaller -I SRR1611183.dedup.bam -O SRR1611183.gatk.vcf -R genome.fa -L chr17.cds.bed

and for freebayes...

$ freebayes -f genome.fa -m 20 -q 10 -t chr17.cds.bed SRR1611183.dedup.bam > SRR1611183.freebayes.vcf

written 11 months ago by oars150