I'm calling variants using three input/reference files (dedup.bam, genome.fa, and chr17.cds.bed) and producing two VCF files: (1) gatk.vcf and (2) freebayes.vcf.
GATK returned more variants (POS) and more dbSNPs. Scanning the files, you quickly notice that the quality scores differ, with a far greater range in QUAL (both very low and very high scores) in the freebayes.vcf file. What factors contribute to the higher variant count in the GATK VCF file compared to freebayes?
Of interest: Variant callers reporting different read depth on same alignment
Thanks! vcftools has a feature called --diff
It creates a neat outfile with the following contents:
I'm trying to find some consistent themes as to the variant calling discrepancies.
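For anyone who wants to look at the discrepancies outside vcftools, here is a minimal Python sketch of a site-level comparison. It is not what vcftools does internally; it just keys each record on (CHROM, POS, REF, ALT) and reports which keys are shared or unique. The function names `read_sites` and `compare_sites` and the toy records are made up for illustration.

```python
def read_sites(vcf_lines):
    """Return the set of (CHROM, POS, REF, ALT) keys found in VCF lines."""
    sites = set()
    for line in vcf_lines:
        # Skip meta lines, the #CHROM header line, and blank lines
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        sites.add((chrom, int(pos), ref, alt))
    return sites


def compare_sites(lines_a, lines_b):
    """Split the records of two call sets into shared and caller-specific."""
    a, b = read_sites(lines_a), read_sites(lines_b)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


# Toy records for illustration only, not real calls
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
gatk = [header,
        "chr17\t100\t.\tA\tG\t50\t.\tDP=10\n",
        "chr17\t250\t.\tC\tT\t60\t.\tDP=12\n"]
freebayes = [header,
             "chr17\t100\t.\tA\tG\t900\t.\tDP=11\n",
             "chr17\t300\t.\tG\tA\t0.5\t.\tDP=3\n"]

result = compare_sites(gatk, freebayes)
print({k: len(v) for k, v in result.items()})
```

With real files, the same functions accept open file handles, e.g. `compare_sites(open("gatk.vcf"), open("freebayes.vcf"))`. Note that exact-key matching treats differently represented variants (for example a multi-base record versus two single-base records) as different sites.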
When examining the tail end of data found in the INFO column, you'll notice a difference between GATK and Freebayes:
Can anyone decipher this information?
Have a look at the header of your VCF files. All these entries should be described there, in the ##INFO lines (and in the ##FORMAT lines for the per-sample fields).
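If it helps, a small Python sketch like the following pulls the tag descriptions out of the header so you can look them up side by side. It assumes the usual `##INFO=<ID=...,Description="...">` layout of the meta lines; the function name `describe_tags` and the toy header (including the description texts) are invented for illustration.

```python
import re

# Matches ##INFO / ##FORMAT meta lines and captures kind, ID and Description
HEADER_RE = re.compile(r'^##(INFO|FORMAT)=<ID=([^,]+),.*Description="([^"]*)"')


def describe_tags(vcf_lines):
    """Map (kind, tag) -> description for every INFO/FORMAT header line."""
    tags = {}
    for line in vcf_lines:
        if not line.startswith("##"):
            break  # meta lines end at the #CHROM column header
        match = HEADER_RE.match(line)
        if match:
            kind, tag, desc = match.groups()
            tags[(kind, tag)] = desc
    return tags


# Toy header for illustration only
toy_header = [
    '##fileformat=VCFv4.2\n',
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth">\n',
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n',
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n',
]

tags = describe_tags(toy_header)
for (kind, tag), desc in sorted(tags.items()):
    print(kind, tag, desc, sep="\t")
```

Running this on both gatk.vcf and freebayes.vcf should make it clear which INFO keys are caller-specific and what each one means.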
How big is the difference of the number of variants between them?
One reason could be that freebayes reports multiple variants that are close together as a single haplotype if they can be assigned to one allele, whereas GATK may report every change separately.
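To make the counts more comparable, one option is to break multi-base records into single-base changes before counting. The following is only a rough Python sketch of that idea, with a hypothetical helper `split_mnp`; it handles equal-length REF/ALT substitutions only and leaves indels, complex alleles, and multi-allelic ALT fields untouched. Dedicated tools such as vcflib's vcfallelicprimitives do this properly.

```python
def split_mnp(chrom, pos, ref, alt):
    """Split an equal-length multi-base substitution into per-base SNPs.

    Records that are already single-base, or whose REF and ALT differ in
    length (indels, complex alleles, comma-separated ALTs), are returned
    unchanged.
    """
    if len(ref) != len(alt) or len(ref) == 1:
        return [(chrom, pos, ref, alt)]
    return [
        (chrom, pos + offset, r, a)
        for offset, (r, a) in enumerate(zip(ref, alt))
        if r != a  # bases that match the reference are not variants
    ]


# Toy record for illustration: one haplotype-style call covering two changes
print(split_mnp("chr17", 100, "ACG", "GCT"))
```

Here the single three-base record becomes two SNPs, at positions 100 and 102, which is how a caller that reports every change separately would count it.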