VCFtools intersect vs. GATK intersect
5.1 years ago
ciemanek ▴ 140

I need to intersect multiple VCF files. I have tried vcf-isec from VCFtools, and GATK CombineVariants followed by SelectVariants, as follows:

VCFtools:

vcf-isec -f -n =4 input1.vcf.gz input2.vcf.gz input3.vcf.gz input4.vcf.gz > output.vcf

GATK:

java -Xmx2g -jar GenomeAnalysisTK.jar -T CombineVariants -R refSequence.fasta --variant input1.vcf --variant input2.vcf --variant input3.vcf --variant input4.vcf -o output_combined.vcf

java -Xmx2g -jar GenomeAnalysisTK.jar -T SelectVariants -R refSequence.fasta -V:variant output_combined.vcf -select 'set=="Intersection";' -o output_intersected.vcf

However, VCFtools gave me 4464 SNPs common to all files, while GATK gave 4031 SNPs. The VCFtools output contains all SNPs identified by GATK plus 433 additional SNPs.

Where might this difference come from?

snp sequencing gatk vctfools intersect • 3.9k views

Where might this difference come from?

Perhaps a different way of parsing indels? FILTERed variants?

Try printing the variants specific to each set:

comm -3 <(grep -v "^#" output.vcf | cut -f 1,2,4 | sort | uniq) <(grep -v "^#" output_intersected.vcf | cut -f 1,2,4 | sort | uniq)

and then go back to the VCFs to see the differences at those positions.
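If FILTERed variants are the suspect, it may help to carry the ALT and FILTER columns along in the comparison, so that non-PASS records stand out immediately. The following is only a sketch: the function names `vcf_sites` and `site_diff` are made up for illustration, and it assumes uncompressed VCFs with the standard column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO).

```shell
# Hypothetical helper: print CHROM, POS, REF, ALT and FILTER for every
# record of an uncompressed VCF, sorted and de-duplicated.
vcf_sites() {
    grep -v '^#' "$1" | cut -f 1,2,4,5,7 | LC_ALL=C sort -u
}

# Hypothetical helper: print the records found in only one of two VCFs.
# Lines without a leading tab occur only in the first file; lines with a
# leading tab occur only in the second file (standard comm -3 layout).
# The body runs in a subshell so the temporary names stay local.
site_diff() (
    first=$(mktemp) || exit 1
    second=$(mktemp) || exit 1
    vcf_sites "$1" > "$first"
    vcf_sites "$2" > "$second"
    LC_ALL=C comm -3 "$first" "$second"
    rm -f "$first" "$second"
)

# Usage (file names taken from the question):
#   site_diff output.vcf output_intersected.vcf
```

Because FILTER is part of the key, a site that is present in both files but with a different FILTER value shows up once in each column, which is itself a useful hint.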


Thank you for the fast response. I will definitely investigate this. Also, in the file produced by CombineVariants I found SNPs with set=FilteredInAll in the INFO column. Does this mean that GATK performs some additional filtering?
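To see how the combined records are distributed across the different set= values, one option is to tally them. This is a minimal sketch, assuming an uncompressed VCF with INFO in column 8; the function name `count_sets` is made up for illustration. It does not by itself show whether GATK filters anything, only how many records carry each label.

```shell
# Hypothetical helper: count how often each set=... value occurs in the
# INFO column (column 8) of an uncompressed VCF.
# Output: one line per value, "set=<value><TAB><count>", sorted by value.
count_sets() {
    grep -v '^#' "$1" \
        | cut -f 8 \
        | tr ';' '\n' \
        | grep '^set=' \
        | LC_ALL=C sort \
        | uniq -c \
        | awk '{print $2 "\t" $1}'
}

# Usage (file name taken from the question):
#   count_sets output_combined.vcf
```

Comparing the count for set=Intersection with the 4031 SNPs reported above, and the remaining labels with the 433 extra SNPs, may narrow down where the difference comes from.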
