Question: VCFtools intersect vs. GATK intersect
3.0 years ago by ciemanek (The Netherlands/Amsterdam)
ciemanek wrote:

I need to intersect multiple VCF files. I have tried vcf-isec from VCFtools, and GATK CombineVariants followed by SelectVariants, as follows:

VCFtools:

vcf-isec -f -n =4 input1.vcf.gz input2.vcf.gz input3.vcf.gz input4.vcf.gz > output.vcf

GATK:

java -Xmx2g -jar GenomeAnalysisTK.jar -T CombineVariants -R refSequence.fasta --variant input1.vcf --variant input2.vcf --variant input3.vcf --variant input4.vcf -o output_combined.vcf
java -Xmx2g -jar GenomeAnalysisTK.jar -T SelectVariants -R refSequence.fasta -V:variant output_combined.vcf -select 'set=="Intersection";' -o output_intersected.vcf

However, VCFtools gave me 4464 SNPs common to all four files, while GATK gave 4031. The VCFtools result contains every SNP identified by GATK plus 433 additional SNPs.

Where might this difference come from?

modified 3.0 years ago by Pierre Lindenbaum • written 3.0 years ago by ciemanek

Where might this difference come from?

A different way of parsing indels? FILTERed variants? ...

Try printing the variants specific to each set:

comm -3 <(grep -v "^#" output.vcf | cut -f 1,2,4 | sort | uniq) <(grep -v "^#" output_intersected.vcf | cut -f 1,2,4 | sort | uniq)

and then go back to the VCFs to see how the records differ at those positions.
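The comparison above prints only CHROM, POS and REF. As a possible follow-up, the sketch below also keys on ALT and prints the full records that occur in one file but not the other, so the FILTER and INFO columns can be read directly. The function names vcf_keys and vcf_private_records are made up for illustration, and the sketch assumes uncompressed VCF files.

```shell
# Print one sorted, unique CHROM:POS:REF:ALT key per record, skipping header lines.
vcf_keys() {
    grep -v '^#' "$1" | awk -F'\t' '{ print $1 ":" $2 ":" $4 ":" $5 }' | LC_ALL=C sort -u
}

# Print the full records of VCF $1 whose key does not occur in VCF $2.
vcf_private_records() {
    keys_a=$(mktemp) && keys_b=$(mktemp) || return 1
    vcf_keys "$1" > "$keys_a"
    vcf_keys "$2" > "$keys_b"
    # comm -23 keeps the keys found only in the first list; awk then reads
    # those keys from stdin and prints the matching records from $1.
    LC_ALL=C comm -23 "$keys_a" "$keys_b" |
        awk -F'\t' 'NR == FNR { want[$0]; next }
                    !/^#/ && (($1 ":" $2 ":" $4 ":" $5) in want)' - "$1"
    rm -f "$keys_a" "$keys_b"
}
```

For the files in the question, something like vcf_private_records output.vcf output_intersected.vcf | cut -f 1,2,4,5,7 would list the sites reported only by vcf-isec together with their FILTER value.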

written 3.0 years ago by Pierre Lindenbaum

Thank you for the fast response. I will definitely try to investigate this. Also, in the file produced by CombineVariants I found SNPs with set=FilteredInAll in the INFO column. Does this mean that GATK performs some additional filtering?

written 3.0 years ago by ciemanek
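One way to look into the set=FilteredInAll question is to tally the set values in the combined file next to the FILTER column. The sketch below is only a suggestion: the function name vcf_set_counts is hypothetical, and it assumes an uncompressed VCF in which set appears as an ordinary key=value entry in INFO. If the FilteredInAll records already carry a non-PASS FILTER value that is also present in the input files, that would suggest the label reflects filters already in the inputs rather than new filtering.

```shell
# Count records per (set value, FILTER value) pair in a combined VCF.
# Output columns: count, set value, FILTER value.
vcf_set_counts() {
    grep -v '^#' "$1" |
        cut -f 7,8 |
        awk -F'\t' '{
            label = "(no set key)"
            n = split($2, kv, ";")            # INFO entries are separated by ";"
            for (i = 1; i <= n; i++)
                if (kv[i] ~ /^set=/) label = substr(kv[i], 5)
            count[label "\t" $1]++
        }
        END { for (k in count) print count[k] "\t" k }' |
        LC_ALL=C sort -t "$(printf '\t')" -k2,2 -k3,3
}
```

Running vcf_set_counts output_combined.vcf would print one line per combination, which makes it easy to see how many records fall under Intersection and how many under FilteredInAll.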