Question: VCFtools intersect vs. GATK intersect
17 months ago by ciemanek (The Netherlands/Amsterdam):

I need to intersect multiple VCF files. I've been trying vcf-isec from VCFtools, and GATK CombineVariants followed by SelectVariants, as follows:

VCFtools:

vcf-isec -f -n =4 input1.vcf.gz input2.vcf.gz input3.vcf.gz input4.vcf.gz > output.vcf

GATK:

java -Xmx2g -jar GenomeAnalysisTK.jar -T CombineVariants -R refSequence.fasta --variant input1.vcf --variant input2.vcf --variant input3.vcf --variant input4.vcf -o output_combined.vcf

java -Xmx2g -jar GenomeAnalysisTK.jar -T SelectVariants -R refSequence.fasta -V:variant output_combined.vcf -select 'set=="Intersection";' -o output_intersected.vcf

However, VCFtools gave me 4464 SNPs common to all files, whereas GATK gave 4031. The VCFtools result contains all the SNPs identified by GATK plus 433 more.

Where might this difference come from?

modified 17 months ago by Pierre Lindenbaum • written 17 months ago by ciemanek

Where might this difference come from?

A different way of parsing indels? FILTERed variants? ...

Try printing the variants specific to each set:

comm -3 <(grep -v "^#" output.vcf | cut -f 1,2,4 | sort | uniq) <(grep -v "^#" output_intersected.vcf | cut -f 1,2,4 | sort | uniq)

and then go back to the VCF files to see the differences at those positions.
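
A variation on the same idea, as a minimal sketch: list the records found in one VCF but not in the other, keyed on CHROM, POS, REF and ALT, and print the FILTER column alongside so the "FILTERed variants" hypothesis can be checked directly. The function name only_in_first is made up for illustration, and the sketch assumes uncompressed, tab-delimited VCF files.

```shell
# Print CHROM, POS, REF, ALT and FILTER for every record of the first VCF
# whose CHROM/POS/REF/ALT key does not occur in the second VCF.
# Hypothetical helper; assumes uncompressed, tab-delimited VCF input.
only_in_first() {
    awk -F '\t' '
        /^#/ { next }                               # skip header lines
        { key = $1 FS $2 FS $4 FS $5 }              # CHROM, POS, REF, ALT
        FILENAME == ARGV[1] { seen[key] = 1; next } # index the second VCF
        !(key in seen) { print key "\t" $7 }        # report missing records
    ' "$2" "$1"
}
```

For the files in the question this could be called as only_in_first output.vcf output_intersected.vcf, and piping the result through cut -f 5 | sort | uniq -c would tally the FILTER values of the extra sites.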

written 17 months ago by Pierre Lindenbaum

Thank you for the fast response. I will definitely try to investigate this. Also, in the file resulting from CombineVariants I found SNPs with set=FilteredInAll in the INFO column. Does this mean that GATK performs some additional filtering?

written 17 months ago by ciemanek
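
To see how many records carry each set= label in the combined file (Intersection, FilteredInAll, and so on), the INFO column can be tallied. This is only a rough sketch: the function name count_sets is hypothetical, and it assumes an uncompressed, tab-delimited VCF in which the INFO column (field 8) holds a semicolon-separated set=... entry, as in the CombineVariants output described above.

```shell
# Count how often each set=... value occurs in the INFO column of a VCF.
# Hypothetical helper; assumes uncompressed, tab-delimited VCF input.
count_sets() {
    awk -F '\t' '
        /^#/ { next }                          # skip header lines
        {
            n = split($8, fields, ";")         # INFO entries
            for (i = 1; i <= n; i++)
                if (fields[i] ~ /^set=/)
                    counts[substr(fields[i], 5)]++
        }
        END { for (s in counts) print counts[s] "\t" s }
    ' "$1" | sort -k1,1nr -k2,2                # most frequent label first
}
```

Running count_sets output_combined.vcf would then show how many of the combined records are labelled Intersection and how many carry other labels such as FilteredInAll.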