How to get intersected variants between multiple vcf files?
Entering edit mode
6 weeks ago
Dana ▴ 20


I have 6 replicates from an experiment, and I run a variant calling algorithm for each of them, which results in a VCF file for each replicate.

I now try to combine the six resulted VCF files into a single VCF file containing only the intersected variants. Meaning, only variants appearing in every replicate should be considered. For that reason, I used the following command:

bcftools isec A.vcf.gz B.vcf.gz C.vcf.gz D.vcf.gz E.vcf.gz F.vcf.gz --nfiles=6 -c all --output-type v > out_file.txt

Which output a single file containing all the variants that overlap in their position and reference among all replicates. Unfortunately, using

-c all

Implicates the variants in the out_file.txt don't necessarily have the same ALT allele. I tried using "none", "some" and "indels", but it doesn't solve my problem, since I have several variants in the same positions that are not considered as the same variant. For example, I have the following variants in the same position and chromosome:

C -> CT in replicates A, B, D
C -> CTTT in replicates C, F
C -> CTTTT in replicate E

I expected to get C->CT in the out_file.txt as it is the overlapping variant but I don't get it for some reason. Should I use a different tool? different parameters?

Thanks in advance!

vcftools intersection gatk isec bcftools • 299 views
Entering edit mode
6 weeks ago

i would use vt to normalize and decompose your vcfs first - that should put all the multiallelic events on different lines

I have a fork of intervene that can produce UpSet plots of vcf intersections. It uses bedtools not bcftools to intersect but I'm not sure that is any more sensitive to alternate allele - it might use position only?


Entering edit mode

Thanks for your reply!

First, all of the VCF files are already decomposed. Second, I tried bedtools intersect before and it is not more sensitive. There is only one factor that determines the reported "intersected variant": which file was given first in the intersection. Meaning, running the command:

intersectBed -a A.vcf -b C.vcf | grep position 

Will result:

chr position . C CT

But running the command:

intersectBed -a C.vcf -b A.vcf | grep position 

Will result:

chr position . C CTTT

But I believe the correct result for the overlapping variant should be C->CT, without considering which file was given first.

BTW, I checked out intervene and it looks awesome!


Login before adding your answer.

Traffic: 849 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6