How to get intersected variants between multiple vcf files?
1
1
Entering edit mode
20 months ago
Dana ▴ 20

Hey!

I have 6 replicates from an experiment, and I run a variant calling algorithm for each of them, which results in a VCF file for each replicate.

I now try to combine the six resulted VCF files into a single VCF file containing only the intersected variants. Meaning, only variants appearing in every replicate should be considered. For that reason, I used the following command:

bcftools isec A.vcf.gz B.vcf.gz C.vcf.gz D.vcf.gz E.vcf.gz F.vcf.gz --nfiles=6 -c all --output-type v > out_file.txt

Which output a single file containing all the variants that overlap in their position and reference among all replicates. Unfortunately, using

-c all

Implicates the variants in the out_file.txt don't necessarily have the same ALT allele. I tried using "none", "some" and "indels", but it doesn't solve my problem, since I have several variants in the same positions that are not considered as the same variant. For example, I have the following variants in the same position and chromosome:

C -> CT in replicates A, B, D
C -> CTTT in replicates C, F
C -> CTTTT in replicate E

I expected to get C->CT in the out_file.txt as it is the overlapping variant but I don't get it for some reason. Should I use a different tool? different parameters?

Thanks in advance!

bcftools vcftools gatk • 1.7k views
ADD COMMENT
1
Entering edit mode
20 months ago

i would use vt to normalize and decompose your vcfs first - that should put all the multiallelic events on different lines

https://genome.sph.umich.edu/wiki/Vt

I have a fork of intervene that can produce UpSet plots of vcf intersections. It uses bedtools not bcftools to intersect but I'm not sure that is any more sensitive to alternate allele - it might use position only?

https://github.com/truwl/intervene

upset

ADD COMMENT
1
Entering edit mode

Thanks for your reply!

First, all of the VCF files are already decomposed. Second, I tried bedtools intersect before and it is not more sensitive. There is only one factor that determines the reported "intersected variant": which file was given first in the intersection. Meaning, running the command:

intersectBed -a A.vcf -b C.vcf | grep position 

Will result:

chr position . C CT

But running the command:

intersectBed -a C.vcf -b A.vcf | grep position 

Will result:

chr position . C CTTT

But I believe the correct result for the overlapping variant should be C->CT, without considering which file was given first.

BTW, I checked out intervene and it looks awesome!

ADD REPLY

Login before adding your answer.

Traffic: 1925 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6