Question: Intersection of a BED genes with multiple VCFs: report only if at least N intersection.
gravatar for badredda
2.6 years ago by
badredda110 wrote:

Hello everyone,

Here's my problem: I have multiple VCF outputs that I have obtained using MuTect2 for 18 patients in total from Whole Exome Sequencing (WES/WXS) (so 18 VCF files). I have a BED file which has the list of all genes. FYI, MuTect2 reports single variants and INDELS in the VCF.

So, if I use bedtools intersect, it report each feature in A that overlaps B, even if its only once. What I want is to do an intersection between the list of genes in BED and all the VCF outputs, for example, at least 5 VCF that reports the same gene in the region. Bedtools describe here:


if I use:

intersectBed -wa -a [BED] -b [all VCFs]

It will report the list of all genes where the variant has been seen at least once, for example:


In the previous figure, you can see that bedtools (BT) will report gene B but I do not want that. What I want is that bedtools report gene A only if 5 VCF (or more) that have variants/INDELS overlap the same gene.

Is there any way to do it ? If there are other tools that you can suggest also, please let me know. Thank you in advance !


intersection bed wes vcf bedtools • 1.2k views
ADD COMMENTlink modified 2.6 years ago by Alex Reynolds29k • written 2.6 years ago by badredda110
gravatar for Alex Reynolds
2.6 years ago by
Alex Reynolds29k
Seattle, WA USA
Alex Reynolds29k wrote:

BEDOPS bedmap and vcf2bed will do what you want:

$ bedmap --count --echo --echo-map-id-uniq --delim '\t' genes.bed <(vcf2bed < snps.vcf) | awk '($1>=5)' | cut -f2- > answer.bed

You can further constrain the VCF conversion step with --snvs, --insertions, and --deletions. See the documentation in vcf2bed --help or online for more detail.

If you have a lot of VCF files and you want to collate them into one file, so that you can map them in one pass, you can union them into a single BED with bedops --everything:

$ bedops --everything <(vcf2bed < A.vcf) <(vcf2bed < B.vcf) ... <(vcf2bed < N.vcf) > unionSNPs.bed

Then you would map against the unioned SNPs file:

$ bedmap --count --echo --echo-map-id-uniq --delim '\t' genes.bed unionSNPs.bed | awk '($1>=5)' | cut -f2- > answer.bed

Either way, putting --count first in the list of options to bedmap prints the number of overlapping SNPs in the first column of the output. The awk step filters on this value, printing any mapped elements that have 5 or more overlapping SNPs. Then the cut step removes the column with the number of overlaps, leaving a proper BED file as output.

The file answer.bed will have the gene and the IDs of any overlapping SNPs. If you want the SNP information in entirety, or their positions, you can use other --echo-map-* options. Or if you don't care about the SNPs, you can leave out --echo-map-id-uniq and just print the gene with --echo. See the documentation in bedmap --help or online for more detail.

ADD COMMENTlink modified 2.6 years ago • written 2.6 years ago by Alex Reynolds29k

@Alex Reynolds

Thanks for the help ! I will try it asap !

ADD REPLYlink written 2.6 years ago by badredda110
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1100 users visited in the last hour