Question

Intersection of a BED genes with multiple VCFs: report only if at least N intersection.

1

Entering edit mode

6.9 years ago

乙 ▴ 210

Hello everyone,

Here's my problem: I have multiple VCF outputs that I have obtained using MuTect2 for 18 patients in total from Whole Exome Sequencing (WES/WXS) (so 18 VCF files). I have a BED file which has the list of all genes. FYI, MuTect2 reports single variants and INDELS in the VCF.

So, if I use bedtools intersect, it report each feature in A that overlaps B, even if its only once. What I want is to do an intersection between the list of genes in BED and all the VCF outputs, for example, at least 5 VCF that reports the same gene in the region. Bedtools describe here:

intersectBed

if I use:

intersectBed -wa -a [BED] -b [all VCFs]

It will report the list of all genes where the variant has been seen at least once, for example:

intersect

In the previous figure, you can see that bedtools (BT) will report gene B but I do not want that. What I want is that bedtools report gene A only if 5 VCF (or more) that have variants/INDELS overlap the same gene.

Is there any way to do it ? If there are other tools that you can suggest also, please let me know. Thank you in advance !

Alaa

WES bedtools VCF BED Intersection • 3.0k views

ADD COMMENT • link updated 6.9 years ago by Alex Reynolds 35k • written 6.9 years ago by 乙 ▴ 210

score 4 · Accepted Answer · 2017-05-15

BEDOPS bedmap and vcf2bed will do what you want:

$ bedmap --count --echo --echo-map-id-uniq --delim '\t' genes.bed <(vcf2bed < snps.vcf) | awk '($1>=5)' | cut -f2- > answer.bed

You can further constrain the VCF conversion step with --snvs, --insertions, and --deletions. See the documentation in vcf2bed --help or online for more detail.

If you have a lot of VCF files and you want to collate them into one file, so that you can map them in one pass, you can union them into a single BED with bedops --everything:

$ bedops --everything <(vcf2bed < A.vcf) <(vcf2bed < B.vcf) ... <(vcf2bed < N.vcf) > unionSNPs.bed

Then you would map against the unioned SNPs file:

$ bedmap --count --echo --echo-map-id-uniq --delim '\t' genes.bed unionSNPs.bed | awk '($1>=5)' | cut -f2- > answer.bed

Either way, putting --count first in the list of options to bedmap prints the number of overlapping SNPs in the first column of the output. The awk step filters on this value, printing any mapped elements that have 5 or more overlapping SNPs. Then the cut step removes the column with the number of overlaps, leaving a proper BED file as output.

The file answer.bed will have the gene and the IDs of any overlapping SNPs. If you want the SNP information in entirety, or their positions, you can use other --echo-map-* options. Or if you don't care about the SNPs, you can leave out --echo-map-id-uniq and just print the gene with --echo. See the documentation in bedmap --help or online for more detail.