Question

Intersectbed - Overlap Analysis Usign Vcf And Bed Files

0

Entering edit mode

11.8 years ago

Khader Shameer 18k

I am trying to do an overlap analysis between 200 danish exomes (VCF courtsey: Zev) and 10 different gene regions.
I would like to know what percentage overlaps between my region of interest (in mygenes.bed total of 36 lines representing the region) and a VCF file (Danish_*.flt.vcf.gz).

I have tried this command and got result: intersectBed -a Danish1.flt.vcf.gz -b mygenes.bed > D1result.txt

Danish1.flt.vcf.gz: here mygenes.bed: here D1overlapped.txt: here

My assumption is that the output should have lines <= the total number of lines in the mygenes.bed file. But in many instances I am getting more than 36 lines as output. May be am missing something important or may be another tool / option in bedtools can do this task more efficiently. Please let me know your thoughts.

bedtools • 9.2k views

ADD COMMENT • link updated 11.8 years ago by Sukhi Singh 11k • written 11.8 years ago by Khader Shameer 18k

score 5 · Answer 1 · 2012-07-12

Hi,

First of all, I would remove the header from the files to be intersected, like your first file, Danish1.flt.vcf.gz has a 23 line header.

gunzip -c Danish_1.flt.vcf.gz | sed 1,23d | intersectBed -u -a stdin -b mygenes.bed > output

Second, you are getting 38 lines in the output, because first possibility is that more than two exome co-coordinates are intersected between same gene, you can control that by "-u" and secondly one exome co-ordinates intersects with two or more genes (which is not the case here). So, you are getting 38 because either 38 of your exomes intersect with the genes you have given or few exomes intersect twice with same gene (which might not be the case). You can use parameter "-wb", which will output the match of a in b file, so you have an idea, but you have to drop "-u" in that case.

gunzip -c Danish_1.flt.vcf.gz | sed 1,23d | intersectBed -a stdin -b mygenes.bed -wb > output

If you change -a to mygenes.bed and -b to vcf file, you will get how many genes intersect with your vcf and uniqueness can be again controlled by "-u" parameter.

-u Write original A entry once if any overlaps found in B. In other words, just report the fact at least one overlap was found in B. Restricted by -f and -r.

The manual has few more parameters if you like (specially "-f" - minimum overlap threshold)

I hope it helps.

Sukhi