Intersectbed - Overlap Analysis Usign Vcf And Bed Files
1
0
Entering edit mode
11.8 years ago

I am trying to do an overlap analysis between 200 danish exomes (VCF courtsey: Zev) and 10 different gene regions.
I would like to know what percentage overlaps between my region of interest (in mygenes.bed total of 36 lines representing the region) and a VCF file (Danish_*.flt.vcf.gz).

I have tried this command and got result: intersectBed -a Danish1.flt.vcf.gz -b mygenes.bed > D1result.txt

Danish1.flt.vcf.gz: here mygenes.bed: here D1overlapped.txt: here

My assumption is that the output should have lines <= the total number of lines in the mygenes.bed file. But in many instances I am getting more than 36 lines as output. May be am missing something important or may be another tool / option in bedtools can do this task more efficiently. Please let me know your thoughts.

bedtools • 9.2k views
ADD COMMENT
5
Entering edit mode
11.8 years ago

Hi,

First of all, I would remove the header from the files to be intersected, like your first file, Danish1.flt.vcf.gz has a 23 line header.

gunzip -c Danish_1.flt.vcf.gz | sed 1,23d | intersectBed -u -a stdin -b mygenes.bed > output

Second, you are getting 38 lines in the output, because first possibility is that more than two exome co-coordinates are intersected between same gene, you can control that by "-u" and secondly one exome co-ordinates intersects with two or more genes (which is not the case here). So, you are getting 38 because either 38 of your exomes intersect with the genes you have given or few exomes intersect twice with same gene (which might not be the case). You can use parameter "-wb", which will output the match of a in b file, so you have an idea, but you have to drop "-u" in that case.

gunzip -c Danish_1.flt.vcf.gz | sed 1,23d | intersectBed -a stdin -b mygenes.bed -wb > output

If you change -a to mygenes.bed and -b to vcf file, you will get how many genes intersect with your vcf and uniqueness can be again controlled by "-u" parameter.

-u Write original A entry once if any overlaps found in B. In other words, just report the fact at least one overlap was found in B. Restricted by -f and -r.

The manual has few more parameters if you like (specially "-f" - minimum overlap threshold)

I hope it helps.

Sukhi

ADD COMMENT
0
Entering edit mode

Thanks Sukhdeep for your point on header lines, that's helpful. When I used "intersectBed -a stdin -b mygenes.bed -wb" am getting 38 and with 'intersectBed -a stdin -b mygenes.bed -wb' am getting 12. My idea is to quantify the unique overlap between the mygenes.bed and *.vcfs so am now using -u option. My average overlap is around 34.286, I think this is very low for genes of interest, all of them are very well known genes.

ADD REPLY

Login before adding your answer.

Traffic: 2164 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6