Question: Intersectbed - Overlap Analysis Usign Vcf And Bed Files
0
gravatar for Khader Shameer
6.8 years ago by
Manhattan, NY
Khader Shameer18k wrote:

I am trying to do an overlap analysis between 200 danish exomes (VCF courtsey: Zev) and 10 different gene regions.
I would like to know what percentage overlaps between my region of interest (in mygenes.bed total of 36 lines representing the region) and a VCF file (Danish_*.flt.vcf.gz).

I have tried this command and got result: intersectBed -a Danish1.flt.vcf.gz -b mygenes.bed > D1result.txt

Danish1.flt.vcf.gz: here mygenes.bed: here D1overlapped.txt: here

My assumption is that the output should have lines <= the total number of lines in the mygenes.bed file. But in many instances I am getting more than 36 lines as output. May be am missing something important or may be another tool / option in bedtools can do this task more efficiently. Please let me know your thoughts.

bedtools • 7.4k views
ADD COMMENTlink modified 6.8 years ago by Sukhdeep Singh9.7k • written 6.8 years ago by Khader Shameer18k
5
gravatar for Sukhdeep Singh
6.8 years ago by
Sukhdeep Singh9.7k
Netherlands
Sukhdeep Singh9.7k wrote:

Hi,

First of all, I would remove the header from the files to be intersected, like your first file, Danish1.flt.vcf.gz has a 23 line header.

gunzip -c Danish_1.flt.vcf.gz | sed 1,23d | intersectBed -u -a stdin -b mygenes.bed > output

Second, you are getting 38 lines in the output, because first possibility is that more than two exome co-coordinates are intersected between same gene, you can control that by "-u" and secondly one exome co-ordinates intersects with two or more genes (which is not the case here). So, you are getting 38 because either 38 of your exomes intersect with the genes you have given or few exomes intersect twice with same gene (which might not be the case). You can use parameter "-wb", which will output the match of a in b file, so you have an idea, but you have to drop "-u" in that case.

gunzip -c Danish_1.flt.vcf.gz | sed 1,23d | intersectBed -a stdin -b mygenes.bed -wb > output

If you change -a to mygenes.bed and -b to vcf file, you will get how many genes intersect with your vcf and uniqueness can be again controlled by "-u" parameter.

-u Write original A entry once if any overlaps found in B. In other words, just report the fact at least one overlap was found in B. Restricted by -f and -r.

The manual has few more parameters if you like (specially "-f" - minimum overlap threshold)

I hope it helps.

Sukhi

ADD COMMENTlink written 6.8 years ago by Sukhdeep Singh9.7k

Thanks Sukhdeep for your point on header lines, that's helpful. When I used "intersectBed -a stdin -b mygenes.bed -wb" am getting 38 and with 'intersectBed -a stdin -b mygenes.bed -wb' am getting 12. My idea is to quantify the unique overlap between the mygenes.bed and *.vcfs so am now using -u option. My average overlap is around 34.286, I think this is very low for genes of interest, all of them are very well known genes.

ADD REPLYlink modified 6.8 years ago • written 6.8 years ago by Khader Shameer18k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1208 users visited in the last hour