How Can I Compare And Merge Bed Files
11.7 years ago
lyfsa ▴ 30

I have three BED files with chrNo, start, end position and type. I need to compare each chrNo, start and end position in one file with the two other files and write the common regions to a new file. Can anyone suggest how to do this efficiently? I wrote a simple Perl script, but the files are huge and it takes too long to be feasible. Thanks in advance.

Example files:

file1.bed:

1 20 30
1 100 120
1 200 300

file2.bed:

1 2 5
1 25 34
1 200 300

file3.bed:

1 30 33
1 200 300
1 500 600

common.bed:

1 30 34 --> coordinates with a 5 bp overlap are considered the same, but the outermost coordinates of the three are taken in the common file
1 200 300

bed bedtools • 31k views

It would be nice if you changed the tag to something appropriate for your post, like bedtools or mergebed.


Can I also merge overlapping positions, say when the start and end positions are within a range of 0-50 of each other?


Please use comments under answers to ask further questions, rather than posting questions as answers.


Sorry, I overlooked the "merge overlapping" part in your question. I guess Sukhdeep's reply does exactly what you require.


The example files given above are BED files with chrNo, start and end position, with 3 lines in each file. I did not know how to post a separate example box in this post.

11.7 years ago
Arun 2.4k

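Use mergeBed from bedtools like this (recent bedtools versions expect the input to be sorted by chromosome and start, hence the sort step):

cat file1 file2 file3 | sort -k1,1 -k2,2n | mergeBed -i stdin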
11.7 years ago

Don't merge them; you need multiIntersectBed, a tool in the BEDTools suite that finds the common overlap among more than two files.

Check this post for usage examples. I have asked Aaron about specifying a minimum overlap threshold for calling an overlap when using multiIntersectBed. For your second question (which you actually posted as an answer), you can specify the maximum distance between reads/peaks for merging to happen with the -d parameter. From the manual:

Controlling how close two features must be in order to merge (-d)

By default, only overlapping or book-ended features are combined into a new feature. However, one can force mergeBed to combine more distant features with the -d option. For example, were one to set -d to 1000, any features that overlap or are within 1000 base pairs of one another will be combined.

For example:
$ cat A.bed
chr1 100 200
chr1 501 1000

$ mergeBed -i A.bed
chr1 100 200
chr1 501 1000

$ mergeBed -i A.bed -d 1000
chr1 100 1000

Cheers
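
For anyone who wants to see the logic behind the -d option, here is a minimal Python sketch of distance-based merging. It is not bedtools code: the function name merge_intervals and the max_gap parameter are made up for illustration, and the exact boundary handling (a gap of exactly max_gap still merges) is an assumption based on the manual text quoted above.

```python
def merge_intervals(intervals, max_gap=0):
    """Merge (chrom, start, end) tuples that overlap, are book-ended,
    or lie within max_gap bp of each other (hypothetical helper)."""
    merged = []
    # Sorting by chromosome, then start, lets us merge in a single pass.
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start - merged[-1][2] <= max_gap:
            last = merged[-1]
            # Extend the previous feature; keep the outermost end.
            merged[-1] = (chrom, last[1], max(last[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


a_bed = [("chr1", 100, 200), ("chr1", 501, 1000)]
print(merge_intervals(a_bed))                # the two features stay separate
print(merge_intervals(a_bed, max_gap=1000))  # the two features are combined
```

With max_gap=0 only overlapping or book-ended features are combined, which mirrors the default behaviour described in the manual excerpt.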


Hi! I tried to merge with multiIntersectBed, but the result I get is not what I want. I looked at the usage link you posted. There you also suggested this approach:

intersectBed -a 2 -b 3 > 23
intersectBed -a 1 -b 3 > 13
intersectBed -a 1 -b 2 > 12

intersectBed -a 1 -b 23 -f 0.50|sort > 231
intersectBed -a 2 -b 13 -f 0.50|sort > 132
intersectBed -a 3 -b 12 -f 0.50|sort > 12_3

comm -1 -2 231 132 > test
comm -1 -2 test 12_3 > final_result

Will this work if I need the common start and end positions found in all three files, allowing for an overlap of 50bp?


In your question, when you say common between all files, do you mean exactly the same chr, start and end positions in all 3 files? Can you edit your post to include 2 files with an example case? If so, then it's relatively easy to do this using Unix commands...


I have now edited my post with examples. I didn't know how to use the separate box for examples, so my post is not that clear. My files are BED files with chrNo, start and end position, with 3 lines in each file. Hope you will get it :)


This command will give you the overlap between all three files:

multiIntersectBed -i a.bed b.bed c.bed | awk '$4==3'

For the first overlap, however, you should use ChIPpeakAnno; check the maxgap parameter.

Cheers
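
To make the "covered by all three files" idea concrete, here is a rough Python sketch of the same filter. It is a plain sweep over interval start/end events, not how multiIntersectBed is implemented; the function name common_to_all is hypothetical, and it assumes half-open BED coordinates and that intervals within a single file do not overlap each other.

```python
from collections import defaultdict


def common_to_all(files):
    """Return regions covered by an interval from every input file.

    files: list of lists of (chrom, start, end), half-open as in BED.
    Assumes no overlaps between intervals of the same file.
    """
    events = defaultdict(list)
    for intervals in files:
        for chrom, start, end in intervals:
            events[chrom].append((start, 1))   # coverage goes up at start
            events[chrom].append((end, -1))    # and down at end
    result = []
    for chrom in sorted(events):
        depth = 0
        region_start = None
        # At equal positions, ends (-1) sort before starts (+1), so
        # book-ended intervals are not counted as overlapping.
        for pos, delta in sorted(events[chrom]):
            depth += delta
            if depth == len(files) and region_start is None:
                region_start = pos
            elif depth < len(files) and region_start is not None:
                if pos > region_start:
                    result.append((chrom, region_start, pos))
                region_start = None
    return result


# The example files from the question.
file1 = [("1", 20, 30), ("1", 100, 120), ("1", 200, 300)]
file2 = [("1", 2, 5), ("1", 25, 34), ("1", 200, 300)]
file3 = [("1", 30, 33), ("1", 200, 300), ("1", 500, 600)]
print(common_to_all([file1, file2, file3]))
```

On the question's example this reports only 1:200-300, because 20-30 and 30-33 merely touch under half-open coordinates. That is why the first region needs a gap tolerance rather than a strict intersection.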

11.0 years ago
beary.pooh ▴ 10

bedtools v2.17.0 provides multiIntersectBed.

Find it in /bin/.

Just type multiIntersectBed -i [file1] [file2] ...

P.S. -i should be followed by file names; "*.bam" does not work.


*.bed should work, however. BAM is not supported by multiIntersectBed.

11.7 years ago
Sandeep ▴ 260

For people not very comfortable using bedTools or other command-line methods, an alternative would be to use the Galaxy server. Its "Operate on Genomic Intervals" tools will let you merge data sets.

11.2 years ago

You can use BEDOPS and set operations to solve this problem for three generic files. However, it is unclear from your overlap criteria how you are getting the end coordinate of 34. Note that the region chr1:33-34 is not common to all three sample input files, as presented, being found only in file2.bed.

In any case, here is how you can solve this problem:

$ bedops --merge file2.bed | bedmap --echo-map file1.bed - | bedops --intersect - file3.bed
chr1    30  33
chr1    200 300

Let's break down how these three commands work together.

(1) The bedops statement uses the --merge operator to merge elements in file2.bed into contiguous (non-overlapping) regions. This result is piped into the bedmap statement.

(2) The bedmap statement uses the --echo-map operator to report all contiguous regions in the merged file2.bed (the "map" file) which overlap elements in file1.bed (the "reference" file) by one or more bases:

$ bedops --merge file2.bed > file2.merged.bed
$ bedmap --echo-map file1.bed file2.merged.bed
chr1    25  34

chr1    200 300

(The second line is blank, because there is no region in the merged file2.bed which overlaps chr1:100-120 from file1.bed.)

(3) This result is piped into the last bedops statement, which uses the --intersect operator to intersect those two non-empty regions chr1:25-34 and chr1:200-300 with regions in file3.bed.

The final answer consists of bases that are common to file1.bed, file2.bed and file3.bed.
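
As a cross-check, the three steps can be mimicked in a few lines of Python. This is only an illustrative sketch, not BEDOPS code: the helper names merge, echo_map and intersect are invented here to mirror the operators, the loops are quadratic rather than streaming, and the chromosome is written as chr1 to match the output shown above.

```python
def merge(intervals):
    """Step 1 analogue: flatten overlapping or adjoining elements."""
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1] = (chrom, merged[-1][1], max(merged[-1][2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def overlaps(a, b):
    """True if two half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def echo_map(reference, mapped):
    """Step 2 analogue: map elements overlapping each reference element."""
    hits = []
    for ref in reference:
        hits.extend(m for m in mapped if overlaps(ref, m))
    return hits


def intersect(a, b):
    """Step 3 analogue: bases shared by elements of a and b."""
    out = []
    for x in a:
        for y in b:
            if overlaps(x, y):
                out.append((x[0], max(x[1], y[1]), min(x[2], y[2])))
    return merge(out)


file1 = [("chr1", 20, 30), ("chr1", 100, 120), ("chr1", 200, 300)]
file2 = [("chr1", 2, 5), ("chr1", 25, 34), ("chr1", 200, 300)]
file3 = [("chr1", 30, 33), ("chr1", 200, 300), ("chr1", 500, 600)]

mapped = echo_map(file1, merge(file2))
result = intersect(mapped, file3)
print(mapped)
print(result)
```

Note that the mapping step carries the whole file2 region chr1:25-34 forward, not just the part overlapping file1, which is how chr1:30-33 ends up in the result.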

Note: The only assumption these tools make is that all input BED files are sorted. This allows BEDOPS apps to run very fast and with a low memory profile, as compared with alternative toolkits which do not require sorted input (or which have only recently added sorting requirements after publication of BEDOPS). For your example inputs, this is not an issue. For the general case, we provide the sort-bed application to prep the BED inputs, if the sort-states of the input BED files are unknown.
