Bedtools Compare Multiple Bed Files?
3
21
Entering edit mode
11.1 years ago
Bioscientist ★ 1.7k

I've been dealing with comparison between two bed files using intersectBed -a -b command. I'm just wondering, is there any commands in Bedtools which can help us compare multiple bed files?

Say, I have 3 bed files (A,B,C). I want to identify those regions where any two of the three (AB,BC,AC)overlaps reciprocally 50%.....

thx

edit: Just find this post right now.Maybe I didn't express quite well a couple of months ago. I mean to find those overlappings which spans at least 50% of EACH of the multiple bed files. So I don't quite understand cat AB BC AC > ABC.common Means to find the overlapping part of all the three?

I myself try to solve the problem like below:

intersectBed -a 2 -b 3 > 23
intersectBed -a 1 -b 3 > 13
intersectBed -a 1 -b 2 > 12

intersectBed -a 1 -b 23 -f 0.50|sort > 23_1
intersectBed -a 2 -b 13 -f 0.50|sort > 13_2
intersectBed -a 3 -b 12 -f 0.50|sort > 12_3

comm -1 -2 23_1 13_2 > test
comm -1 -2 test 1_3 > final result


I don't know if I'm on the right track. thx

bedtools intersect • 69k views
53
Entering edit mode
11.1 years ago

Inspired by the limitations of the approaches I mentioned above, I just released a new tool called multiIntersectBed in bedtools version 2.14.3. I realize that this solution doesn't address your request for 50% reciprocal overlap, but I can't yet envision an efficient way to do that other than what has already been proposed.

The basic concept of this approach is that it compares the intervals found in N sorted (-k1,1 -k2,2n for BED) BED/GFF/VCF files and reports whether 0 to N of those files are present at each interval.

An example is likely best to illustrate what the tool does. First, here's a graphical representation:

Now, an example with real BED files and real output.

$cat a.bed chr1 6 12 chr1 10 20 chr1 22 27 chr1 24 30$ cat b.bed
chr1    12    32
chr1    14    30

$cat c.bed chr1 8 15 chr1 10 14 chr1 32 34  In the example below, the first three columns define the interval, the fourth column reports the number of files present at that interval, the fifth column reports a comma-separated list of files present at that interval, and the 6th through 8th columns report whether (1) or not (0) each file is present. The order is the same as on the command line. $ multiIntersectBed -i a.bed b.bed c.bed
chr1    6    8    1    1    1    0    0
chr1    8    12    2    1,3    1    0    1
chr1    12    15    3    1,2,3    1    1    1
chr1    15    20    2    1,2    1    1    0
chr1    20    22    1    2    0    1    0
chr1    22    30    2    1,2    1    1    0
chr1    30    32    1    2    0    1    0
chr1    32    34    1    3    0    0    1


The above example only reports intervals where >=1 file has coverage. We can also get a complete picture of the chrom by using the -empty parameter and by providing a genome (chrom sizes) file:

$multiIntersectBed -i a.bed b.bed c.bed -empty -g genomes/human.hg18.genome chr1 0 6 0 none 0 0 0 chr1 6 8 1 1 1 0 0 chr1 8 12 2 1,3 1 0 1 chr1 12 15 3 1,2,3 1 1 1 chr1 15 20 2 1,2 1 1 0 chr1 20 22 1 2 0 1 0 chr1 22 30 2 1,2 1 1 0 chr1 30 32 1 2 0 1 0 chr1 32 34 1 3 0 0 1 chr1 34 247249719 0 none 0 0 0  We can also get a header: $ multiIntersectBed -i a.bed b.bed c.bed -empty -g genomes/human.hg18.genome -header
chrom    start    end    num    list
chr1    0    6    0    none    0    0    0
chr1    6    8    1    1    1    0    0
chr1    8    12    2    1,3    1    0    1
chr1    12    15    3    1,2,3    1    1    1
chr1    15    20    2    1,2    1    1    0
chr1    20    22    1    2    0    1    0
chr1    22    30    2    1,2    1    1    0
chr1    30    32    1    2    0    1    0
chr1    32    34    1    3    0    0    1
chr1    34    247249719    0    none    0    0    0


And a header with labels. Note that if we use labels, the fourth column reports a list of labels rather than a list of file indices:

$multiIntersectBed -i a.bed b.bed c.bed -header -names A B C chrom start end num list A B C chr1 6 8 1 A 1 0 0 chr1 8 12 2 A,C 1 0 1 chr1 12 15 3 A,B,C 1 1 1 chr1 15 20 2 A,B 1 1 0 chr1 20 22 1 B 0 1 0 chr1 22 30 2 A,B 1 1 0 chr1 30 32 1 B 0 1 0 chr1 32 34 1 C 0 0 1  If you are interested in an easier to follow (yet less efficient) version of the algorithm, I have posted the python prototype I developed with pybedtools. ADD COMMENT 6 Entering edit mode +1. Great example of how questions on biostar are helping stimulate advances in bioinformatics technology. ADD REPLY 1 Entering edit mode agreed. it was quite fun to write. ADD REPLY 0 Entering edit mode Hi Aaron, Is the -f parameter still functional in MultiIntersectBed, if I need a minimum overlap of 50% in all the three files. Thanks ADD REPLY 0 Entering edit mode oh cannot thanks more.. ADD REPLY 0 Entering edit mode Might be a very naive question, Well I have 5 peak files and the sum of peaks is 100000 but when I use multiIntersectBed I get a total number of peaks to be 150000 why such a big difference? Does the script breaks the total regions into some bins and then makes an intersection?, if yes. what is the size of these bins? ADD REPLY 0 Entering edit mode Almost exactly what I'm looking for. Is there a way to also print out all the features from each of the 3 bed files? Thanks. ADD REPLY 0 Entering edit mode -empty ADD REPLY 0 Entering edit mode Is multiinter strand-aware? The output doesn't give any clues whether it is! ADD REPLY 21 Entering edit mode 11.1 years ago One approach with bedtools is the following. intersectBed -a A.bed -b B.bed -f 0.5 -r > AB intersectBed -a B.bed -b C.bed -f 0.5 -r > BC intersectBed -a A.bed -b C.bed -f 0.5 -r > AC cat AB BC AC > ABC.common  You could generalize this with something like: for file1 in ls *.bed do for file2 in ls *.bed do if [$file1 != $file2 ] then intersectBed -a$file1 -b $file2 -f 0.5 -r >$file1.$file2.common fi done done cat *.common > all.common  If you are a python programmer, you could also do the following with pybedtools, the new python extension of bedtools. import pybedtools # set up 3 different bedtools a = pybedtools.BedTool('A.bed') b = pybedtools.BedTool('B.bed') c = pybedtools.BedTool('C.bed') # make the combinations. ab = a.intersect(b, f=0.5, r=True) bc = b.intersect(c, f=0.5, r=True) aa = a.intersect(c, f=0.5, r=True)  Lastly, there is a thread about this on the bedtools mailing list that may be helpful if you plan on using the pybedtools approach. ADD COMMENT 0 Entering edit mode Hi aaron, maybe you can have a look at my edit of the post.thx ADD REPLY 9 Entering edit mode 9.9 years ago sjneph ▴ 690 You might check out the BEDOPS suite of tools, which has the capability to work with any number of BED inputs at once. Consider:  bedops -u file1.bed file2.bed file3.bed \ | bedmap --echo --count --fraction-both 0.5 - \ | awk -F"|" 'int($2) > 1' \
| cut -f1 -d'|' \


This gives to you every input row (from any input file) that overlaps some other input row (from any input file) by at least 50% reciprocally (by using the --fraction-both flag). Now, it's possible that two overlapping elements come from the same file (in the general case, though your input files may not have that sort of thing going on), and you probably do not want that. If that's true, this can also be dealt with easily in BEDOPS with a small amount of additional awk code. Here is a more general solution that deals with removing the problem case mentioned:

  bedops -u file1.bed file2.bed file3.bed \
| bedmap --echo --echo-map-id-uniq --fraction-both 0.5 - \
| awk -F"|" '(split(\$2, a, ";") > 1)' \
| cut -f1 -d'|' \


Note that while the code is concise, this usage is also very efficient. An alternative approach that loops over all-pairwise file comparisons requires on the order of (N^2)/2 system calls (and includes N^2 entire file sweeps) and intermediate results (files) to manage on your way to a final solution. This approach quickly worsens if you change the problem in a simple way as well - what if you require that 3 or more input files overlap? The solution shown with BEDOPS scales just fine. You just change the awk statement to ask '> 2' instead of the '> 1' I show.

However, not all is free with the approach I show either. Each of the files above needs to be sorted - which is fast, but also the 4th column in each of these files needs an id that identifies the file. For example, file1.bed should look something like:

chr1  10   20    1  anything-else-you-like  ...
chr1  200  203   1  ...


where the 4th column identifies that this is file 1 (IDs are easily removed later on with a cut command if you don't like them, though a side benefit is that answer.bed will show from which file each row of output originated from). The programs from BEDOPS work with sorted files only. The upfront cost of sorting data pays increasing dividends the more often you use the file, as underlying algorithms can use less memory and run faster. And, any files produced by BEDOPS are already sorted properly, so that removes the sorting overhead entirely when using them for further analyses.

Importantly, there are really only 2 programs you need to know about in BEDOPS to do the vast majority of all queries related to BED. These are bedops and bedmap, which are both used here. Rather than write new programs to answer every form of informatic question, the suite relies upon standard unix utils to manipulate data in simple ways on the fly just as shown here - literally, a one line awk statement. If we exclude the final cut statement above, the output also includes exactly which files are part of the multi-file intersection, on a per-row basis.