Question: multiple bed - merge regions IF overlapping more than xx percent of size
gravatar for samuelcollombet
3.7 years ago by
samuelcollombet10 wrote:


I have a couple of bed files with regions (peaks of ChIPseq), and I would like to create a unique file with merge regions, where overlapping regions are fused IF the overlap cover more than xx % (in my case 60%) of one of the 2 regions (not necessary both).

I know how to do that with 2 files, however I have more (at least 4). I could do recursive and pipes to do it pairwise multiple times (with bedops or bedtools), but I was wondering if there would not be a cleaner way. 
I can use bedops to merge everything that overlap by 1bp for my bed files at once, but I do not find a way to control on the overlap fraction...

Any idea?

ADD COMMENTlink modified 3.7 years ago by Alex Reynolds27k • written 3.7 years ago by samuelcollombet10
gravatar for Alex Reynolds
3.7 years ago by
Alex Reynolds27k
Seattle, WA USA
Alex Reynolds27k wrote:

Let's look at the following BEDOPS-based solution, which is a modification of the approach authored here by Shane Neph:

$ bedops --everything file1.bed file2.bed ... fileN.bed \
    | bedmap --echo-map --fraction-both 0.6 - \
    | awk '(split($0, a, ";") > 1)' - \
    | sed 's/\;/\n/g' - \
    | sort-bed - \
    | uniq - \
    > answer.bed

It might look a bit overwhelming, but it is really quite straightforward if we walk through this line by line.

  1. Take the multiset union of all elements from all N input files with bedops --everything. Note that this is not the same as a --merge operation, which instead makes contiguous regions from all overlapping input regions. Pipe this unioned set to bedmap.
  2. For each element from all files, use bedmap --echo-map to print any other elements from the multiset union that overlap that first element by 60% or more, using --fraction-both 0.6. The result will be a semi-colon delimited string that contains one or more overlapping elements. Note that this result will always include the original element, because it overlaps itself by 100%. Pipe this result to the awk statement.
  3. The awk command splits the string on the semi-colon delimiter and prints out the string if there are two or more overlapping elements. We do this filter step, because we are not interested in any element overlapping itself and only itself, but only multiple (2+) elements that meet the 60% or greater overlap criterion. We pipe this string to sed.
  4. The sed statement replaces all semi-colons with newline characters. This creates BED-formatted output that we pipe to sort-bed.
  5. The sort-bed tool takes the overlapping elements and puts them in lexicographically-sorted order. We pipe this to uniq to remove duplicate elements.

The file answer.bed is a sorted BED file that contains all unique elements from file1.bed through fileN.bed, where there is mutual 60% or greater overlap.

Note that this is a highly efficient solution, compared with O(2^N) pairwise comparisons required by other approaches. Further, we make use of Unix standard input and output streaming (note the hyphen at the end of each command, denoting standard input) to avoid the cost of making unnecessary intermediate files; writing to disk is expensive. The only thing written to disk is the final result.

ADD COMMENTlink modified 3.7 years ago • written 3.7 years ago by Alex Reynolds27k

I actually did something like that previously, but:
here I will get all the regions that overlap an other region by more than 60% (actually I do not want to require 60% for both regions, but using  
--fraction-ref should do the job).
Now if I have A overlap B with >60%, C overlap D with >60%, but B overlap C with <60% for both. This method should still give me back A, B ,C and D, because for all of them one overlap of >60% can be found. Now if I want to merge my regions if they significantly overlap, if I do a merge of the results, I will merge them all, because all overlap, whereas I only want to merge A with B and C with D (and to keep 2 regions AB and CD that overlap a Little bit). That is what is where I am stuck.

There might be the problem of A overlap B significantly, B overlap C significantly, but A and C do not. In this case I would actually want to merge them all together (personal choice).

I guess what I want is, after the bedmap, take for each like the start of the first element, and the end of the last one (then I will have some overlapping feature, like i nthe example with A,B and C, a line with AB, a line with BAC, and a line with BC ; but I think their should always be one feature that overlap completely the others, so i can do another bedmap --fraction-ref 1). There can be more than 2, so it won't always be the same column number... I am not very good with awk but I guess i can do that.


ADD REPLYlink written 3.7 years ago by samuelcollombet10

"and to keep 2 regions AB and CD that overlap a Little bit"

I guess I don't understand how you intend to keep AB and CD separate after a merge operation, when they overlap a little bit. Can you be more explicit about your "inclusion" and "exclusion" criteria for all elements, if that makes sense? Maybe draw out what you are trying to do, and include numbers for overlap criteria.

If your criteria is simply 60% minimum overlap for membership in the output file, then what I provided should work. However, if you have two different minimum overlap criteria, you might do two runs of my script with different fraction parameters. Given the two resulting answer.bed files, you might again use bedops on both files to determine which elements are not common to both, using that result to apply custom merge logic with awk or similar.

ADD REPLYlink written 3.7 years ago by Alex Reynolds27k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 968 users visited in the last hour