multiple bed - merge regions IF overlapping more than xx percent of size
2
1
Entering edit mode
8.9 years ago

Hi,

I have a couple of bed files with regions (peaks of ChIPseq), and I would like to create a unique file with merge regions, where overlapping regions are fused IF the overlap cover more than xx % (in my case 60%) of one of the 2 regions (not necessary both).

I know how to do that with 2 files, however I have more (at least 4). I could do recursive and pipes to do it pairwise multiple times (with bedops or bedtools), but I was wondering if there would not be a cleaner way. I can use bedops to merge everything that overlap by 1bp for my bed files at once, but I do not find a way to control on the overlap fraction...

Any idea?

ChIP-Seq bedops bedtools GenomicRanges • 4.9k views
ADD COMMENT
3
Entering edit mode
8.9 years ago

Let's look at the following BEDOPS-based solution, which is a modification of the approach authored here by Shane Neph:

$ bedops --everything file1.bed file2.bed ... fileN.bed \
    | bedmap --echo-map --fraction-both 0.6 - \
    | awk '(split($0, a, ";") > 1)' - \
    | sed 's/\;/\n/g' - \
    | sort-bed - \
    | uniq - \
    > answer.bed

It might look a bit overwhelming, but it is really quite straightforward if we walk through this line by line.

  1. Take the multiset union of all elements from all N input files with bedops --everything. Note that this is not the same as a --merge operation, which instead makes contiguous regions from all overlapping input regions. Pipe this unioned set to bedmap.
  2. For each element from all files, use bedmap --echo-map to print any other elements from the multiset union that overlap that first element by 60% or more, using --fraction-both 0.6. The result will be a semi-colon delimited string that contains one or more overlapping elements. Note that this result will always include the original element, because it overlaps itself by 100%. Pipe this result to the awk statement.
  3. The awk command splits the string on the semi-colon delimiter and prints out the string if there are two or more overlapping elements. We do this filter step, because we are not interested in any element overlapping itself and only itself, but only multiple (2+) elements that meet the 60% or greater overlap criterion. We pipe this string to sed.
  4. The sed statement replaces all semi-colons with newline characters. This creates BED-formatted output that we pipe to sort-bed.
  5. The sort-bed tool takes the overlapping elements and puts them in lexicographically-sorted order. We pipe this to uniq to remove duplicate elements.

The file answer.bed is a sorted BED file that contains all unique elements from file1.bed through fileN.bed, where there is mutual 60% or greater overlap.

Note that this is a highly efficient solution, compared with O(2^N) pairwise comparisons required by other approaches. Further, we make use of Unix standard input and output streaming (note the hyphen at the end of each command, denoting standard input) to avoid the cost of making unnecessary intermediate files; writing to disk is expensive. The only thing written to disk is the final result.

ADD COMMENT
0
Entering edit mode

I actually did something like that previously, but:

here I will get all the regions that overlap an other region by more than 60% (actually I do not want to require 60% for both regions, but using --fraction-ref should do the job).

Now if I have A overlap B with >60%, C overlap D with >60%, but B overlap C with <60% for both. This method should still give me back A, B ,C and D, because for all of them one overlap of >60% can be found. Now if I want to merge my regions if they significantly overlap, if I do a merge of the results, I will merge them all, because all overlap, whereas I only want to merge A with B and C with D (and to keep 2 regions AB and CD that overlap a Little bit). That is what is where I am stuck.

There might be the problem of A overlap B significantly, B overlap C significantly, but A and C do not. In this case I would actually want to merge them all together (personal choice).

I guess what I want is, after the bedmap, take for each like the start of the first element, and the end of the last one (then I will have some overlapping feature, like in the example with A, B and C, a line with AB, a line with BAC, and a line with BC ; but I think their should always be one feature that overlap completely the others, so I can do another bedmap --fraction-ref 1). There can be more than 2, so it won't always be the same column number. I am not very good with awk but I guess I can do that.

ADD REPLY
1
Entering edit mode

and to keep 2 regions AB and CD that overlap a Little bit

I guess I don't understand how you intend to keep AB and CD separate after a merge operation, when they overlap a little bit. Can you be more explicit about your "inclusion" and "exclusion" criteria for all elements, if that makes sense? Maybe draw out what you are trying to do, and include numbers for overlap criteria.

If your criteria is simply 60% minimum overlap for membership in the output file, then what I provided should work. However, if you have two different minimum overlap criteria, you might do two runs of my script with different fraction parameters. Given the two resulting answer.bed files, you might again use bedops on both files to determine which elements are not common to both, using that result to apply custom merge logic with awk or similar.

ADD REPLY
0
Entering edit mode

If you are looking at 3 separate bed files representing 3 replicates for instance, how would you go about limiting the printed result with this method to only elements present in all 3 of the replicates (at some % overlap)? Thanks.

ADD REPLY
1
Entering edit mode

I think you would just modify the awk statement to filter explicitly for those elements wherein there are exactly three elements that meet the --fraction-both 0.6 criterion:

$ N=3
$ bedops --everything file1.bed file2.bed ... fileN.bed \
    | bedmap --echo-map --fraction-both 0.6 - \
    | awk -vN=${N} '(split($0, a, ";") == N)' - \
    | sed 's/\;/\n/g' - \
    | sort-bed - \
    | uniq - \
    > answer.bed

Does that make sense?

ADD REPLY
0
Entering edit mode

Absolutely, exactly what I needed, thanks!

ADD REPLY
0
Entering edit mode

So I have the following code:

N=6
bedops --everything \
    | bedmap --echo-map --fraction-both 0.1 - \
    | awk -vN=${N} '(split($0, a, ";") == N)' - \
    | sed 's/\;/\n/g' - \
    | sort-bed - \
    | uniq - \
    | bedops --merge - \
    > answer1.bed

I input 6 bed files for an overlap that will give me intervals where at least 6 overlap at .1, however sometimes there is the instance where there are two adjoining intervals within one of the files that overlap the other files getting the total number to 6 when really only 5 of the files showed overlaps. Wondering how I can adjust it so it shows only 6/6 overlaps across the files. Thanks.

ADD REPLY
0
Entering edit mode

I'm trying to understand the problem. Can you give me a test input that fails? Thanks!

ADD REPLY

Login before adding your answer.

Traffic: 1928 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6