Question

Is there any way to merge intervals based on a minimum overlap?

0

Entering edit mode

5.2 years ago

VicGB • 0

I'm trying to find a way to merge a bed file like bedtools merge does but requiring a minimum overlap between intervals. To illustrate my idea, let's suppose we have 4 genomic regions: A, B, C and D. A and B accomplish a minimum overlap that I wish and so does C and D, but I don't want to merge {A B} with {C D} only because B and C also shares pass the threshold, because A doesn't overlap enough with C.

In other words, I would want to have [AB] and [BCD] regions merged.

I have thought to do merge my bed file sequentially on R with a for loop but I'm struggling with it, and I haven't found out a response in this forum.

Thanks in advance.

Merge BED • 3.2k views

ADD COMMENT • link updated 5.2 years ago by Alex Reynolds 35k • written 5.2 years ago by VicGB • 0

0

Entering edit mode

Thanks for the suggestions, but because I wrote the post rapidly, I didn't explain myself correctly as I gave incorrect details. I have used bedtools intersect with -f and -F 0.5 and -e, so if only one region is covered enough. The thing is, [A B], [B C] and [C D] accomplish this requisite, but not A with C.

In short, I don't wish to merge all regions just because B and C also pass the filter. As I stated before, I thought about doing merge sequentially (A with B, then [AB] doesn't overlap 0.5 with C. Now I check B with C so I merge then and finally [BC] with D so I would have [AB] and [BCD]). I will update the initial post.

ADD REPLY • link 5.2 years ago by VicGB • 0

score 2 · Answer 1 · 2019-02-19

It's an interesting quirk. Let's step back and draw some cartoons to see if that might help. I think the following case may help suggest a solution, or the start to one.

It seems like a "worst-case" scenario for you seems to be when merging elements A and B that overlap by your threshold, and B and C overlap by your threshold, but A and C are adjoining (and disjoint) or otherwise do not overlap by your threshold criterion.

Let's sketch this out. Here, chrZ is some weird abstract chromosome, and we have elements A, B, and C that span this chromosome:

|-----|-----|-----|-----| chrZ
[--   A   --)
      [--   B   --)
            [--   C   --)

A simple merge on a 0.5 or 50% overlap threshold will correctly give:

|-----|-----|-----|-----| chrZ
[--     AB      --)
[--        ABC        --)
      [--      BC     --)

In this worst-case scenario, the middle element from the result set (ABC) is calculated because B overlaps both A and C by 50%, and so a merge gives ABC.

It sounds like what you really want is something that looks like this:

|-----|-----|-----|-----| chrZ
[--     AB      --)
      [--     BC      --)

In other words, you want to remove any merge results from the set {AB, ABC, BC} which contain a nested element from the set of {A, B, C}. (Or, the larger set of merges for more than three elements, where there are full or partial pairwise 50% overlaps, more generally.)

A nested element is one for which its start and stop coordinates fall within the bounds of its parent interval.

A merge element result that contains a nested element is one generated from a use case where one of the elements contributing to that merge result does not meet your threshold over all pairwise comparisons. I think this property can be exploited.

In the above worst-case scenario, merge result ABC contains the nested element B, where both the start and stop coordinates of B fall within the start and stop coordinates of ABC.

You could use awk to strip out such results from a bedmap call, by using awk to test the start and end coordinates of each input coordinate from in.bed against the merged range result that comes out of --echo-map-range:

$ bedmap --fraction-map 0.5 --echo-map-range --echo --delim '\t' in.bed | awk -vFS="\t" -vOFS="\t" '{ if (($2 == $5) || ($3 == $6)) { print $1, $2, $3 } }' | sort -u | sort-bed - > out.bed

Note that this also pipes results from bedmap and awk to sort -u, in order take care of duplicate --echo-map-range results (generated from other, more relaxed overlap situations) and there is a final pipe to sort-bed, in order to ensure generating a sorted BED file that is ready for further set operations.

I think this works, but I may be wrong by missing a use case. Please feel free to give it a go and post back sample data, if you believe it doesn't meet your needs.

score 1 · Answer 2 · 2019-02-19

I think you can use bedtools intersect, with the -f (minimum fraction of overlap) or -r (reciprocal overlap) option. In the worse case you just can use bedtools intersect, ask to print the length of overlap and then remove all the instances in which the overlap is shorter than your threshold. Maybe some bedtools expert here can provide more details.

score 1 · Answer 3 · 2019-02-19

Here's one way via BEDOPS bedmap.

Given test.bed:

$ cat test.bed
chr1    100 102
chr1    101 103
chr1    103 106
chr1    104 107
chr1    106 108

Say we are interested in only merging elements from test.bed which overlap by two or more bases.

We can use bedmap --bp-ovr <N> to specify that we want only those elements which overlap other elements by N or more bases, use --count to count how many such overlaps take place, filter for elements that overlap at least one other such element, and then pipe them to bedops --merge to merge them:

$ bedmap --count --echo --bp-ovr 2 --delim '\t' test.bed | awk '($1>1)' | cut -f2- | bedops --merge -
chr1    103 107

As you can see, this leaves out single-base overlap elements from the merge result.

In place of --bp-over <N>, you could use --fraction-map <F>, where F specifies fractional overlap as a value between 0 and 1.