Bed file: merging intervals that overlap a certain percentage
2
1
Entering edit mode
5.9 years ago
shwethacm ▴ 230

Hello,

I have a BED file that has many intervals that are overlapping.

My objective is to merge the intervals to see how much of the chromosome the bed file spans.

BEDtools merge was my natural go-to method, however, there's a catch.

I only want to merge overlaps that reciprocally overlap by say, 90% . BEDtools merge as far as I know, doesn't have this feature. Does anyone know any tool that I can use that can do this?

Thanks a bunch and merry christmas!

Latest BEDtools Bed file manipulation • 5.5k views
0
Entering edit mode

Alex, this is great!

Thanks a ton.

Just two things that I want to clarify:

1. the first awk $1>1 filters out the intervals that do not overlap with any other intervals, so if I want to keep them, I can just get rid of this filter (...right? That's how I understood this. For what I am trying to do I will want to keep them ((I'm trying to split the chromosome in regions such that highly overlapping regions aren't broken)) - I will use the bedops complement for this once I gather my intervals) 2. For reciprocal overlap (90% w.r.t both the intervals being considered for merging), should't the fraction option be --fraction-both instead of fraction-either? Again, thanks a ton for your timely help. It is MUCH appreciated! ADD REPLY 0 Entering edit mode You may also consider using Homer. It has an options from one file to multiple. Bedops is one of the good ones but it created duplicates for my case. ADD REPLY 10 Entering edit mode 5.9 years ago With BEDOPS bedmap: $ bedmap --count --echo-map-range --fraction-both 0.9 --delim '\t' intervals.bed \
| awk '$1>1' - \ | cut -f2- - \ | sort-bed - \ | uniq - \ > answer.bed  If the intervals are unsorted or are of unknown sort state, first sort before mapping: $ sort-bed intervals.unsorted.bed > intervals.bed


On further thought, I think you may want to filter single-overlap cases; please see discussion further in this thread.

Edit: Changed --fraction-either to --fraction-both to do a correct mutual overlap test.

1
Entering edit mode
5.9 years ago

the first awk $1>1 filters out the intervals that do not overlap with any other intervals, so if I want to keep them, I can just get rid of this filter The awk step filters out an element which overlaps itself - in that, of course, every element overlaps itself by 100%, which will always be higher than or equal to any specified threshold. If you want to keep these elements, then you can remove the awk statement. Not filtering might be problematic where you have elements A and B that overlap, but they do not meet the mutual threshold. Consider the following intervals: chrN 50 1000 chrN 60 10000  While the first element overlaps the second by ~99%, the second element overlaps the first by ~9%. If you do not filter, then you still get both elements back, despite them not meeting the threshold you set. They overlap, so even if you are to merge them afterwards with bedops --merge or similar, then you end up violating your overlap threshold. Instead, I would suggest you do the first approach, filtering any self-overlapping elements. Then do a union of this result with any elements that overlap themselves and only themselves ("exclusive" intervals). The unioned set should have disjoint (non-overlapping) elements. To explain further: $ bedmap --count --echo-map-range --fraction-both 0.9 --delim '\t' intervals.bed \
| awk '$1>1' - \ | cut -f2- - \ | sort-bed - \ | uniq - \ > mutuallyOverlappingIntervals.bed$ bedops --merge intervals.bed > mergedIntervals.bed
$bedmap --echo-map --exact --skip-unmapped intervals.bed mergedIntervals.bed > exclusiveIntervals.bed$ bedops --everything exclusiveIntervals.bed mutuallyOverlappingIntervals.bed > finalAnswer.bed


I suspect this would work better at adding self-overlapping intervals with mutually-overlapping intervals, since this avoids the possibility of overlaps between elements in these two subsets.

In other words, the intervals in the final answer should be disjoint - counting bases in these elements should not result in double-counts - and should also respect the mutual overlap threshold, in the case where elements had overlapped.

For reciprocal overlap (90% w.r.t both the intervals being considered for merging), should't the fraction option be --fraction-both instead of fraction-either?

I think you are correct. If the lengths of elements A and B are sufficiently different, then element A's overlap of element B may be of a much higher or lower fraction of A's length than B's length. Please use --fraction-both. Sorry, I'll edit my answer.

0
Entering edit mode
$bedmap --echo-map --exact --skip-unmapped intervals.bed mergedIntervals.bed > exclusiveIntervals.bed  Isn't necessary to add uniq - to that step in case there are duplicated regions in your original intervals.bed file? As in: bedmap --echo-map --exact --skip-unmapped intervals.bed mergedIntervals.bed | sort-bed - | uniq - > exclusiveIntervals.bed  I used this as an example of intervals.bed to merge and map: chr1 4394800 4395357 chr1 4416903 4417046 chr1 4426956 4427083 chr1 4497875 4498980 chr1 4507891 4507997 chr1 4544029 4544271 chr1 4598562 4598840 chr1 14357732 14358035 chr1 14357732 14358035  ADD REPLY 1 Entering edit mode bedmap just calculates overlaps based on whatever input reference file you give it. So you could just pipe unique intervals (assuming they are sorted): $ uniq intervals.bed | bedmap --options - mergedIntervals.bed | ...


In this case, the use of the hyphen in bedmap in between options and the merged interval file denotes that it is consuming standard input from the upstream process (uniq) in place of the reference file.

You could also do this afterwards, but it is slightly more efficient to filter upstream of bedmap.

0
Entering edit mode

Great! Thanks for clarifying.