Question: Bed file: merging intervals that overlap a certain percentage
0
gravatar for shwethacm
3.8 years ago by
shwethacm200
Seattle, WA
shwethacm200 wrote:

Hello, 

I have a BED file that has many intervals that are overlapping. 

My objective is to merge the intervals to see how much of the chromosome the bed file spans. 

BEDtools merge was my natural go-to method, however, there's a catch. 

I only want to merge overlaps that reciprocally overlap by say, 90% . BEDtools merge as far as I know, doesn't have this feature. Does anyone know any tool that I can use that can do this? 

Thanks a bunch and merry christmas! 

ADD COMMENTlink modified 3.8 years ago by Alex Reynolds29k • written 3.8 years ago by shwethacm200
10
gravatar for Alex Reynolds
3.8 years ago by
Alex Reynolds29k
Seattle, WA USA
Alex Reynolds29k wrote:

With BEDOPS bedmap:

$ bedmap --count --echo-map-range --fraction-both 0.9 --delim '\t' intervals.bed \
    | awk '$1>1' - \
    | cut -f2- - \
    | sort-bed - \
    | uniq - \
    > answer.bed

If the intervals are unsorted or are of unknown sort state, first sort before mapping:

$ sort-bed intervals.unsorted.bed > intervals.bed

On further thought, I think you may want to filter single-overlap cases; please see discussion further in this thread.

Edit: Changed --fraction-either to --fraction-both to do a correct mutual overlap test.

ADD COMMENTlink modified 3.1 years ago • written 3.8 years ago by Alex Reynolds29k
1
gravatar for Alex Reynolds
3.8 years ago by
Alex Reynolds29k
Seattle, WA USA
Alex Reynolds29k wrote:

1. the first awk '$1>1' filters out the intervals that do not overlap with any other intervals, so if I want to keep them, I can just get rid of this filter

The awk step filters out an element which overlaps itself — in that, of course, every element overlaps itself by 100%, which will always be higher than or equal to any specified threshold. If you want to keep these elements, then you can remove the awk statement.

Not filtering might be problematic where you have elements A and B that overlap, but they do not meet the mutual threshold. Consider the following intervals:

chrN      50      1000
chrN      60     10000

While the first element overlaps the second by ~99%, the second element overlaps the first by ~9%. If you do not filter, then you still get both elements back, despite them not meeting the threshold you set. They overlap, so even if you are to merge them afterwards with bedops --merge or similar, then you end up violating your overlap threshold.

Instead, I would suggest you do the first approach, filtering any self-overlapping elements. Then do a union of this result with any elements that overlap themselves and only themselves ("exclusive" intervals). The unioned set should have disjoint (non-overlapping) elements.

To explain further:

$ bedmap --count --echo-map-range --fraction-both 0.9 --delim '\t' intervals.bed \
    | awk '$1>1' - \
    | cut -f2- - \
    | sort-bed - \
    | uniq - \
    > mutuallyOverlappingIntervals.bed

$ bedops --merge intervals.bed > mergedIntervals.bed

$ bedmap --echo-map --exact --skip-unmapped intervals.bed mergedIntervals.bed > exclusiveIntervals.bed

$ bedops --everything exclusiveIntervals.bed mutuallyOverlappingIntervals.bed > finalAnswer.bed

I suspect this would work better at adding self-overlapping intervals with mutually-overlapping intervals, since this avoids the possibility of overlaps between elements in these two subsets.

In other words, the intervals in the final answer should be disjoint — counting bases in these elements should not result in double-counts — and should also respect the mutual overlap threshold, in the case where elements had overlapped.

2. For reciprocal overlap (90% w.r.t both the intervals being considered for merging), should't the fraction option be --fraction-both instead of fraction-either ? 

I think you are correct. If the lengths of elements A and B are sufficiently different, then element A's overlap of element B may be of a much higher or lower fraction of A's length than B's length. Please use --fraction-both. Sorry, I'll edit my answer.

ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by Alex Reynolds29k
`$ bedmap --echo-map --exact --skip-unmapped intervals.bed mergedIntervals.bed > exclusiveIntervals.bed`

Isn't necessary to add uniq - to that step in case there are duplicated regions in your original intervals.bed file? As in:

`bedmap --echo-map --exact --skip-unmapped intervals.bed mergedIntervals.bed | sort-bed - | uniq - > exclusiveIntervals.bed`

I used this as an example of intervals.bed to merge and map:

chr1    4394800 4395357
chr1    4416903 4417046
chr1    4426956 4427083
chr1    4497875 4498980
chr1    4507891 4507997
chr1    4544029 4544271
chr1    4598562 4598840
chr1    14357732    14358035
chr1    14357732    14358035
ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by Fernando Rossello0
1

bedmap just calculates overlaps based on whatever input reference file you give it. So you could just pipe unique intervals (assuming they are sorted):

$ uniq intervals.bed | bedmap --options - mergedIntervals.bed | ...

In this case, the use of the hyphen in bedmap in between options and the merged interval file denotes that it is consuming standard input from the upstream process (uniq) in place of the reference file.

You could also do this afterwards, but it is slightly more efficient to filter upstream of bedmap.

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by Alex Reynolds29k

Great! Thanks for clarifying.

ADD REPLYlink written 3.1 years ago by Fernando Rossello0
0
gravatar for shwethacm
3.8 years ago by
shwethacm200
Seattle, WA
shwethacm200 wrote:

Alex, this is great! 

Thanks a ton. 

Just two things that I want to clarify:

1. the first awk '$1>1' filters out the intervals that do not overlap with any other intervals, so if I want to keep them, I can just get rid of this filter

(...right? That's how I understood this. For what I am trying to do I will want to keep them ((I'm trying to split the chromosome in regions such that highly overlapping regions aren't broken)) - I will use the bedops complement for this once I gather my intervals )  

2. For reciprocal overlap (90% w.r.t both the intervals being considered for merging), should't the fraction option be --fraction-both instead of fraction-either ? 

Again, thanks a ton for your timely help. It is MUCH appreciated! 

ADD COMMENTlink written 3.8 years ago by shwethacm200

You may also consider using Homer. It has an options from one file to multiple. Bedops is one of the good ones but it created duplicates for my case. 

ADD REPLYlink written 3.8 years ago by morovatunc400
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2243 users visited in the last hour