Question: How to identify consecutive (adjacent) genomic intervals
0
2.6 years ago by
A. Domingues1.6k
Mainz, Germany
A. Domingues1.6k wrote:

I have a list of genomic coordinates, bed format, and I want to determine which intervals are immediately followed by another. For example:

``````chr1 100 200 a1 . +
chr1 500 600 a2 . +
chr1 201 300 a3 . +
``````

So in this case the ouput would be:

``````chr1 100 200 a1 . + chr1 201 300 a3 . +
``````

(there also intervals in the minus strand)

My quest is on the face of it quite simple, but I can't think of a straighforward solution. I also looked for existing solutions using IRanges (R/Bioconductor) or bedtools but found none. Something in R/Bioconductor would be appreciated, but bash or python are also welcome.

genomic intervals bed latest • 1.3k views
modified 4 days ago by Ram17k • written 2.6 years ago by A. Domingues1.6k
3
2.6 years ago by
Alex Reynolds25k
Seattle, WA USA
Alex Reynolds25k wrote:

EDIT - I suggest using `closest-features` in an answer at the bottom of this page. I would use that approach instead of this one. It will run fast and allow more relaxed assumptions about input.

Running under the assumption that your input intervals are entirely disjoint — none overlap each other — you can use BEDOPS tools to solve this easily.

First, sort your intervals with BEDOPS sort-bed:

``````\$ sort-bed < intervals.unsorted.bed > intervals.bed
``````

As a first pass, use bedmap directly with a symmetric one-base padding (via --range):

``````\$ bedmap --echo --echo-map --range 1 --delim '\t' intervals.bed > answer.bed
``````

Take a look at `answer.bed`. Because we use a symmetric padding, this result will contain multiple intervals, if one interval either follows before or after another by one base.

Per your question's guidelines, we only want mapped intervals that follow afterwards. One extra step is needed to filter out elements in `answer.bed` where an interval contains itself and another interval that comes before it:

``````\$ bedmap --echo --echo-map --range 1 --delim '\t' intervals.bed \
| awk '\$3 < \$8' - \
``````

The addition of the awk test filters out results where the stop position of a reference (unpadded) interval is greater than or equal to the start position of any mapped (padded) interval.

Depending on what "follow" means for reverse-stranded elements, there is an easy way to deal with that case. Feel free to follow up with what that means to you.

Also, if your input intervals are not guaranteed to be disjoint, follow up with that qualification and I'll see if I can modify my answer to help in that case.

2

I seriously love these detailed answers you give. Sometimes I want to put your name under the tag when I run into a problem that falls into the bedtools / bedops category just so make sure you take a look at it.

Thank you for the detailed answer.

Running under the assumption that your input intervals are entirely disjoint - none overlap each other

There are overlapping intervals.

Depending on what "follow" means for reverse-stranded elements

Sorry, that was not clear. It means (think of reads) <-interval2- <-interval1-. Or in bed format:

``````chr1 201 300 a3 . -  chr1 100 200 a1 . -
``````

I hope it is clearer now.

To deal with the stranded case, you would split intervals into two files, one for each strand:

``````\$ sort-bed < intervals.unsorted.bed | awk '\$6 == "+"' > intervals.for.bed
\$ sort-bed < intervals.unsorted.bed | awk '\$6 == "-"' > intervals.rev.bed
``````

For each strand, do separate tests.

For the forward strand elements, filter as shown in my original answer:

``````\$ bedmap --echo --echo-map --range 1 --delim '\t' intervals.for.bed \
| awk '\$3 < \$8' - \
``````

In the reverse-stranded case, change the test:

``````\$ bedmap --echo --echo-map --range 1 --delim '\t' intervals.rev.bed \
| awk '\$2 > \$9' - \
``````

If you want the results in one file:

``````\$ bedops --everything filtered.answer.{for,rev}.bed > filtered.answer.bed
``````

In the case where you have non-disjoint elements, you can filter them out and run the bedmap operation on the disjoint elements:

``````\$ sort-bed < intervals.unsorted.bed > intervals.bed
\$ bedmap --count --echo intervals.bed | awk '\$1==1' | cut -f2- > intervals.disjoint.bed
\$ bedmap --echo --echo-map --range 1 --delim '\t' intervals.disjoint.bed | awk '\$3 < \$8' > filtered.answer.disjoint.bed
``````

Working on disjoint elements alone reduces the number of (pairwise) tests on overlapping elements, which can require many comparisons, and can save some time.

Next, generate the set of non-disjoint or overlapping elements:

``````\$ bedops --not-element-of 1 intervals.bed intervals.disjoint.bed > intervals.overlapping.bed
``````

Run a forwards-sweeping pairwise comparison between all the overlapping elements, such as with the test that piet describes below. In other words, start at element `i` and compare it with elements `i+1`, `i+2`, etc.

Since the intervals are sorted on the start position (and chromosome), you can optimize this by only doing pairwise tests on an interval pair, up until the point at which the stop position of the first interval is less than the start position of the second interval. Once that condition is met, you can go to the next interval pair and start sweeping again.

For the stranded case, you can split by strand via awk and then do things similarly for disjoint elements. For overlapping, reverse-stranded elements, you could sweep backwards through the file one line at a time, testing interval pairs and applying the opposite position condition to optimize pairwise tests. In other words, you can stop testing a pair when start position of the first interval is greater than the stop position of the second interval.

1
2.6 years ago by
piet1.5k
planet earth
piet1.5k wrote:

This problem is a variant of calculating the "overlap between two intervals", see

The distance between two intervals may be determined as the negative overlap, and the special case zero means that the intervals are located next to each other without any space between. Here is a short example in Python.

``````def overlap(start1,end1,start2,end2):
return min(end1, end2) - max(start1, start2) + 1

print overlap(100,200,500,600)
print overlap(100,200,201,300)
print overlap(201,300,500,600)

-299
0
-199
``````

Given `n` unsorted intervals, you'd need to do `n(n-1)` overlap tests. You'd probably want to integrate sorting into this, so that you can optimize the number of comparisons.

Thanks piet. That was more or less the solution I had in mind (in R though). My fear was that it has to be put in a loop and be slow. But your solution does makes sense.

@Alex in my head sorted intervals are assumed.

1
2.6 years ago by
Alex Reynolds25k
Seattle, WA USA
Alex Reynolds25k wrote:

I just realized there is much simpler approach that allows for overlapping, stranded elements, without doing n(n-1) comparisons, using BEDOPS `closest-features`:

``````\$ closest-features --dist --no-overlaps --closest intervals.for.bed intervals.for.bed \
| awk -F '|' '\$3==2' \
| tr '|' '\t' \
``````

And for reverse-stranded elements:

``````\$ closest-features --dist --no-overlaps --closest intervals.rev.bed intervals.rev.bed \
| awk -F '|' '\$3==-2' \
| tr '|' '\t' \
``````

The `--dist` option reports the signed distance between elements, where positive is for downstream elements, and negative is for upstream elements. The `--no-overlaps` option removes consideration of elements which overlap themselves or other elements. The `--closest` option returns the single nearest query element, whether upstream or downstream of the input element.

We specify `intervals.{for,rev}.bed` twice, once as the input (or "reference") file, and again as the query file.

We use `awk` to filter for elements that are two bases away from each other (BED is 0-indexed, half-open).

We use `tr` to replace the pipe character `|` with a tab, to get a result closer to your desired format.

The `intervals.{for,rev}.bed` file should be sorted per BEDOPS `sort-bed` and split with `awk` before using `closest-features`, as described in a previous answer.

This approach should be much simpler and faster — definitely recommend using this approach over `bedmap`.

0
2.6 years ago by
vchris_ngs4.5k
Seattle,WA, USA
vchris_ngs4.5k wrote:

have you looked into sortBed -i option. It will sort you bed file or the genomic coordinate in the order you might be looking for and create a bed file only and then with any transpose function you can convert the rows to columns and get your desired output provided you will be getting tab separated values for each string since in any case bed files are tab separated and if you open in excel or in vi or nano you will be able to see that you actually have 6 columns for the input you put and the desired output you are asking is something in columnar format. So I believe you might just want to sort for column 2 with sortbed to a new bedfile or convert that bedfile to your desired format with any transpose method provided you convert each row into a single string.