How to identify consecutive (adjacent) genomic intervals
4
0
Entering edit mode
6.4 years ago
A. Domingues ★ 2.6k

I have a list of genomic coordinates, bed format, and I want to determine which intervals are immediately followed by another. For example:

chr1 100 200 a1 . +
chr1 500 600 a2 . +
chr1 201 300 a3 . +


So in this case the ouput would be:

chr1 100 200 a1 . + chr1 201 300 a3 . +


(there also intervals in the minus strand)

My quest is on the face of it quite simple, but I can't think of a straighforward solution. I also looked for existing solutions using IRanges (R/Bioconductor) or bedtools but found none. Something in R/Bioconductor would be appreciated, but bash or python are also welcome.

Latest bed genomic intervals • 2.9k views
3
Entering edit mode
6.4 years ago

EDIT - I suggest using closest-features in an answer at the bottom of this page. I would use that approach instead of this one. It will run fast and allow more relaxed assumptions about input.

Running under the assumption that your input intervals are entirely disjoint — none overlap each other — you can use BEDOPS tools to solve this easily.

First, sort your intervals with BEDOPS sort-bed:

$sort-bed < intervals.unsorted.bed > intervals.bed  As a first pass, use bedmap directly with a symmetric one-base padding (via --range): $ bedmap --echo --echo-map --range 1 --delim '\t' intervals.bed > answer.bed


Take a look at answer.bed. Because we use a symmetric padding, this result will contain multiple intervals, if one interval either follows before or after another by one base.

Per your question's guidelines, we only want mapped intervals that follow afterwards. One extra step is needed to filter out elements in answer.bed where an interval contains itself and another interval that comes before it:

$bedmap --echo --echo-map --range 1 --delim '\t' intervals.bed \ | awk '$3 < $8' - \ > filtered.answer.bed  The addition of the awk test filters out results where the stop position of a reference (unpadded) interval is greater than or equal to the start position of any mapped (padded) interval. Depending on what "follow" means for reverse-stranded elements, there is an easy way to deal with that case. Feel free to follow up with what that means to you. Also, if your input intervals are not guaranteed to be disjoint, follow up with that qualification and I'll see if I can modify my answer to help in that case. ADD COMMENT 2 Entering edit mode I seriously love these detailed answers you give. Sometimes I want to put your name under the tag when I run into a problem that falls into the bedtools / bedops category just so make sure you take a look at it. ADD REPLY 0 Entering edit mode Thank you for the detailed answer. Running under the assumption that your input intervals are entirely disjoint - none overlap each other There are overlapping intervals. Depending on what "follow" means for reverse-stranded elements Sorry, that was not clear. It means (think of reads) <-interval2- <-interval1-. Or in bed format: chr1 201 300 a3 . - chr1 100 200 a1 . -  I hope it is clearer now. ADD REPLY 0 Entering edit mode To deal with the stranded case, you would split intervals into two files, one for each strand: $ sort-bed < intervals.unsorted.bed | awk '$6 == "+"' > intervals.for.bed$ sort-bed < intervals.unsorted.bed | awk '$6 == "-"' > intervals.rev.bed  For each strand, do separate tests. For the forward strand elements, filter as shown in my original answer: $ bedmap --echo --echo-map --range 1 --delim '\t' intervals.for.bed \
| awk '$3 <$8' - \


In the reverse-stranded case, change the test:

$bedmap --echo --echo-map --range 1 --delim '\t' intervals.rev.bed \ | awk '$2 > $9' - \ > filtered.answer.rev.bed  If you want the results in one file: $ bedops --everything filtered.answer.{for,rev}.bed > filtered.answer.bed

0
Entering edit mode

In the case where you have non-disjoint elements, you can filter them out and run the bedmap operation on the disjoint elements:

$sort-bed < intervals.unsorted.bed > intervals.bed$ bedmap --count --echo intervals.bed | awk '$1==1' | cut -f2- > intervals.disjoint.bed$ bedmap --echo --echo-map --range 1 --delim '\t' intervals.disjoint.bed | awk '$3 <$8' > filtered.answer.disjoint.bed


Working on disjoint elements alone reduces the number of (pairwise) tests on overlapping elements, which can require many comparisons, and can save some time.

Next, generate the set of non-disjoint or overlapping elements:

$bedops --not-element-of 1 intervals.bed intervals.disjoint.bed > intervals.overlapping.bed  Run a forwards-sweeping pairwise comparison between all the overlapping elements, such as with the test that piet describes below. In other words, start at element i and compare it with elements i+1, i+2, etc. Since the intervals are sorted on the start position (and chromosome), you can optimize this by only doing pairwise tests on an interval pair, up until the point at which the stop position of the first interval is less than the start position of the second interval. Once that condition is met, you can go to the next interval pair and start sweeping again. For the stranded case, you can split by strand via awk and then do things similarly for disjoint elements. For overlapping, reverse-stranded elements, you could sweep backwards through the file one line at a time, testing interval pairs and applying the opposite position condition to optimize pairwise tests. In other words, you can stop testing a pair when start position of the first interval is greater than the stop position of the second interval. ADD REPLY 1 Entering edit mode 6.4 years ago piet ★ 1.8k This problem is a variant of calculating the "overlap between two intervals", see The distance between two intervals may be determined as the negative overlap, and the special case zero means that the intervals are located next to each other without any space between. Here is a short example in Python. def overlap(start1,end1,start2,end2): return min(end1, end2) - max(start1, start2) + 1 print overlap(100,200,500,600) print overlap(100,200,201,300) print overlap(201,300,500,600) -299 0 -199  ADD COMMENT 0 Entering edit mode Given n unsorted intervals, you'd need to do n(n-1) overlap tests. You'd probably want to integrate sorting into this, so that you can optimize the number of comparisons. ADD REPLY 0 Entering edit mode Thanks piet. That was more or less the solution I had in mind (in R though). My fear was that it has to be put in a loop and be slow. But your solution does makes sense. @Alex in my head sorted intervals are assumed. ADD REPLY 1 Entering edit mode 6.3 years ago I just realized there is much simpler approach that allows for overlapping, stranded elements, without doing n(n-1) comparisons, using BEDOPS closest-features: $ closest-features --dist --no-overlaps --closest intervals.for.bed intervals.for.bed \
| awk -F '|' '$3==2' \ | tr '|' '\t' \ > answer.for.bed  And for reverse-stranded elements: $ closest-features --dist --no-overlaps --closest intervals.rev.bed intervals.rev.bed \
| awk -F '|' '\$3==-2' \
| tr '|' '\t' \


The --dist option reports the signed distance between elements, where positive is for downstream elements, and negative is for upstream elements. The --no-overlaps option removes consideration of elements which overlap themselves or other elements. The --closest option returns the single nearest query element, whether upstream or downstream of the input element.

We specify intervals.{for,rev}.bed twice, once as the input (or "reference") file, and again as the query file.

We use awk to filter for elements that are two bases away from each other (BED is 0-indexed, half-open).

We use tr to replace the pipe character | with a tab, to get a result closer to your desired format.

The intervals.{for,rev}.bed file should be sorted per BEDOPS sort-bed and split with awk before using closest-features, as described in a previous answer.

This approach should be much simpler and faster — definitely recommend using this approach over bedmap.

0
Entering edit mode
6.4 years ago
ivivek_ngs ★ 5.1k

have you looked into sortBed -i option. It will sort you bed file or the genomic coordinate in the order you might be looking for and create a bed file only and then with any transpose function you can convert the rows to columns and get your desired output provided you will be getting tab separated values for each string since in any case bed files are tab separated and if you open in excel or in vi or nano you will be able to see that you actually have 6 columns for the input you put and the desired output you are asking is something in columnar format. So I believe you might just want to sort for column 2 with sortbed to a new bedfile or convert that bedfile to your desired format with any transpose method provided you convert each row into a single string.