Algorithms for finding nearest range given two lists of ranges?
2
0
Entering edit mode
5.4 years ago
endrebak ▴ 930

Are there known good datastructures/algorithms for finding the nearest region like bedtools closest?

From the top of my head:

If I have a datastructure to do interval lookup (like an intervaltree) I can for each range in A: (a_start, a_end) do

hits = []
i = 0
slack = 1000
while not hits:
hits = interval_tree.find(start - slack * i, end + slack * i)
i += 1

find_nearest_in_hits(start, end, hits)


But even in C this is might be slooow, depending on how the data looks.

I can use a large slack, but that would just require me to do more work in find_nearest_in_hits since it would get a larger result set most of the time.

(Here the intervaltree contains the ranges in B, while I use A to query it).

algorithms • 1.8k views
0
Entering edit mode

If you could give some examples, it will help us understand your problems much easier... And in your code, you didn't explain what the 'interval_tree' and 'find_nearest_in_hits' variables are

0
Entering edit mode

1
Entering edit mode
5.4 years ago

You can use a kd-tree with k=2.

0
Entering edit mode

Brilliant. There is both a scipy implementation: https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.spatial.KDTree.html#scipy.spatial.KDTree and a C library that seems like it'll be easy to wrap: https://github.com/jtsiomb/kdtree

1
Entering edit mode
5.4 years ago

Sort your lists of intervals. You can do a closest-features search more quickly on sorted lists.

For working code and paper, see:

0
Entering edit mode

https://github.com/bedops/bedops/blob/master/applications/bed/closestfeats/src/ClosestFeature.cpp Thanks. Iff sorting is much faster than KD-tree construction, I'll try to use your method (I suspect it'll be the other way around though).

1
Entering edit mode

Sorting and KD-tree time complexity are both O(n log n). Storage of a KD-tree in memory will cost O(n). Comparison of intervals from walking through a pair of sorted lists should generally take much, much less memory, in practice. I suspect using sorted data will work much better with whole-genome scale datasets, but this may depend on your application and what you're trying to do.

0
Entering edit mode

Mostly depends on the length distribution of the intervals, I think. When using sorted lists, I’ve had good results dividing the intervals into two buckets (one large bucket containing all intervals with length < threshold, and one small bucket containing the very long intervals), and keeping track of the maximum interval length to minimize the number of intervals in the second bucket that need to be scanned per query.