Question: How To Separate A List Of Genomics Regions Into Several Disjoint Subsets
2
dustar1986330 wrote:

Hi,

I‘ve got a list of genomic regions in python, say:

``````list=["chr1: 100-300", "chr1:150-350", "chr1:200-400", "chr1:500-700", "chr1:600-800", "chr1:900-1000"]
``````

Some of them overlap each other. I wanna create a new list contain subsets of genomic regions and without overlapping regions in each subset. The result should be something like:

``````new_list=[["chr1:100-300", "chr1:500-700", "chr1:900-1000"], ["chr1:150-350", "chr1:600-800"], ["chr1:200-400"]]
``````

I created my own function but it tooks 20 mins to achieve the result, if the origin list contain 1000 regions.

python genomics • 2.4k views
modified 6.1 years ago by Biostar ♦♦ 20 • written 7.8 years ago by dustar1986330
2

It seems like you could have at least two different answers: `new_list_1=[["chr1:100-300", "chr1:500-700", "chr1:900-1000"], ["chr1:150-350", "chr1:600-800"], ["chr1:200-400"]]` or `new_list_2=[["chr1:100-300", "chr1:600-800", "chr1:900-1000"], ["chr1:150-350", "chr1:500-700"], ["chr1:200-400"]]`. Which result would be correct?

Either one will be great.

Could this be done with bedtools?

7
a.zielezinski9.3k wrote:

I would build a recursive function.

``````def disjoint(l, res=[]):
sl = []
n = 1
while n < len(l):
p,q = l[n-1:n+1]
if p >= q and p==q:
l.remove(q)
sl.append(q)
else:
n+=1
res.append(l)
if sl: disjoint(sl, res)
return res
``````

``````l = [('chr1',100,300), ('chr1', 150, 350), ('chr1', 200, 400), ('chr1', 500, 700), ('chr1',600, 800), ('chr1', 900, 1000)]
print disjoint(l)
``````

Gives you what you want:

``````[[('chr1', 100, 300), ('chr1', 500, 700), ('chr1', 900, 1000)], [('chr1', 150, 350), ('chr1', 600, 800)], [('chr1', 200, 400)]]
``````

This is exact what I want. Thank you very much indeed Zielezinski.

4
lh332k wrote:

As Alex has noted, your problem has multiple solutions. If you just want to find one of them, you can use a.zielezinski's script. If you want to iteratively find the maximum non-overlapping chain (i.e. you choose the chain having the longest total length and then choose the next longest chain from the rest of intervals), here is a naive solution based on dynamic programming.

Let `I(i)` be the i-th interval, `|I(i)|` be the length of the interval and `f(i)` the maximum total length if `I(i)` is chosen. We can compute `f(i)` with:

``````f(i) = max{f(j)+|I(i)| : j<i and I(j) not overlapping with I(i)}
``````

If you keep `j` that maximizes `f(i)` in the above equation, you can backtrack to get the interval list. This is an `O(N^2)` algorithm. I said it is naive because there are `O(NlogN)` algorithm for such problems. You usually need a tree to maintain a sorted list of the rightmost coordinates. But for your 1000 intervals, `O(N^2)` is more than enough.