Question: How To Separate A List Of Genomics Regions Into Several Disjoint Subsets
gravatar for dustar1986
5.1 years ago by
dustar1986260 wrote:


I‘ve got a list of genomic regions in python, say:

list=["chr1: 100-300", "chr1:150-350", "chr1:200-400", "chr1:500-700", "chr1:600-800", "chr1:900-1000"]

Some of them overlap each other. I wanna create a new list contain subsets of genomic regions and without overlapping regions in each subset. The result should be something like:

new_list=[["chr1:100-300", "chr1:500-700", "chr1:900-1000"], ["chr1:150-350", "chr1:600-800"], ["chr1:200-400"]]

I created my own function but it tooks 20 mins to achieve the result, if the origin list contain 1000 regions.

Is there any quick way to do that? Please help.

python genomics • 1.5k views
ADD COMMENTlink modified 3.4 years ago by Biostar ♦♦ 20 • written 5.1 years ago by dustar1986260

It seems like you could have at least two different answers: new_list_1=[["chr1:100-300", "chr1:500-700", "chr1:900-1000"], ["chr1:150-350", "chr1:600-800"], ["chr1:200-400"]] or new_list_2=[["chr1:100-300", "chr1:600-800", "chr1:900-1000"], ["chr1:150-350", "chr1:500-700"], ["chr1:200-400"]]. Which result would be correct?

ADD REPLYlink modified 5.1 years ago • written 5.1 years ago by Alex Reynolds22k

Either one will be great.

ADD REPLYlink written 5.1 years ago by dustar1986260

Could this be done with bedtools?

ADD REPLYlink written 5.1 years ago by 141341254653464453.3k
gravatar for a.zielezinski
5.1 years ago by
a.zielezinski7.8k wrote:

I would build a recursive function.

def disjoint(l, res=[]):
    sl = []
    n = 1
    while n < len(l):
        p,q = l[n-1:n+1]
        if p[2] >= q[1] and p[0]==q[0]:
    if sl: disjoint(sl, res)
    return res

Running it on your example:

l = [('chr1',100,300), ('chr1', 150, 350), ('chr1', 200, 400), ('chr1', 500, 700), ('chr1',600, 800), ('chr1', 900, 1000)] 
print disjoint(l)

Gives you what you want:

[[('chr1', 100, 300), ('chr1', 500, 700), ('chr1', 900, 1000)], [('chr1', 150, 350), ('chr1', 600, 800)], [('chr1', 200, 400)]]
ADD COMMENTlink written 5.1 years ago by a.zielezinski7.8k

This is exact what I want. Thank you very much indeed Zielezinski.

ADD REPLYlink written 5.1 years ago by dustar1986260
gravatar for lh3
5.1 years ago by
United States
lh330k wrote:

As Alex has noted, your problem has multiple solutions. If you just want to find one of them, you can use a.zielezinski's script. If you want to iteratively find the maximum non-overlapping chain (i.e. you choose the chain having the longest total length and then choose the next longest chain from the rest of intervals), here is a naive solution based on dynamic programming.

Let I(i) be the i-th interval, |I(i)| be the length of the interval and f(i) the maximum total length if I(i) is chosen. We can compute f(i) with:

f(i) = max{f(j)+|I(i)| : j<i and I(j) not overlapping with I(i)}

If you keep j that maximizes f(i) in the above equation, you can backtrack to get the interval list. This is an O(N^2) algorithm. I said it is naive because there are O(NlogN) algorithm for such problems. You usually need a tree to maintain a sorted list of the rightmost coordinates. But for your 1000 intervals, O(N^2) is more than enough.

ADD COMMENTlink written 5.1 years ago by lh330k

Thanks Ih3. Sorry I did not make it clear. I only need one out of multiple solutions. But yours is really smart.

ADD REPLYlink written 5.1 years ago by dustar1986260
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1197 users visited in the last hour