Question: With Python, how can I solve this restriction mapping problem?
gravatar for francois
3.9 years ago by
francois10 wrote:

I'm struggling with the following Python project, which I'll call multi-digest restriction mapping. It is originally a bioinformatic problem, but you really do not need any biology knowledge to understand the problem.

Imagine you have a DNA sequence (you do not know the actual nucleotides or anything else about it) and two restriction enzymes. Restriction enzymes are simply enzymes that cut DNA at enzyme-specific positions, called restriction sites (that's really all for the biology). When you place your sequence with one of the enzymes, the enzymes cuts it at its restriction sites. Two different enzymes will never have the same restriction site (eg. if enzyme 1 cuts at position 5, enzyme 2 cannot).

What you do is:

  • You put the sequence with enzyme 1: you get a set of fragments, which I'm calling e1_distances
  • You put the sequence with enzyme 2: you get a set of fragments, which I'm calling e2_distances
  • You put the sequence with all three enzymes combined: you get a set of fragments, which I'm calling all_distances

Now suppose you are given these three sets of fragments (e1_distances, e2_distances, all_distances), what I need to get are the restriction sites for each enzyme i.e. where (the positions) where each enzyme has cut.

I'm trying to write an algorithm in Python to solve this problem. The same algorithm also needs to work when there are three enzymes (i.e. when I have e1_distances, e2_distances, e2_distances, and all_distances). I think it should be of systematic search or branch-and-bound type.

Example enter image description here

I'm given the lengths of the fragments obtained after enzyme 1 (circle in the example) cut, enzyme 2 (square in the example) cut, and two enzymes combined cut at the same time, respectively:

  • e1_distances = [1, 2, 4, 5, 6]
  • e2_distances = [1, 3, 3, 3, 8]
  • all_distances = [1, 1, 1, 1, 2, 2, 2, 3, 5]

I'm supposed to get something like the following, i.e. the positions where the enzymes have cut:

  • Positions where enzyme 1 cut: e1_sites = [1, 5, 10, 16]
  • Positions where enzyme 2 cut: e2_sites = [3, 11, 14, 17]

Here is my code so far. It's not returning anything good but it was my first idea of the problem...

Can you help?

# Complete digest with three enzymes

# sites are the lengths of the fragments we obtain after complete digestion
e1_distances = [1, 2, 4, 5, 6]
e2_distances = [1, 3, 3, 3, 8]
all_distances = [1, 1, 1, 1, 2, 2, 2, 3, 5] # the list I'm iterating on
all_distances_0 = [1, 1, 1, 1, 2, 2, 2, 3, 5] # the list I'll delete elements from
# I think I should not iterate on a list I'm modifying at the same time, this is why I have two lists

# sites are the restriction site positions for each enzyme
e1_sites = [0]
e2_sites = [0]
all_sites = [0]
# all_sites is simply the total of restriction sites for all three enzymes

def appendNewSite (dis, sites_list):
# appends the new restriction site to the two sets
    new_site = max (sites_list) + dis
    sites_list.append (new_site)
    all_sites.append (new_site)
    print sites_list
    print all_sites

# at the end: if there are no more distances in all_distances, we are done
if len(all_distances_0) == 0:
    print 'Enzyme 1 restriction sites: ', e1_sites
    print 'Enzyme 2 restriction sites: ', e2_sites
    print 'Total restriction sites: ', all_sites

# if there is still something, we should place the corresponding restriction site in one of the set
    for d in all_distances:

        print '--> from all_distances, we pick distance:', d

# What are the distances we are actually looking for in each enzyme multiset?
        d_e1 = max(all_sites) + d - e1_sites[len(e1_sites)-1]
        d_e2 = max(all_sites) + d - e2_sites[len(e2_sites)-1]

        print 'in e1_distances we are looking for', d_e1
        print 'in e2_distances we are looking for', d_e2

# Is the distance d in one of the set of distances?
        if d_e1 in e1_distances:

            print d_e1, 'is in e1_distances'

            appendNewSite (d_e1, e1_sites)

            e1_distances.remove (d_e1)
            all_distances_0.remove (d)
            print 'e1_distances', e1_distances
            print 'all_distances', all_distances_0

        elif d_e2 in e2_distances:
            print d_e2, 'is in e2_distances'

            appendNewSite (d_e2, e2_sites)

            e2_distances.remove (d_e2)
            all_distances_0.remove (d)
            print 'e2_distances', e2_distances
            print 'all_distances', all_distances_0

PS: Don't hesitate to tell me my code looks terrible, I'm new at this. Tell me however what I'm doing wrong.

restriction python • 3.3k views
ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by francois10

I am so sorry my code looks so bad. I've tried several ways to put it all in the grey box, but it won't let me do it I'm not sure why. Can someone help me post it correctly? It starts with "#Complete digest with three enzymes". But it is understanding the sentence starting with # as titles, not comments.

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by francois10

I have reformatted the code in your post (hint: highlight code text and then click on the 101010 button). See if it looks correct.

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by GenoMax95k

If I get it right, then you can have a function that takes a vector of distances as input and returns it as a vector of sites. It does not really matter if you have 1, 2, 3, 4 or more restriction enzymes. This function is like this:

distances = [2,3,2,2]

def distances2sites( distances ):
  sites = [0]
  for i in range( 0, len(distances) ) :
    sites.append( sites[i] + distances[i] )
  return sites

sites = distances2sites( distances )

print sites
ADD REPLYlink written 3.9 years ago by Petr Ponomarenko2.7k

Thank you so much for your answer. My example was terrible because it implied that fragments (distances) were already in the right order. My bad! This is not actually the case. The order in which they are in the set is completely irrelevant/random. I've edited with another example and an illustration. I hope it is clearer now. Can you take a look?

ADD REPLYlink written 3.9 years ago by francois10

What you are talking about is double digest problem which is well studied and NP-complete. Because of this you need to write an algorithm that tests if a given answer is correct or not. Then for small datasets like in your example you can brute force (ie testing all potential scenarios will work), for bigger sets an optimization algorithm like genetic algorithms can be used to get reasonable time to compute. Also in real lab setting you have errors in fragment length, and empiric optimization algorithms can handle this better. Unless you add some extra information to the datasets the problem is very hard. Also, several answers can be valid for double digest. I can not write a program for you. It is too time-consuming and most likely outside the scope of Biostar's, but you can start from reading papers about this problem.

ADD REPLYlink written 3.9 years ago by Petr Ponomarenko2.7k

No, obviously I wasn't asking for anyone to write the program. Just some leads to get me started. Your answer was super helpful, thanks a lot.

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by francois10
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1226 users visited in the last hour