I'm struggling with the following Python project, which I'll call multi-digest restriction mapping. It is originally a bioinformatic problem, but you really do not need any biology knowledge to understand the problem.
Imagine you have a DNA sequence (you do not know the actual nucleotides or anything else about it) and two restriction enzymes. Restriction enzymes are simply enzymes that cut DNA at enzyme-specific positions, called restriction sites (that's really all for the biology). When you place your sequence with one of the enzymes, the enzymes cuts it at its restriction sites. Two different enzymes will never have the same restriction site (eg. if enzyme 1 cuts at position 5, enzyme 2 cannot).
What you do is:
- You put the sequence with enzyme 1: you get a set of fragments, which I'm calling e1_distances
- You put the sequence with enzyme 2: you get a set of fragments, which I'm calling e2_distances
- You put the sequence with all three enzymes combined: you get a set of fragments, which I'm calling all_distances
Now suppose you are given these three sets of fragments (e1_distances, e2_distances, all_distances), what I need to get are the restriction sites for each enzyme i.e. where (the positions) where each enzyme has cut.
I'm trying to write an algorithm in Python to solve this problem. The same algorithm also needs to work when there are three enzymes (i.e. when I have e1_distances, e2_distances, e2_distances, and all_distances). I think it should be of systematic search or branch-and-bound type.
I'm given the lengths of the fragments obtained after enzyme 1 (circle in the example) cut, enzyme 2 (square in the example) cut, and two enzymes combined cut at the same time, respectively:
- e1_distances = [1, 2, 4, 5, 6]
- e2_distances = [1, 3, 3, 3, 8]
- all_distances = [1, 1, 1, 1, 2, 2, 2, 3, 5]
I'm supposed to get something like the following, i.e. the positions where the enzymes have cut:
- Positions where enzyme 1 cut: e1_sites = [1, 5, 10, 16]
- Positions where enzyme 2 cut: e2_sites = [3, 11, 14, 17]
Here is my code so far. It's not returning anything good but it was my first idea of the problem...
Can you help?
# Complete digest with three enzymes # sites are the lengths of the fragments we obtain after complete digestion e1_distances = [1, 2, 4, 5, 6] e2_distances = [1, 3, 3, 3, 8] all_distances = [1, 1, 1, 1, 2, 2, 2, 3, 5] # the list I'm iterating on all_distances_0 = [1, 1, 1, 1, 2, 2, 2, 3, 5] # the list I'll delete elements from # I think I should not iterate on a list I'm modifying at the same time, this is why I have two lists # sites are the restriction site positions for each enzyme e1_sites =  e2_sites =  all_sites =  # all_sites is simply the total of restriction sites for all three enzymes def appendNewSite (dis, sites_list): # appends the new restriction site to the two sets new_site = max (sites_list) + dis sites_list.append (new_site) all_sites.append (new_site) print sites_list print all_sites # at the end: if there are no more distances in all_distances, we are done if len(all_distances_0) == 0: print 'Enzyme 1 restriction sites: ', e1_sites print 'Enzyme 2 restriction sites: ', e2_sites print 'Total restriction sites: ', all_sites # if there is still something, we should place the corresponding restriction site in one of the set else: for d in all_distances: print '--> from all_distances, we pick distance:', d # What are the distances we are actually looking for in each enzyme multiset? d_e1 = max(all_sites) + d - e1_sites[len(e1_sites)-1] d_e2 = max(all_sites) + d - e2_sites[len(e2_sites)-1] print 'in e1_distances we are looking for', d_e1 print 'in e2_distances we are looking for', d_e2 # Is the distance d in one of the set of distances? if d_e1 in e1_distances: print d_e1, 'is in e1_distances' appendNewSite (d_e1, e1_sites) e1_distances.remove (d_e1) all_distances_0.remove (d) print 'e1_distances', e1_distances print 'all_distances', all_distances_0 elif d_e2 in e2_distances: print d_e2, 'is in e2_distances' appendNewSite (d_e2, e2_sites) e2_distances.remove (d_e2) all_distances_0.remove (d) print 'e2_distances', e2_distances print 'all_distances', all_distances_0
PS: Don't hesitate to tell me my code looks terrible, I'm new at this. Tell me however what I'm doing wrong.