9.0 years ago by
I'm sorry to discourage you, but your problem is intractable, at least if the number of N is not exact. The reason is simple: you get too many possible combinations even out of your small example. In your example file you have, as I count, 14 blocks of 'real' sequence. They could be in any reading frame, and each block can be in a different reading frame, so in total you got in the worst case (2 is for both directions):
2*3^14 = 9565938 possibilities for a total translation.
Even if the maximal shift in one block of N's was +-1 only, that would not help at all, because after three blocks again you have the same dilemma. Even if you could compute all combination, you could not interpret them, and it makes no sense to try to output them all.
Alternative idea I just thought out, basically forget about the N's:
- split the sequence into segments of 'real sequence', discarding the N's
- for +- strand, order segments from first to last
- translate each segment in 3 frames, marking stop-codons
- now, search from the last segment (eg. if in + strand, start with the last segment) for the longest translation that is not containg a stop codon
- discard all translated segments closer to the start that contain a stop codon
- if all three alternatives for a segment contain stop codons: this is your new end of longest translation
Output: Output the remaining alternatives segment-wise instead of all possible combinations
Maybe Pierre can implement this too, does it make sense?