Question: Resolving Approximate Translation For Incomplete Reconstructed Transcript
9.0 years ago by
2184687-1231-83-5.0k wrote:

Hi,

I've got transcripts with incomplete sequences that are padded by strings of NNNs like this: http://www.ebi.ac.uk/~avilella/file.fasta

I would like to obtain an approximate translation of this transcript by stretching or compressing the runs of Ns to resolve possible frameshifts, so that I get the longest possible translation.

What program could I use for this?

Cheers

translation transcript • 1.5k views
modified 3.9 years ago by Biostar ♦♦ 20 • written 9.0 years ago by 2184687-1231-83-5.0k

So if I understand correctly, the length of the "NNN" segments is approximate? Do you have upper/lower bounds for those segment lengths?

written 9.0 years ago by Neilfws48k

With so many segments composed of 'N's, you would probably end up with many possible alternatives, all with the same maximum translation length. In fact, the exact length of the N runs is impossible to determine, since only 6 reading frames are possible (tell me if I'm wrong). Hence, if, say, 22 is a possible length for a given N segment, then so is any number of the form (1 + 3*x), which represents a +2 ORF. The problem is then to maximize ORF length while allowing each N segment to take only 3 lengths (N, NN, NNN). What do you think?
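To illustrate the point about lengths of the form (1 + 3*x): only the length of an N run modulo 3 changes the frame of whatever follows it. A minimal Python sketch, assuming codons are counted from the first base of the whole sequence; `downstream_frame` is a made-up helper name, not part of any tool:

```python
def downstream_frame(upstream_len, n_len):
    """Offset (0, 1 or 2) of the first complete codon inside the block that
    follows an N run, assuming codons are counted from the first base of the
    whole sequence (an assumption made for this illustration)."""
    return -(upstream_len + n_len) % 3

# N runs of length 1, 22 and 25 all leave the next block in the same frame.
for n_len in (1, 22, 25):
    print(n_len, downstream_frame(6, n_len))

# Lengths 1, 2 and 3 already cover the three possible frames.
print(sorted(downstream_frame(6, n) for n in (1, 2, 3)))
```

So for the purpose of frame selection, trying N, NN and NNN for each run is enough; longer runs only add unknown residues.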

written 9.0 years ago by Eric Normandeau10k

I think this is quite a complex problem! The aim is to alter the length of "NNN" segments in order to maximise ORF length. Clearly we could "stretch" each set of N for as long as we wish, provided that we get a frame across the entire sequence. So there has to be an upper limit on N lengths.

written 9.0 years ago by Neilfws48k

Looking at the fasta file, I'm still unclear as to whether the 'N' are padding (that is, exact length of N segments unknown) or just "base unknown", but segment length is exact. If the latter, see my answer below.

written 9.0 years ago by Neilfws48k
9.0 years ago by
Neilfws48k
Sydney, Australia
Neilfws48k wrote:

A partial answer: sixpack, from the EMBOSS package, will translate a sequence containing 'N' in all six reading frames and write the resulting protein sequences to a fasta file, from which you can select the longest.

The issue of stretching/compressing the "N" segments is more complex: see discussion in comments under the question. Obviously, you could add as many "N" as you like to maximise ORF length, so there have to be some upper and lower bounds on how many "N" are allowed.
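If EMBOSS is not at hand, the six-frame step can be sketched in a few lines of Python. This is not how sixpack is implemented; it is only a rough stand-in that assumes the standard genetic code and writes 'X' for any codon containing an N. All function names are made up for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def translate(seq):
    """Translate from the first base; codons with N (or other symbols) become X."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def reverse_complement(seq):
    return seq.upper().translate(str.maketrans("ACGTN", "TGCAN"))[::-1]

def six_frames(seq):
    """Return a dict such as {'+1': ..., '-3': ...} with one protein per frame."""
    frames = {}
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames

def longest_open_stretch(protein):
    """Longest piece of a translation that contains no stop codon (*)."""
    return max(protein.split("*"), key=len)

for name, protein in six_frames("ATGAAANNNTTTTAA").items():
    print(name, protein, longest_open_stretch(protein))
```

Note that treating every N-containing codon as 'X' means such codons never count as stops, which is an optimistic assumption.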

modified 10 months ago by RamRS22k • written 9.0 years ago by Neilfws48k
9.0 years ago by
Bergen, Norway
Michael Dondrup46k wrote:

I'm sorry to discourage you, but your problem is intractable, at least if the number of N is not exact. The reason is simple: you get too many possible combinations, even from your small example. In your example file you have, as I count, 14 blocks of 'real' sequence. Each block can be in any of the three reading frames, independently of the others, so in the worst case you get 2*3^14 = 9565938 possibilities for a total translation (the factor 2 accounts for both strands).

Even if the maximal shift in one block of N's was only +/-1, that would not help at all, because after three blocks you have the same dilemma again. Even if you could compute all combinations, you could not interpret them, and it makes no sense to try to output them all.
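The count above is easy to check. One way to see why a limit of +/-1 per N run does not shrink the search space is that the shifts -1, 0 and +1 are already the three possible residues modulo 3. A small Python check (the function name is only for illustration):

```python
def count_frame_assignments(n_blocks, frames_per_block=3, strands=2):
    """Number of ways to give every block its own reading frame, on either strand."""
    return strands * frames_per_block ** n_blocks

print(count_frame_assignments(14))

# Shifts of -1, 0 and +1 cover every frame, so each block still has 3 options.
print(sorted({shift % 3 for shift in (-1, 0, 1)}))
```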

An alternative idea I just thought of: basically, forget about the N's.

  1. split the sequence into segments of 'real sequence', discarding the N's
  2. for each strand (+ and -), order the segments from first to last
  3. translate each segment in 3 frames, marking stop codons
  4. now, starting from the last segment (e.g. on the + strand, start with the last segment), search for the longest translation that does not contain a stop codon
  5. discard all translated segments closer to the start that contain a stop codon
  6. if all three alternatives for a segment contain stop codons: this is your new end of the longest translation

Output: the remaining alternatives, segment by segment, instead of all possible combinations
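One possible reading of the steps above, as a rough Python sketch for the + strand only (pass the reverse complement to handle the - strand). It assumes the standard genetic code. The function names are invented for the example, and steps 4 to 6 are interpreted here as "find the longest run of consecutive segments that each keep at least one stop-free frame", which is an assumption rather than part of the proposal.

```python
import re
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def translate(seq):
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def split_blocks(seq):
    """Step 1: keep only the runs of real sequence, dropping the N's."""
    return [b for b in re.split(r"N+", seq.upper()) if b]

def open_frames_per_block(seq):
    """Steps 2-3: for each block, keep the frames (offset 0, 1, 2) without a stop."""
    per_block = []
    for block in split_blocks(seq):
        frames = {off: translate(block[off:]) for off in range(3)}
        per_block.append({off: p for off, p in frames.items() if "*" not in p})
    return per_block

def longest_open_run(per_block):
    """Assumed reading of steps 4-6: the longest run of consecutive blocks that
    each have at least one stop-free frame, as a half-open index range."""
    best, start = (0, 0), 0
    for i, frames in enumerate(per_block + [{}]):  # empty dict acts as a sentinel
        if not frames:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = i + 1
    return best

seq = "ATGAAA" "NNNN" "TAAATAGATGA" "NNNNN" "GGGCCC" "NN" "GCTGCA"
per_block = open_frames_per_block(seq)
start, end = longest_open_run(per_block)
for i in range(start, end):
    print(i, per_block[i])  # segment-wise output of the surviving alternatives
```

The output stays linear in the number of segments, because the alternatives are listed per segment and never multiplied out.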

Maybe Pierre can implement this too, does it make sense?

modified 10 months ago by RamRS22k • written 9.0 years ago by Michael Dondrup46k
9.0 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum121k wrote:

Brute force method using C. And obviously, there are maaaaany results....

compile with:

gcc -Wall -O3 prog.c

exec:

./a.out
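For illustration only (this is not the C program referred to above), a brute-force search could look roughly like the Python sketch below. It assumes the standard genetic code, tries N-run lengths of 1, 2 and 3 for every gap, ignores leading and trailing N's, writes 'X' for codons containing N, and keeps only the first best hit even though many candidates tie. All names are hypothetical.

```python
import re
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def translate(seq):
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def reverse_complement(seq):
    return seq.upper().translate(str.maketrans("ACGTN", "TGCAN"))[::-1]

def brute_force_longest(seq, n_lengths=(1, 2, 3)):
    """Try every combination of N-run lengths, both strands, three offsets."""
    best = {"peptide": "", "strand": None, "offset": None, "n_lengths": None}
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        blocks = [b for b in re.split(r"N+", s) if b]
        if not blocks:
            continue
        for lengths in product(n_lengths, repeat=len(blocks) - 1):
            # Rebuild the sequence with the chosen length for each N run.
            parts = [blocks[0]]
            for n, block in zip(lengths, blocks[1:]):
                parts.append("N" * n + block)
            candidate = "".join(parts)
            for offset in range(3):
                peptide = max(translate(candidate[offset:]).split("*"), key=len)
                if len(peptide) > len(best["peptide"]):
                    best = {"peptide": peptide, "strand": strand,
                            "offset": offset, "n_lengths": lengths}
    return best

print(brute_force_longest("ATGAAATNNAGGGCCC"))
```

The number of candidates grows as 3 to the power of the number of N runs, so this is only practical for short examples.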

modified 10 months ago by RamRS22k • written 9.0 years ago by Pierre Lindenbaum121k