Hello!
If we apply a basic algorithm to a reading frame, scanning it for START and STOP codons to assemble a possible protein, we get cases where there are multiple START codons before a single STOP codon:
Example:
Reading Frame:
['A', 'S', 'M', 'A', 'P', 'M', 'Q', 'P', 'I', 'T', 'P', 'S', 'A', 'T', '_', 'T']
We see that there are two START codons (M) and one STOP codon (_), so generating possible proteins from this reading frame gives two results:
1) MAPMQPITPSAT
2) MQPITPSAT
The first one contains the second one and also has a START codon inside it.
Question (from a programmer): Is there ever a need to generate sequences like that? Is it useful for some kind of statistics, or do we discard the first one and only use sequences that have one START and one STOP codon?
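For reference, here is a minimal Python sketch of the kind of scan described above. It is not necessarily the code used in this question; the function name `candidate_proteins` and its parameters are made up for illustration, and it assumes the frame is already translated, with `M` marking START and `_` marking STOP.

```python
def candidate_proteins(frame, start="M", stop="_"):
    """Return every START..STOP stretch in a translated reading frame."""
    proteins = []
    open_starts = []  # indices of START residues seen since the last STOP
    for i, aa in enumerate(frame):
        if aa == stop:
            # Every START that is still open ends at this STOP.
            for s in open_starts:
                proteins.append("".join(frame[s:i]))
            open_starts = []
        elif aa == start:
            open_starts.append(i)
    return proteins


frame = ['A', 'S', 'M', 'A', 'P', 'M', 'Q', 'P', 'I', 'T', 'P', 'S', 'A', 'T', '_', 'T']
print(candidate_proteins(frame))
# ['MAPMQPITPSAT', 'MQPITPSAT']
```

Each extra M before the same STOP adds one more nested candidate, which is where the redundancy comes from.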
Regards. Juris.
I don't understand your problem. Many (most?) proteins have more than one methionine.
Oh, it is not a problem at all. I understand the process and my code works just fine. All I am asking is: are those multiple-START-codon sequences useful, and do we have any real proteins with two or more START codons in them? It does not look like there are. So this is not a problem I have, but a question about the application and usefulness of those sequences. Some sequences I work on generate hundreds of redundant amino acid chains like that, each with multiple START codons.
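If the redundancy itself is the nuisance, one option is to keep only the longest candidate for each STOP, i.e. the one beginning at the first M after the previous STOP. The sketch below assumes that is what is wanted; `longest_per_stop` is a hypothetical helper name, and picking the first M is a bookkeeping simplification, not a claim that it is the real start site.

```python
def longest_per_stop(frame, start="M", stop="_"):
    """Keep one candidate per STOP: the stretch from the first START to that STOP."""
    proteins = []
    first_start = None  # index of the first START since the last STOP
    for i, aa in enumerate(frame):
        if aa == stop:
            if first_start is not None:
                proteins.append("".join(frame[first_start:i]))
            first_start = None
        elif aa == start and first_start is None:
            first_start = i  # later STARTs before this STOP are ignored
    return proteins


frame = ['A', 'S', 'M', 'A', 'P', 'M', 'Q', 'P', 'I', 'T', 'P', 'S', 'A', 'T', '_', 'T']
print(longest_per_stop(frame))
# ['MAPMQPITPSAT']
```

The shorter nested chains can always be recovered from the longest one by cutting at each internal M, so nothing is lost by storing only this form.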
Within reason (and when supported by experimental evidence), proteins can have alternative start sites, but not every START codon you see is going to code for a real protein.
I think I found a good answer here:
https://www.researchgate.net/post/im_working_on_a_gene_which_sequence_have_two_ATG_so_i_get_confused_that_which_ATG_is_start_codon
You would likely need to look for additional clues, such as proximity to promoters and ribosome binding sites to determine which ATG is the 'right one'.
Biology is messy, though, and it's quite possible that the forms that appear 'truncated' to us are still produced and have some sort of role we don't yet understand.