mRNA Translation: TGG/UGG as a start codon?
4 months ago
rebelCoder ▴ 20

Hello everyone!

I have written some very simple code that generates proteins from DNA/RNA by building all 6 reading frames and matching each one against the DNA/RNA codon table:

DNA Codon table I use (from Wikipedia article about codons):

# 'M' - START, '_' - STOP

DNA_Codons = {
"GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
"TGT": "C", "TGC": "C",
"GAT": "D", "GAC": "D",
"GAA": "E", "GAG": "E",
"TTT": "F", "TTC": "F",
"GGT": "G", "GGC": "G", "GGA": "G", "GGG": "G",
"CAT": "H", "CAC": "H",
"ATA": "I", "ATT": "I", "ATC": "I",
"AAA": "K", "AAG": "K",
"TTA": "L", "TTG": "L", "CTT": "L", "CTC": "L", "CTA": "L", "CTG": "L",
"ATG": "M",
"AAT": "N", "AAC": "N",
"CCT": "P", "CCC": "P", "CCA": "P", "CCG": "P",
"CAA": "Q", "CAG": "Q",
"CGT": "R", "CGC": "R", "CGA": "R", "CGG": "R", "AGA": "R", "AGG": "R",
"TCT": "S", "TCC": "S", "TCA": "S", "TCG": "S", "AGT": "S", "AGC": "S",
"ACT": "T", "ACC": "T", "ACA": "T", "ACG": "T",
"GTT": "V", "GTC": "V", "GTA": "V", "GTG": "V",
"TGG": "W",
"TAT": "Y", "TAC": "Y",
"TAA": "_", "TAG": "_", "TGA": "_"
}

I have tested this on many sequences from NCBI, and my simple reading frames -> translation code works just fine for the most part. I found a few odd sequences on NCBI, and one of them is this:

>JF909299.1 Homo sapiens insulin (INS) mRNA, partial cds

The expected (as per NCBI) translated sequence is this:


If we generate 6 reading frames for JF909299.1 (based on the standard codon table) we will get this:


And as we can see, the second reading frame is the expected translation as per NCBI, but that means that: TGG -> W and it was used as a start codon. But TGG/UGG (W) is not a start codon?
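For readers following along, here is a minimal sketch of six-frame translation as described above. It is not the poster's actual code: the table is generated from the standard genetic code (NCBI translation table 1) with `_` for stop, as in the post, and `six_frames` plus the toy sequence are hypothetical. It assumes uppercase A/C/G/T input. Note that plain frame translation never looks for an ATG, which is why a frame can open with W.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order; '_' marks a stop.
AMINO_ACIDS = "FFLLSSSSYY__CC_WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {"".join(c): aa for c, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)}

def reverse_complement(dna):
    """Complement each base, then reverse the string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate codon by codon from position 0; a trailing partial codon is ignored."""
    return "".join(CODONS[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Return translations of the three forward and three reverse-strand frames."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames

# Toy sequence (made up): frame +2 begins with TGG, i.e. W.
for label, protein in six_frames("ATGGCCTAA").items():
    print(label, protein)
```

Frame +2 of the toy sequence starts with W simply because the reading window is shifted by one base, not because TGG acts as a start codon.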

I have searched the net and did not find any information about a case where TGG/UGG could be the start codon. This is supposed to be a standard Homo sapiens insulin (INS) mRNA. Another example is this online tool example link, that produces exactly the same result as my code (found proteins are marked red).

Can anyone please help me understand this particular case? Why does the NCBI database tell us that TGG/UGG can be a start codon instead of the standard ATG/AUG? This basically breaks my code, and I am looking to update it to support this logic as soon as I understand it.

Kind regards to this amazing community.

NCBI Translation mRNA Codons • 407 views

My understanding is that this sequence is a partial mRNA/CDS amplified by PCR, and it is missing the first 16 codons (MALWMRLLPLLALLAL), including the start codon (methionine).

I BLASTed the protein sequence from NCBI against the UniProt human protein database and obtained an alignment with 100% identity to P01308.

Then I did a global pairwise alignment with Needle from EMBOSS between the sequence retrieved from NCBI and UniProt (result available for a limited time).

I hope this answers your question,



I think I see what it is now. It is, as you mentioned, a partial cds, meaning a partial coding sequence (or a partial cDNA sequence if derived from a cDNA library). The partial designation means the sequence does not contain all of the expected start or stop signals.
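A minimal sketch of what this means for translation code, assuming uppercase A/C/G/T input: for a partial CDS you translate from a given reading-frame offset instead of searching for ATG. GenBank CDS features carry a `codon_start` qualifier (1, 2 or 3) for this purpose. The function name `translate_partial` and the toy fragment are hypothetical; the fragment is made up and is not the JF909299 sequence.

```python
from itertools import product

# Standard genetic code, same convention as the post: '_' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY__CC_WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {"".join(c): aa for c, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)}

def translate_partial(dna, codon_start=1):
    """Translate from a 1-based codon_start (1, 2 or 3) without looking for ATG."""
    if codon_start not in (1, 2, 3):
        raise ValueError("codon_start must be 1, 2 or 3")
    seq = dna[codon_start - 1:]
    # A trailing partial codon is ignored.
    return "".join(CODONS[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))

# Toy fragment whose first base belongs to a codon that was cut off upstream.
fragment = "GTGGGGTCCAGATCCTGCATAA"
print(translate_partial(fragment, codon_start=1))  # VGSRSCI
print(translate_partial(fragment, codon_start=2))  # WGPDPA_
```

With `codon_start=2` the first full codon is TGG, so the peptide begins with W even though no start codon is present in the fragment.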

This makes more sense now. Just one last question about this: why do we have a DNA sequence in NCBI that does not provide the full CDS in this case? I am just trying to understand the logic behind it.


You can amplify DNA or mRNA sequences by PCR to sequence and submit them to NCBI.

If you are not familiar with PCR (Polymerase Chain Reaction), this is harder to follow. Basically, it is a molecular biology technique for obtaining a specific target sequence from the full genome. Imagine that you want to sequence only one particular gene: you can amplify that gene by PCR (essentially isolating its DNA template from the rest of the genomic DNA) and sequence it. PCR relies on primers, short oligonucleotide sequences (~20 bp long) that pair at the 5' and 3' ends of the target gene region, so you always have at least one pair of primers, forward and reverse. Take the following example:

          5' primer -->
 5' -------------------++++----------- 3' (genome)
                           <-- 3' primer

In this case, you will amplify the gene represented by the + characters. Amplification means isolating the target gene (thanks to the specificity of the primers) and synthesizing it over multiple cycles, in order to yield enough DNA molecules of that gene to sequence (otherwise you do not have enough material).

To answer your question: primers are short stretches of DNA that can, by chance, pair with non-target regions of the genome. Therefore, when designing primers for a specific gene, you usually target a highly conserved or unique region of that gene to ensure specificity. Of course, this region does not always coincide with the beginning of the gene. The resulting partial gene/mRNA/CDS sequences can still be submitted to NCBI/GenBank, where they appear in the database as incomplete gene sequences.
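To make the primer argument concrete, here is a toy in-silico PCR sketch. It is only an illustration under simplifying assumptions: exact string matching, a single template strand, unrealistically short 9-base primers, and a made-up template. The helper name `in_silico_pcr` is hypothetical.

```python
def reverse_complement(dna):
    """Complement each base, then reverse the string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def in_silico_pcr(template, forward, reverse):
    """Return the amplicon bounded by two primers (exact matches only), or None."""
    start = template.find(forward)
    if start == -1:
        return None
    # The reverse primer is written 5'->3' on the bottom strand, so on the
    # top strand its binding site reads as the primer's reverse complement.
    site = reverse_complement(reverse)
    hit = template.find(site, start + len(forward))
    if hit == -1:
        return None
    return template[start:hit + len(site)]

# Toy template: flanking bases + a made-up CDS + flanking bases.
cds = "ATGGCACTGTGGGGTCCAGATCCTGCATAA"
template = "GGATCC" + cds + "GAATTC"

# The forward primer sits inside the CDS, downstream of the ATG.
amplicon = in_silico_pcr(template, forward="GTGGGGTCC", reverse="TTATGCAGG")
print(amplicon)                    # GTGGGGTCCAGATCCTGCATAA
print(amplicon.startswith("ATG"))  # False
```

Because the forward primer anneals downstream of the ATG, the amplified product is a partial CDS: it contains the end of the coding region but not its start codon.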

I'm not sure if this is the only reason why you have partial DNA/mRNA sequences, but it is at least one of them.

Why do we have a DNA sequence in NCBI that does not provide full cds in this case?

Well, the technology has evolved quickly since next-generation sequencing emerged around 2005, but even then and in the years that followed, sequencing was relatively expensive and involved many laborious PCR or cloning steps. It still does today, since many library preps require PCR, but the technology now provides higher throughput, and it is easier and cheaper to sequence full DNA sequences, even entire genomes. In the end, it also depends on your objective. If you're assessing potential mutations in one domain of a gene, say, you don't need to sequence the full gene.

I hope this helps,



Hey! Thanks. I am very new to Biology part of this. Just trying to write some basic DNA/RNA processing algorithms.

I am still confused about why NCBI has this particular DNA sequence as ORIGIN, but when we apply a basic/standard translation algorithm to it, it does not produce what NCBI has as a translation result protein.

Do we need to take additional steps in regard to that DNA sequence to produce that protein?


So, NCBI also has the full DNA sequence for this gene and respective translated protein (check 1st CDS - there are two): here

Do we need to take additional steps in regard to that DNA sequence to produce that protein?

I think one of the criteria to choose the right ORF is the length of the predicted ORF (that would correspond to your 2nd and 5th predicted sequences). In this case, perhaps NCBI relied on some information that the user that submitted this sequence provided.
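The length criterion can be sketched as follows. This is a simple heuristic for illustration, not a description of how NCBI annotates records; the function names and the toy translations are hypothetical, and `_` marks a stop as in the original post.

```python
def longest_open_stretch(protein, stop="_"):
    """Length of the longest run of residues not interrupted by a stop."""
    return max(len(segment) for segment in protein.split(stop))

def pick_frame(frames):
    """Return the label of the frame with the longest stop-free stretch."""
    return max(frames, key=lambda label: longest_open_stretch(frames[label]))

# Hypothetical translations of three frames (toy strings, not real output).
frames = {"+1": "VG_SRS_CI", "+2": "WGPDPAAAFVNQ", "+3": "GV_Q"}
print(pick_frame(frames))  # +2
```

A frame that reads through without stops is a good candidate for a partial CDS even when it does not begin with M.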

4 months ago

Ah, the first time I get to demonstrate my cool new software to the world :-) I designed it exactly to explore these types of problems. I understand the author has already figured out the answer; this is more a demonstration, for the next person with the same problem, of what they could do to figure out why their DNA does not match the protein.

More info at, install it with

pip install bio

Let's get your data first

bio fetch JF909299

behind the scenes the data is stored as JSON, you can take a look at what this data contains:

bio convert JF909299 --json | more

Now get the origin sequence for the data:

bio convert JF909299 --fasta


>JF909299.1 Homo sapiens insulin (INS) mRNA, partial cds

We note above that it says "partial cds". Let's translate the origin:

bio convert JF909299 --fasta --translate

it looks odd indeed:

>JF909299.1 translated

Let's check the protein deposited with this record

bio convert JF909299 --fasta --protein

it prints:

>AEG19452.1 ID=AEG19452.1;Name=AEG19452.1;gene=INS;note=decreases blood glucose concentration and increases cell permeability to monosaccharides, amino acids and fatty acids; accelerates glycolysis, pentose phosphate cycle, and glycogen synthesis in liver;codon_start=2;product=insulin;protein_id=AEG19452.1

Does not match the translation. Since it is a partial sequence maybe it is in a different frame, let's go back to the origin and translate the second frame:

bio convert JF909299 --fasta --translate -start 2

and that indeed matches:

>JF909299.1 [2:285] translated

let's keep looking, what organism is this data from

bio taxon JF909299

it prints:

species, Homo sapiens (human), 9606
   subspecies, Homo sapiens neanderthalensis (Neandertal), 63221
   subspecies, Homo sapiens subsp. 'Denisova' (Denisova hominin), 741158

what is the lineage:

bio taxon JF909299 --lineage


no rank, cellular organisms, 131567
   superkingdom, Eukaryota (eucaryotes), 2759
      clade, Opisthokonta, 33154
         kingdom, Metazoa (metazoans), 33208
            clade, Eumetazoa, 6072
               clade, Bilateria, 33213
               ... ... ... ... ... ... ... ...
                              subfamily, Homininae, 207598
                                  genus, Homo, 9605
                                       species, Homo sapiens (human), 9606

what is known about insulin?

bio define insulin


GO:0005009  insulin-activated receptor activity
GO:0005010  insulin-like growth factor-activated receptor activity
GO:0005158  insulin receptor binding
GO:0005159  insulin-like growth factor receptor binding
GO:0005360  insulin-responsive glucose
GO:0005520  insulin-like growth factor binding

and there is a lot more that bio can do.


Amazing tool. Congrats!

4 months ago

JF909299.1 Homo sapiens insulin (INS) mRNA, partial cds

This is a partial DNA sequence. How do you know it contains the ATG?

Aligning the peptide shows it's missing the 5'/NH2 side:

E-value: 5.3e-66
Score: 512
Ident.: 100.0%
Positives: 100.0%
Query Length: 94
Match Length: 110

Query              1  WGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGS   60
P01308 INS_HUMAN  17  WGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGS   76

Query             61  LQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN   94
P01308 INS_HUMAN  77  LQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN  110
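The same check can be sketched without an alignment tool when the match is exact. In this minimal example the full-length protein is reconstructed from the thread itself (the 16 missing residues quoted earlier plus the 94 aligned residues), so it is an assumption rather than a sequence downloaded from UniProt; the variable names are hypothetical.

```python
# Residues reported as missing from the partial record, then the aligned peptide.
missing_n_terminus = "MALWMRLLPLLALLAL"
partial = (
    "WGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGS"
    "LQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"
)
full = missing_n_terminus + partial  # full-length protein as implied by the alignment

offset = full.find(partial)  # 0-based index of the first matching residue
print(offset + 1, offset + len(partial))  # 17 110
print(full[:offset])                      # MALWMRLLPLLALLAL
```

An exact substring search only works for 100% identity matches like this one; anything with mismatches or gaps needs a real aligner such as BLAST or Needle.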
