Hello everyone!
I have written some very simple code to generate proteins from DNA/RNA by generating 6 reading frames and translating each of them with the DNA/RNA codon table.
The DNA codon table I use (from the Wikipedia article about codons):
# 'M' - START, '_' - STOP
DNA_Codons = {
"GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
"TGT": "C", "TGC": "C",
"GAT": "D", "GAC": "D",
"GAA": "E", "GAG": "E",
"TTT": "F", "TTC": "F",
"GGT": "G", "GGC": "G", "GGA": "G", "GGG": "G",
"CAT": "H", "CAC": "H",
"ATA": "I", "ATT": "I", "ATC": "I",
"AAA": "K", "AAG": "K",
"TTA": "L", "TTG": "L", "CTT": "L", "CTC": "L", "CTA": "L", "CTG": "L",
"ATG": "M",
"AAT": "N", "AAC": "N",
"CCT": "P", "CCC": "P", "CCA": "P", "CCG": "P",
"CAA": "Q", "CAG": "Q",
"CGT": "R", "CGC": "R", "CGA": "R", "CGG": "R", "AGA": "R", "AGG": "R",
"TCT": "S", "TCC": "S", "TCA": "S", "TCG": "S", "AGT": "S", "AGC": "S",
"ACT": "T", "ACC": "T", "ACA": "T", "ACG": "T",
"GTT": "V", "GTC": "V", "GTA": "V", "GTG": "V",
"TGG": "W",
"TAT": "Y", "TAC": "Y",
"TAA": "_", "TAG": "_", "TGA": "_"
}
I have tested this on many sequences from NCBI, and my simple reading frames -> translation code works just fine for the most part. I found a few odd sequences on NCBI, and one of them is this:
https://www.ncbi.nlm.nih.gov/nuccore/JF909299.1
>JF909299.1 Homo sapiens insulin (INS) mRNA, partial cds
CTGGGGACCTGACCCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAGCTCTC
TACCTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGACCTGCAGG
TGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGCAGGCAGCCTGCAGCCCTTGGCCCTGGAGGGGTCCCT
GCAGAAGCGTGGCATTGTGGAACAATGCTGTACCAGCATCTGCTCCCTCTACCAGCTGGAGAACTACTGC
AACTA
The expected translated sequence (as per NCBI) is this:
/translation="WGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED
LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"
If we generate 6 reading frames for JF909299.1 (based on the standard codon table), we get this:
- LGT_PSRSLCEPTPVRLTPGGSSLPSVRGTRLLLHTQDPPGGRGPAGGAGGAGRGPWCRQPAALGPGGVPAEAWHCGTMLYQHLLPLPAGELLQL
- WGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
- GDLTQPQPL_TNTCAAHTWWKLST_CAGNEASSTHPRPAGRQRTCRWGRWSWAGALVQAACSPWPWRGPCRSVALWNNAVPASAPSTSWRTTAT
- _LQ_FSSW_REQMLVQHCSTMPRFCRDPSRAKGCRLPAPGPPPSSTCPTCRSSASRRVLGV_KKPRSPHTR_RASTRCEPHRCWFTKAAAGSGPQ
- SCSSSPAGRGSRCWYSIVPQCHASAGTPPGPRAAGCLHQGPRPAPPAPPAGPLPPGGSWVCRRSLVPRTLGRELPPGVSRTGVGSQRLRLGQVP
- VAVVLQLVEGADAGTALFHNATLLQGPLQGQGLQAACTRAPAQLHLPHLQVLCLPAGLGCVEEASFPAH_VESFHQV_AAQVLVHKGCGWVRSP
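The six frames above can be reproduced with a minimal sketch like the one below. It is not the code from this post (which is not shown); the names `CODON_TABLE`, `translate`, and `six_frames` are made up for illustration, and the codon table is rebuilt compactly from the standard genetic code using `_` for stop, as in `DNA_Codons`. It assumes an upper-case sequence containing only A, C, G, and T.

```python
# Standard genetic code, built compactly; '_' marks a stop codon,
# matching the DNA_Codons table in the post.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY__CC_WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq, offset=0):
    """Translate codon by codon from `offset`; a trailing partial codon is dropped.

    No start codon is required: every codon in the frame is translated.
    """
    return "".join(
        CODON_TABLE[seq[i:i + 3]] for i in range(offset, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Frames 1-3 from the given strand, frames 4-6 from its reverse complement."""
    rev = reverse_complement(seq)
    return [translate(strand, off) for strand in (seq, rev) for off in range(3)]


# JF909299.1, copied from the FASTA record above.
JF909299 = (
    "CTGGGGACCTGACCCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAGCTCTC"
    "TACCTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGACCTGCAGG"
    "TGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGCAGGCAGCCTGCAGCCCTTGGCCCTGGAGGGGTCCCT"
    "GCAGAAGCGTGGCATTGTGGAACAATGCTGTACCAGCATCTGCTCCCTCTACCAGCTGGAGAACTACTGC"
    "AACTA"
)

# The /translation value from the NCBI record above.
NCBI_TRANSLATION = (
    "WGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED"
    "LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"
)

if __name__ == "__main__":
    for number, frame in enumerate(six_frames(JF909299), start=1):
        marker = "  <- matches NCBI" if frame == NCBI_TRANSLATION else ""
        print(f"frame {number}: {frame}{marker}")
```

Because `translate` does not look for a start codon, frame 2 simply begins at the first complete codon of that frame, which happens to be TGG.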
As we can see, the second reading frame is the expected translation as per NCBI, but that means that TGG -> W was used as a start codon. But TGG/UGG (W) is not a start codon, is it?
I have searched the net and did not find any information about a case where TGG/UGG could be the start codon. This is supposed to be a standard Homo sapiens insulin (INS) mRNA. Another example is this online tool example link, which produces exactly the same result as my code (found proteins are marked in red).
Can anyone please help me understand this particular case? Why does the NCBI database tell us that TGG/UGG can be a start codon instead of the standard ATG/AUG? This basically breaks my code, and I am looking to update it to support this logic as soon as I understand it.
Kind regards to this amazing community.
My understanding is that this sequence is a partial mRNA/CDS amplified through PCR, and it is missing the first 16 codons (MALWMRLLPLLALLAL), including the start codon (methionine).
I BLASTed the protein sequence from NCBI against the human protein database of UniProt and obtained a 100% identity alignment against P01308.
Then I did a global pairwise alignment with Needle from EMBOSS between the sequences retrieved from NCBI and UniProt (result available for a limited time).
I hope this answers your question,
António
I think I see what it is now. It is, as you mentioned, a partial cds. It means partial coding sequence, or cDNA sequence if derived from a cDNA library. The partial designation means it doesn't have all the necessary start or stop sequences present.
This makes more sense now. Just one last question about this. Why do we have a DNA sequence in NCBI that does not provide the full cds in this case? I am just trying to understand the logic behind it.
You can amplify DNA or mRNA sequences by PCR to sequence and submit them to NCBI.
If you are not familiar with PCR (Polymerase Chain Reaction), this is more difficult to understand. Basically, it is a molecular biology technique to get a specific/target sequence out of the full genome. Imagine that you want to sequence only one particular gene: you can amplify that gene by PCR (essentially isolating the DNA template of that gene from the rest of the genomic DNA) and sequence it. PCR relies on primers, small oligonucleotide sequences (~20 bp long) that pair at the 5' and 3' ends of the target gene region (so you always have at least one pair of primers, forward and reverse). Take the following example:
In this case, you will amplify the gene represented by the + sign/character. Amplification means isolation (due to the specificity of the primers) and synthesis of the target gene through multiple cycles in order to yield enough DNA molecules of that gene to sequence (otherwise you do not have enough material to sequence).
To answer your question: primers are small stretches of DNA that could, just by chance, pair with unspecific regions in the genome. Therefore, when you design primers for a specific gene, you usually target a very conserved/unique region of that gene to ensure some specificity. Of course, this does not always coincide with the beginning of the gene. The results are partial gene/mRNA/CDS sequences that can still be submitted to NCBI/GenBank and appear in the database as incomplete gene sequences.
I'm not sure if this is the only reason why you have partial DNA/mRNA sequences, but it is at least one of them.
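To make the primer argument concrete, here is a toy in-silico PCR sketch. Everything in it is invented for illustration: the function name `in_silico_pcr`, the template, and the 6-base primers (real primers are around 20 bases, and real primer binding tolerates mismatches, which this exact-match search ignores). It only shows that the amplicon starts at the forward primer site, so anything upstream of it, such as a start codon, is absent from the amplified and sequenced product.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def in_silico_pcr(template, forward_primer, reverse_primer):
    """Return the amplicon (primer sites included), or None if a primer has no site.

    Exact matching only; this is a simplification of real primer annealing.
    """
    start = template.find(forward_primer)
    if start == -1:
        return None
    # The reverse primer anneals to the given strand, so its binding site
    # appears on that strand as the primer's reverse complement.
    site = reverse_complement(reverse_primer)
    end = template.find(site, start + len(forward_primer))
    if end == -1:
        return None
    return template[start:end + len(site)]


# Toy template: an ATG sits upstream of the forward primer site.
TEMPLATE = "TTATGTT" + "ACGTAC" + "GGGGGGGG" + "CTTGCA" + "AAAAA"
FORWARD = "ACGTAC"
REVERSE = "TGCAAG"  # reverse complement of the downstream site CTTGCA

if __name__ == "__main__":
    amplicon = in_silico_pcr(TEMPLATE, FORWARD, REVERSE)
    print(amplicon)
    print("ATG in template:", "ATG" in TEMPLATE)
    print("ATG in amplicon:", "ATG" in amplicon)
```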
Well, the technology has been evolving fast since 2005 with the emergence of next-generation sequencing technologies, but even then and in the years that followed, sequencing was relatively expensive and involved many laborious steps of PCR or cloning. It still does today: many library preps require PCR, but the technology provides higher throughput, and it is easier and cheaper to sequence full DNA sequences, even entire genomes. In the end, it also depends on your objective. Let's say that you're assessing the potential mutations in one domain of a gene: you don't need to sequence the full gene.
I hope this helps,
António
Hey! Thanks. I am very new to the biology part of this. Just trying to write some basic DNA/RNA processing algorithms.
I am still confused about why NCBI has this particular DNA sequence as ORIGIN, but when we apply a basic/standard translation algorithm to it, it does not produce what NCBI has as the translated protein.
Do we need to take additional steps with that DNA sequence to produce that protein?
So, NCBI also has the full DNA sequence for this gene and the respective translated protein (check the 1st CDS; there are two): here
I think one of the criteria for choosing the right ORF is the length of the predicted ORF (that would correspond to your 2nd and 5th predicted sequences). In this case, perhaps NCBI relied on some information provided by the user who submitted this sequence.
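The length criterion can be sketched as follows. This is only an illustration of the idea in the answer, not how NCBI annotates records; `longest_open_stretch` and `rank_frames` are hypothetical names, and the six strings are the frames posted in the question. The key point for a partial cds is that the search does not require the stretch to begin with M.

```python
def longest_open_stretch(protein, stop="_"):
    """Return the longest run of residues not interrupted by a stop symbol."""
    return max(protein.split(stop), key=len)


def rank_frames(frames):
    """Return (frame_number, longest_stretch) pairs, longest first.

    Frame numbers are 1-based; ties keep their original frame order
    because Python's sort is stable.
    """
    scored = [(n, longest_open_stretch(f)) for n, f in enumerate(frames, start=1)]
    return sorted(scored, key=lambda pair: len(pair[1]), reverse=True)


# The six frames for JF909299.1 as posted in the question.
FRAMES = [
    "LGT_PSRSLCEPTPVRLTPGGSSLPSVRGTRLLLHTQDPPGGRGPAGGAGGAGRGPWCRQPAALGPGGVPAEAWHCGTMLYQHLLPLPAGELLQL",
    "WGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN",
    "GDLTQPQPL_TNTCAAHTWWKLST_CAGNEASSTHPRPAGRQRTCRWGRWSWAGALVQAACSPWPWRGPCRSVALWNNAVPASAPSTSWRTTAT",
    "_LQ_FSSW_REQMLVQHCSTMPRFCRDPSRAKGCRLPAPGPPPSSTCPTCRSSASRRVLGV_KKPRSPHTR_RASTRCEPHRCWFTKAAAGSGPQ",
    "SCSSSPAGRGSRCWYSIVPQCHASAGTPPGPRAAGCLHQGPRPAPPAPPAGPLPPGGSWVCRRSLVPRTLGRELPPGVSRTGVGSQRLRLGQVP",
    "VAVVLQLVEGADAGTALFHNATLLQGPLQGQGLQAACTRAPAQLHLPHLQVLCLPAGLGCVEEASFPAH_VESFHQV_AAQVLVHKGCGWVRSP",
]

if __name__ == "__main__":
    for number, stretch in rank_frames(FRAMES):
        print(f"frame {number}: longest stop-free stretch = {len(stretch)} residues")
```

Frames 2 and 5 tie because neither contains a stop, so length alone cannot choose between them; some extra information (for example, a similarity search against known proteins, or the submitter's annotation) is still needed.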