Question: How to translate protein sequences to Nucleotide sequences?
9 months ago by
United States
Misha30 wrote:

I want to convert a list of protein sequences in FASTA format, stored in a text file, into corresponding nucleotide sequences. A Google search only turns up DNA-to-protein conversion, not the reverse. I also came across "How do I find the nucleotide sequence of a protein using Biopython?", but that is not what I am looking for. Is there a way to do this in Python? I am sure there must be an existing way to do it rather than writing code from scratch. Thanks!

modified 9 months ago by Mensur Dlakic7.1k • written 9 months ago by Misha30

Would it be possible to give a bit of context?

Biologically, it is (nearly) impossible to translate a protein back to its DNA sequence.

You can translate the protein into a DNA sequence, but not into its original DNA sequence.

And more on topic: if there is a Biopython solution, why is that no good? I'm no Python expert, but it should be possible to create a dictionary where every amino acid points to a codon (3 nucleotides), then loop over each amino acid and print the codon for it.
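A minimal sketch of that dictionary approach, assuming the standard genetic code. The codon picked for each amino acid is arbitrary, and the names `ONE_CODON`, `back_translate` and `back_translate_fasta` are made up for this illustration, as is the choice of `NNN` for unrecognised residues:

```python
# One arbitrarily chosen codon per amino acid (standard genetic code).
# Any synonymous codon would be equally valid; this choice is an assumption.
ONE_CODON = {
    "A": "GCT", "R": "CGT", "N": "AAT", "D": "GAT", "C": "TGT",
    "Q": "CAA", "E": "GAA", "G": "GGT", "H": "CAT", "I": "ATT",
    "L": "CTT", "K": "AAA", "M": "ATG", "F": "TTT", "P": "CCT",
    "S": "TCT", "T": "ACT", "W": "TGG", "Y": "TAT", "V": "GTT",
    "*": "TAA",
}


def back_translate(protein):
    """Return one of many possible DNA sequences encoding `protein`."""
    codons = []
    for aa in protein.upper():
        # Unknown or ambiguous residues (e.g. X) become NNN.
        codons.append(ONE_CODON.get(aa, "NNN"))
    return "".join(codons)


def back_translate_fasta(lines):
    """Yield (header, dna) pairs from an iterable of FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, back_translate("".join(chunks))
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, back_translate("".join(chunks))


print(back_translate("MKW*"))  # ATGAAATGGTAA
```

Because an open file is an iterable of lines, `back_translate_fasta` can be fed a file handle directly, which covers the FASTA-in-a-text-file case from the question.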

modified 9 months ago • written 9 months ago by lieven.sterck8.9k

Hello Misha!

It appears that your post has been cross-posted to another site:

This is typically not recommended as it runs the risk of annoying people in both communities.

written 9 months ago by ATpoint41k
9 months ago by
cschu1812.5k wrote:

As lieven.sterck points out, this returns 'a' back-translation of a peptide sequence, not 'the' original coding sequence. You could use a more sophisticated statistical model based on codon frequencies from the organism under study, but this is the gist of it:

import random

# Standard genetic code: each amino acid maps to all of its synonymous codons.
AA2NA = {
    "A": "GCT,GCC,GCA,GCG".split(","),
    "R": "CGT,CGC,CGA,CGG,AGA,AGG".split(","),
    "N": "AAT,AAC".split(","),
    "D": "GAT,GAC".split(","),
    "C": "TGT,TGC".split(","),
    "Q": "CAA,CAG".split(","),
    "E": "GAA,GAG".split(","),
    "G": "GGT,GGC,GGA,GGG".split(","),
    "H": "CAT,CAC".split(","),
    "I": "ATT,ATC,ATA".split(","),
    "L": "TTA,TTG,CTT,CTC,CTA,CTG".split(","),
    "K": "AAA,AAG".split(","),
    "M": "ATG".split(","),
    "F": "TTT,TTC".split(","),
    "P": "CCT,CCC,CCA,CCG".split(","),
    "S": "TCT,TCC,TCA,TCG,AGT,AGC".split(","),
    "T": "ACT,ACC,ACA,ACG".split(","),
    "W": "TGG".split(","),
    "Y": "TAT,TAC".split(","),
    "V": "GTT,GTC,GTA,GTG".split(","),
    "*": "TAA,TGA,TAG".split(","),
}

def aa2na(seq):
    # Pick one codon at random per residue; unknown residues become "---".
    na_seq = [random.choice(AA2NA.get(c, ["---"])) for c in seq]
    return "".join(na_seq)


One possible output:


Edit: changed return value of AA2NA.get() for "unknown" amino acids to "---" instead of "-".
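The codon-frequency idea mentioned at the top of this answer could be sketched roughly as follows. The weights below are placeholders invented for illustration, not measured codon usage for any organism, and cover only three amino acids; `CODON_WEIGHTS` and `weighted_back_translate` are hypothetical names. A real table would have to come from codon usage data for the organism of interest.

```python
import random

# Hypothetical relative codon weights for three amino acids only.
# These numbers are placeholders, NOT real codon usage for any organism.
CODON_WEIGHTS = {
    "M": {"ATG": 1.0},
    "K": {"AAA": 0.7, "AAG": 0.3},
    "L": {"CTG": 0.5, "TTA": 0.1, "TTG": 0.1,
          "CTT": 0.1, "CTC": 0.1, "CTA": 0.1},
}


def weighted_back_translate(seq, weights=CODON_WEIGHTS, rng=random):
    """Sample one codon per residue in proportion to its weight."""
    out = []
    for aa in seq:
        table = weights.get(aa)
        if table is None:
            # Same convention as above: unknown residues become "---".
            out.append("---")
            continue
        codons = list(table)
        # random.choices draws with replacement according to the weights.
        pick = rng.choices(codons, weights=[table[c] for c in codons], k=1)[0]
        out.append(pick)
    return "".join(out)


print(weighted_back_translate("MKL"))
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible.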

modified 9 months ago • written 9 months ago by cschu1812.5k

Thanks a lot for answering this.

written 9 months ago by Misha30
9 months ago by
Mensur Dlakic7.1k wrote:

There is something called codon degeneracy, which means that multiple nucleotide triplets (codons) translate into the same amino acid. Conversely, a single amino acid can be back-translated into any of several codons, which is why there is no single solution for what you are asking.
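To get a feel for how quickly the ambiguity grows, here is a small sketch that counts the DNA sequences encoding a given peptide under the standard genetic code. The count is simply the product of the per-residue codon counts; `DEGENERACY` and `count_back_translations` are names made up for this example.

```python
from math import prod

# Number of synonymous codons per amino acid in the standard genetic code.
DEGENERACY = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4, "*": 3,
}


def count_back_translations(protein):
    """Count the distinct DNA sequences that encode `protein`."""
    return prod(DEGENERACY[aa] for aa in protein)


print(count_back_translations("MW"))    # 1 (both residues have one codon)
print(count_back_translations("MLSR"))  # 1 * 6 * 6 * 6 = 216
```

Even a four-residue peptide can have hundreds of valid coding sequences, so for a full-length protein the original gene cannot be recovered from the amino acid sequence alone.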

written 9 months ago by Mensur Dlakic7.1k