Question

How to translate protein sequences to Nucleotide sequences?

3

4.2 years ago

Misha ▴ 60

I want to convert a list of protein sequences in FASTA format, stored in a text file, into the corresponding nucleotide sequences. A Google search only turns up DNA-to-protein conversion, not the reverse. I also came across "How do I find the nucleotide sequence of a protein using Biopython?", but that is not what I am looking for. Is there a way to do this in Python? I am sure there must be an existing approach, so that I don't have to write the code from scratch. Thanks!

protein sequence Nucleotide translation • 5.5k views

ADD COMMENT • link updated 4.2 years ago by Mensur Dlakic ★ 27k • written 4.2 years ago by Misha ▴ 60

1

Would it be possible to give a bit of context?

Biologically, it is (near) impossible to translate a protein back to its DNA sequence.

You can translate the protein into a DNA sequence, but not into its DNA sequence.

And more on topic: if there is a Biopython solution, why is that no good? I'm no Python expert, but it should be possible to create a dictionary where every amino acid points to a codon (3 nucleotides), then loop over each amino acid and print the codon for it.
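The dictionary-and-loop idea could look roughly like the sketch below. It is only an illustration: the single codon picked for each amino acid is arbitrary, and the names `FIXED_CODON`, `read_fasta` and `back_translate` are made up for this example rather than taken from Biopython or any other library. Unknown residues are written as `NNN`, which is also an assumption.

```python
# One arbitrarily chosen codon per amino acid (standard genetic code).
# The choice is an assumption for illustration, not a biological preference.
FIXED_CODON = {
    "A": "GCT", "R": "CGT", "N": "AAT", "D": "GAT", "C": "TGT",
    "Q": "CAA", "E": "GAA", "G": "GGT", "H": "CAT", "I": "ATT",
    "L": "CTT", "K": "AAA", "M": "ATG", "F": "TTT", "P": "CCT",
    "S": "TCT", "T": "ACT", "W": "TGG", "Y": "TAT", "V": "GTT",
    "*": "TAA",
}


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def back_translate(protein):
    """Return one DNA sequence encoding `protein`; unknown residues become NNN."""
    return "".join(FIXED_CODON.get(aa, "NNN") for aa in protein.upper())


# Hypothetical input; for a real file, pass an open file handle instead.
example = """>seq1
MKW
>seq2
MA
VL*
"""

for header, protein in read_fasta(example.splitlines()):
    print(f">{header}\n{back_translate(protein)}")
```

To run this on a real file, something like `with open("proteins.txt") as fh:` followed by `read_fasta(fh)` should work, where `proteins.txt` is a placeholder file name.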

ADD REPLY • link 4.2 years ago by lieven.sterck 15k

1

Hello Misha!

It appears that your post has been cross-posted to another site: https://bioinformatics.stackexchange.com/questions/11345/how-to-translate-amino-acid-sequences-to-nucleotide-sequences

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLY • link 4.2 years ago by ATpoint 81k

2

4.2 years ago

Mensur Dlakic ★ 27k

There is something called codon degeneracy, which means that multiple nucleotide triplets (codons) translate into the same amino acid. Conversely, a single amino acid can be back-translated into any of several codons, which is why there is no single solution to what you are asking.
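To get a feel for how quickly the ambiguity grows, here is a small sketch that counts the possible DNA sequences for a given peptide under the standard genetic code. The names `CODON_COUNT` and `count_back_translations` are invented for this illustration, and it assumes the input contains only the 20 standard amino acids and `*` for stop.

```python
from math import prod  # available in Python 3.8+

# Number of codons per amino acid in the standard genetic code (they sum to 64).
CODON_COUNT = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4, "*": 3,
}


def count_back_translations(protein):
    """Multiply the codon counts of all residues in `protein`."""
    return prod(CODON_COUNT[aa] for aa in protein)


print(count_back_translations("MW"))   # 1: Met and Trp each have a single codon
print(count_back_translations("MLS"))  # 36: 1 * 6 * 6
```

Only peptides made entirely of Met and Trp have a unique back-translation; for anything else the count multiplies with every residue.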

ADD COMMENT • link 4.2 years ago by Mensur Dlakic ★ 27k

score 4 · Accepted Answer · 2020-02-08

As lieven.sterck points out, this returns 'a' back-translation of a peptide sequence, not 'the' original one. You could use a more dedicated statistical model based on codon frequencies from your organism under study, but this is the gist of it:

import random

# All codons for each amino acid in the standard genetic code ("*" = stop).
AA2NA = {
    "A": "GCT,GCC,GCA,GCG".split(","),
    "R": "CGT,CGC,CGA,CGG,AGA,AGG".split(","),
    "N": "AAT,AAC".split(","),
    "D": "GAT,GAC".split(","),
    "C": "TGT,TGC".split(","),
    "Q": "CAA,CAG".split(","),
    "E": "GAA,GAG".split(","),
    "G": "GGT,GGC,GGA,GGG".split(","),
    "H": "CAT,CAC".split(","),
    "I": "ATT,ATC,ATA".split(","),
    "L": "TTA,TTG,CTT,CTC,CTA,CTG".split(","),
    "K": "AAA,AAG".split(","),
    "M": "ATG".split(","),
    "F": "TTT,TTC".split(","),
    "P": "CCT,CCC,CCA,CCG".split(","),
    "S": "TCT,TCC,TCA,TCG,AGT,AGC".split(","),
    "T": "ACT,ACC,ACA,ACG".split(","),
    "W": "TGG".split(","),
    "Y": "TAT,TAC".split(","),
    "V": "GTT,GTC,GTA,GTG".split(","),
    "*": "TAA,TGA,TAG".split(",")
}

def aa2na(seq):
    # Pick one codon at random for each residue; unknown residues become "---".
    na_seq = [random.choice(AA2NA.get(c, ["---"])) for c in seq]
    return "".join(na_seq)

print("MARNDCQEGHILKMFPSTWYV*", aa2na("MARNDCQEGHILKMFPSTWYV*"))

One possible output:

MARNDCQEGHILKMFPSTWYV* ATGGCTCGAAATGACTGCCAAGAGGGACACATTCTTAAAATGTTTCCGAGTACCTGGTACGTCTAA

Edit: changed return value of AA2NA.get() for "unknown" amino acids to "---" instead of "-".
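For the codon-frequency variant mentioned above, a minimal sketch could use `random.choices` with per-codon weights. The weights below are made-up placeholders covering only three amino acids, not real usage data for any organism; you would replace them with a codon usage table for your species. The names `CODON_WEIGHTS` and `aa2na_weighted` are hypothetical.

```python
import random

# HYPOTHETICAL weights for illustration only; not measured codon usage.
# Replace with frequencies from a codon usage table for your organism.
CODON_WEIGHTS = {
    "M": {"ATG": 1.0},
    "K": {"AAA": 0.7, "AAG": 0.3},
    "A": {"GCT": 0.1, "GCC": 0.3, "GCA": 0.2, "GCG": 0.4},
}


def aa2na_weighted(seq, weights, rng=random):
    """Sample one codon per residue in proportion to its weight.

    Residues missing from `weights` are written as "---".
    """
    codons = []
    for aa in seq:
        table = weights.get(aa)
        if table is None:
            codons.append("---")
            continue
        # random.choices returns a list of k picks; take the single pick.
        pick = rng.choices(list(table), weights=list(table.values()), k=1)[0]
        codons.append(pick)
    return "".join(codons)


# A seeded generator makes the result reproducible between runs.
rng = random.Random(0)
print("MKA", aa2na_weighted("MKA", CODON_WEIGHTS, rng))
```

This still returns only one of many possible back-translations; the weights merely bias it towards codons that are assumed to be more common.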