How to back translate the protein into amino acids sequence with python?
2
0
Entering edit mode
5.3 years ago

Dear all, I am trying to translate protein back into RNA sequences. And when doing so, this kind of back translation gives a lot of RNA sequences.

So for that, I wrote a dictionary of all the possibilities present. And then trying to back translate it with python. But the first problem is I don't want to blow up my computer by printing out the result. For that, I made a mock fasta file containing a mock sequence and try to run this. This file contains MPG and MPPP as protein sequences: -But this doesn't yield the result as PPP 3 times will be seen as 1 time? Suggestions anyone?

Thank u

RNA-Seq protein sequence python • 6.3k views
ADD COMMENT
0
Entering edit mode

Please correct you title, I believe you mean How to back back translate the protein into RNA sequence with python?

Show the code you wrote so far.

ADD REPLY
0
Entering edit mode

ok

My mock protein are in fasta file called HIJ2. fasta. Protein sequences are MPG and MPPP.

    amino_x = {   
    'M': ['GCU', 'GCC', 'GCA'], 
    'P': ['UGU','UGC'],
    'G': ['AAA']
     }

from Bio import SeqIO
for protein in SeqIO.parse("HIJ2.fasta","fasta"): 
    for i in protein:
        amino_acids = list()

        for value in protein:
            amino_acids.append(amino_x[value])
    print(amino_acids)

from itertools import product
    list(itertools.product(amino_acids))

This code is with list.

ADD REPLY
0
Entering edit mode

Sorry, I tried to put the code inside the code kernel but this doesn't work.

ADD REPLY
0
Entering edit mode

Like this?

from Bio import SeqIO
import itertools

amino_x = {
    'M': ['GCU', 'GCC', 'GCA'],
    'P': ['UGU','UGC'],
    'G': ['AAA']
 }

for protein in SeqIO.parse("HIJ2.fasta","fasta"):
    printprotein.id)
    #    ^ there must be a ( after print. Due to a known bug in this forum software it is not shown.
    amino_acids = list()

    for value in protein.seq:
        amino_acids.append(amino_x[value])

    for translate_back in itertools.product(*amino_acids):
        print(translate_back)

Output:

A
('GCU', 'UGU', 'AAA')
('GCU', 'UGC', 'AAA')
('GCC', 'UGU', 'AAA')
('GCC', 'UGC', 'AAA')
('GCA', 'UGU', 'AAA')
('GCA', 'UGC', 'AAA')
B
('UGU', 'UGU', 'UGU')
('UGU', 'UGU', 'UGC')
('UGU', 'UGC', 'UGU')
('UGU', 'UGC', 'UGC')
('UGC', 'UGU', 'UGU')
('UGC', 'UGU', 'UGC')
('UGC', 'UGC', 'UGU')
('UGC', 'UGC', 'UGC')

HIJ2.fasta

>A
MPG
>B
PPP
ADD REPLY
0
Entering edit mode

Can this code work for longer protein sequences? And how can I add some code that will only print out, lets say 3 rna sequences and not all of them.

ADD REPLY
1
Entering edit mode
5.3 years ago
Joe 21k

You haven't said what the error was with your code, but the following subtly modified version works for me:

Input:

>Pep1
PPP
>Pep2
MPG

Code:

import sys
from Bio import SeqIO
from itertools import product

amino_x = {'M': ['GCU', 'GCC', 'GCA'],
           'P': ['UGU','UGC'],
           'G': ['AAA']}

for protein in SeqIO.parse(sys.argv[1], "fasta"):
    for i in protein:
        amino_acids = []
        for residue in protein:
            amino_acids.append(amino_x[residue])
    print("{} -> {}".formatprotein.id, amino_acids))

    for a in product(*amino_acids):
        print(a)

Yields:

Pep1 -> [['UGU', 'UGC'], ['UGU', 'UGC'], ['UGU', 'UGC']]
('UGU', 'UGU', 'UGU')
('UGU', 'UGU', 'UGC')
('UGU', 'UGC', 'UGU')
('UGU', 'UGC', 'UGC')
('UGC', 'UGU', 'UGU')
('UGC', 'UGU', 'UGC')
('UGC', 'UGC', 'UGU')
('UGC', 'UGC', 'UGC')
Pep2 -> [['GCU', 'GCC', 'GCA'], ['UGU', 'UGC'], ['AAA']]
('GCU', 'UGU', 'AAA')
('GCU', 'UGC', 'AAA')
('GCC', 'UGU', 'AAA')
('GCC', 'UGC', 'AAA')
('GCA', 'UGU', 'AAA')
('GCA', 'UGC', 'AAA')

Is that not what you were looking for?


Just FYI, for a long sequence, this back translation is going to get huge.

For a peptide of just 19 residues, the run time is already over 5 seconds (there is a reason BioPython doesn't offer back translation natively):

real    0m5.169s
user    0m0.568s
sys     0m0.376s

For 30 residues this climbs to over 48 minutes! - and I have no idea how much longer it would have gone on for, I killed it after I got bored!

  File "backtranslate.py", line 17, in <module>
    print(a)
KeyboardInterrupt

real    48m41.443s
user    9m37.704s
sys     6m38.621s

And this only considering 3 amino acids with a total of 6 codons. This will simply be incalculable if you want to do anything more complicated.

ADD COMMENT
0
Entering edit mode

I know it takes a lot of time to run these sequences. That's why I might add some code to print only 3 translations of each protein. But still translate all and print only lets say 3.

ADD REPLY
0
Entering edit mode

And I tried it with itertool as well, its much faster and print out the possible results very fast.

ADD REPLY
0
Entering edit mode

If you are happy with this notation:

Pep2 -> [['GCU', 'GCC', 'GCA'], ['UGU', 'UGC'], ['AAA']]

Instead of needing all the cartesian products:

('GCU', 'UGU', 'AAA')
('GCU', 'UGC', 'AAA')
('GCC', 'UGU', 'AAA')
('GCC', 'UGC', 'AAA')
('GCA', 'UGU', 'AAA')
('GCA', 'UGC', 'AAA')

It will be much faster.

I don't think you fully appreicate how big the task is though. To enumerate all the possibilities for a single real world protein, is likely weeks or months of computation.

ADD REPLY
0
Entering edit mode
5.3 years ago

You have to break the for after you reached the number of sequences you like to print.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
import sys
from argparse import ArgumentParser
from Bio import SeqIO
import itertools


def main():
    args = get_args()

    amino_x = {
        'M': ['GCU', 'GCC', 'GCA'],
        'P': ['UGU','UGC'],
        'G': ['AAA']
    }

    for protein in SeqIO.parse(args.input,"fasta"):
        printprotein.id)
        #    ^ there must be a ( after print. Due to a known bug in this forum software it is not shown.
        amino_acids = list()

        for value in protein.seq:
            amino_acids.append(amino_x[value])

        for i, translate_back in enumerate(itertools.product(*amino_acids)):
            print(translate_back)

            if i == args.number-1:
                break


def get_args():
    parser = ArgumentParser(prog="translate.py", description="")
    parser.add_argument('--version', action='version', version='%(prog)s 0.1')

    parser.add_argument("input", help="input file, use - to read from stdin")
    parser.add_argument("-n", "--number", help="number of translations",  type=int, required=True)
    parser.add_argument("-o", "--output", help="output file")

    if len(sys.argv) == 1:
        parser.print_help(sys.stderr)
        sys.exit(1)

    args = parser.parse_args()

    if args.output:
        sys.stdout = open(args.output, "w")

    if args.input != "-":
        sys.stdin = open(args.input, "r")

    return args


if __name__ == '__main__':
    main()

Example for getting max sequences per protein:

$ python translate.py -n 3 protein.fa

But could you please explain for what you need this?

fin swimmer

ADD COMMENT
0
Entering edit mode

Thank you, We are doing a project for python and trying to make some codes.

ADD REPLY
0
Entering edit mode

And for that reason you just make someone else do your assignment?

ADD REPLY
0
Entering edit mode

No, I am not going to use his code I wrote my own. What I do is break down the code to understand it. This is what we wrote.

codons_ = [CODONS[codon] for codon in record.seq] + [CODONS['Stop']]
data[record.id] = ["".join(result) for result in itertools.islice(itertools.product(*codons_, repeat=1),10)]
 print(data)

Have a nice day Wouter.

ADD REPLY
0
Entering edit mode

I think what fin swimmer meant was what is the biological purpose of the code?

Back translation isnt something people really ever do because its computationally massive/slow. There's no code you (or anyone) can write that can make this work fast enough for a 'real' example.

ADD REPLY
0
Entering edit mode

Sorry for late reply. I just had a lot of work. Back translation was one of the questions in my project.

For this back translation: We made a dictionary of the possible translations and added a limit to it. Let's say you can add 10, 20.... possibilities. And then this was used to count the nucleotide from the sequence and nucleotide were plotted.

And thank u also for all your help.

ADD REPLY
1
Entering edit mode

I've moved both of the primary comments in this thread to answers of their own to provide some closure to the question. If the answers helped/resolved your issues, be sure to accept at least one (you can accept as many as you like), using the tick mark at the left of the post.

ADD REPLY

Login before adding your answer.

Traffic: 2596 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6