Question: How to back translate the protein into amino acids sequence with python?
0
gravatar for macadamopetrus
8 days ago by
macadamopetrus0 wrote:

Dear all, I am trying to translate protein back into RNA sequences. And when doing so, this kind of back translation gives a lot of RNA sequences.

So for that, I wrote a dictionary of all the possibilities present. And then trying to back translate it with python. But the first problem is I don't want to blow up my computer by printing out the result. For that, I made a mock fasta file containing a mock sequence and try to run this. This file contains MPG and MPPP as protein sequences: -But this doesn't yield the result as PPP 3 times will be seen as 1 time? Suggestions anyone?

Thank u

ADD COMMENTlink modified 8 days ago • written 8 days ago by macadamopetrus0

Please correct you title, I believe you mean How to back back translate the protein into RNA sequence with python?

Show the code you wrote so far.

ADD REPLYlink written 8 days ago by h.mon22k

ok

My mock protein are in fasta file called HIJ2. fasta. Protein sequences are MPG and MPPP.

    amino_x = {   
    'M': ['GCU', 'GCC', 'GCA'], 
    'P': ['UGU','UGC'],
    'G': ['AAA']
     }

from Bio import SeqIO
for protein in SeqIO.parse("HIJ2.fasta","fasta"): 
    for i in protein:
        amino_acids = list()

        for value in protein:
            amino_acids.append(amino_x[value])
    print(amino_acids)

from itertools import product
    list(itertools.product(amino_acids))

This code is with list.

ADD REPLYlink modified 8 days ago by genomax60k • written 8 days ago by macadamopetrus0

Sorry, I tried to put the code inside the code kernel but this doesn't work.

ADD REPLYlink written 8 days ago by macadamopetrus0

Like this?

from Bio import SeqIO
import itertools

amino_x = {
    'M': ['GCU', 'GCC', 'GCA'],
    'P': ['UGU','UGC'],
    'G': ['AAA']
 }

for protein in SeqIO.parse("HIJ2.fasta","fasta"):
    printprotein.id)
    #    ^ there must be a ( after print. Due to a known bug in this forum software it is not shown.
    amino_acids = list()

    for value in protein.seq:
        amino_acids.append(amino_x[value])

    for translate_back in itertools.product(*amino_acids):
        print(translate_back)

Output:

A
('GCU', 'UGU', 'AAA')
('GCU', 'UGC', 'AAA')
('GCC', 'UGU', 'AAA')
('GCC', 'UGC', 'AAA')
('GCA', 'UGU', 'AAA')
('GCA', 'UGC', 'AAA')
B
('UGU', 'UGU', 'UGU')
('UGU', 'UGU', 'UGC')
('UGU', 'UGC', 'UGU')
('UGU', 'UGC', 'UGC')
('UGC', 'UGU', 'UGU')
('UGC', 'UGU', 'UGC')
('UGC', 'UGC', 'UGU')
('UGC', 'UGC', 'UGC')

HIJ2.fasta

>A
MPG
>B
PPP
ADD REPLYlink modified 8 days ago • written 8 days ago by finswimmer8.8k

Can this code work for longer protein sequences? And how can I add some code that will only print out, lets say 3 rna sequences and not all of them.

ADD REPLYlink modified 7 days ago • written 7 days ago by macadamopetrus0
1
gravatar for jrj.healey
8 days ago by
jrj.healey9.7k
United Kingdom
jrj.healey9.7k wrote:

You haven't said what the error was with your code, but the following subtly modified version works for me:

Input:

>Pep1
PPP
>Pep2
MPG

Code:

import sys
from Bio import SeqIO
from itertools import product

amino_x = {'M': ['GCU', 'GCC', 'GCA'],
           'P': ['UGU','UGC'],
           'G': ['AAA']}

for protein in SeqIO.parse(sys.argv[1], "fasta"):
    for i in protein:
        amino_acids = []
        for residue in protein:
            amino_acids.append(amino_x[residue])
    print("{} -> {}".formatprotein.id, amino_acids))

    for a in product(*amino_acids):
        print(a)

Yields:

Pep1 -> [['UGU', 'UGC'], ['UGU', 'UGC'], ['UGU', 'UGC']]
('UGU', 'UGU', 'UGU')
('UGU', 'UGU', 'UGC')
('UGU', 'UGC', 'UGU')
('UGU', 'UGC', 'UGC')
('UGC', 'UGU', 'UGU')
('UGC', 'UGU', 'UGC')
('UGC', 'UGC', 'UGU')
('UGC', 'UGC', 'UGC')
Pep2 -> [['GCU', 'GCC', 'GCA'], ['UGU', 'UGC'], ['AAA']]
('GCU', 'UGU', 'AAA')
('GCU', 'UGC', 'AAA')
('GCC', 'UGU', 'AAA')
('GCC', 'UGC', 'AAA')
('GCA', 'UGU', 'AAA')
('GCA', 'UGC', 'AAA')

Is that not what you were looking for?


Just FYI, for a long sequence, this back translation is going to get huge.

For a peptide of just 19 residues, the run time is already over 5 seconds (there is a reason BioPython doesn't offer back translation natively):

real    0m5.169s
user    0m0.568s
sys     0m0.376s

For 30 residues this climbs to over 48 minutes! - and I have no idea how much longer it would have gone on for, I killed it after I got bored!

  File "backtranslate.py", line 17, in <module>
    print(a)
KeyboardInterrupt

real    48m41.443s
user    9m37.704s
sys     6m38.621s

And this only considering 3 amino acids with a total of 6 codons. This will simply be incalculable if you want to do anything more complicated.

ADD COMMENTlink modified 8 days ago • written 8 days ago by jrj.healey9.7k

I know it takes a lot of time to run these sequences. That's why I might add some code to print only 3 translations of each protein. But still translate all and print only lets say 3.

ADD REPLYlink modified 7 days ago • written 7 days ago by macadamopetrus0

And I tried it with itertool as well, its much faster and print out the possible results very fast.

ADD REPLYlink written 7 days ago by macadamopetrus0

If you are happy with this notation:

Pep2 -> [['GCU', 'GCC', 'GCA'], ['UGU', 'UGC'], ['AAA']]

Instead of needing all the cartesian products:

('GCU', 'UGU', 'AAA')
('GCU', 'UGC', 'AAA')
('GCC', 'UGU', 'AAA')
('GCC', 'UGC', 'AAA')
('GCA', 'UGU', 'AAA')
('GCA', 'UGC', 'AAA')

It will be much faster.

I don't think you fully appreicate how big the task is though. To enumerate all the possibilities for a single real world protein, is likely weeks or months of computation.

ADD REPLYlink modified 7 days ago • written 7 days ago by jrj.healey9.7k
0
gravatar for finswimmer
7 days ago by
finswimmer8.8k
Germany
finswimmer8.8k wrote:

You have to break the for after you reached the number of sequences you like to print.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
import sys
from argparse import ArgumentParser
from Bio import SeqIO
import itertools


def main():
    args = get_args()

    amino_x = {
        'M': ['GCU', 'GCC', 'GCA'],
        'P': ['UGU','UGC'],
        'G': ['AAA']
    }

    for protein in SeqIO.parse(args.input,"fasta"):
        printprotein.id)
        #    ^ there must be a ( after print. Due to a known bug in this forum software it is not shown.
        amino_acids = list()

        for value in protein.seq:
            amino_acids.append(amino_x[value])

        for i, translate_back in enumerate(itertools.product(*amino_acids)):
            print(translate_back)

            if i == args.number-1:
                break


def get_args():
    parser = ArgumentParser(prog="translate.py", description="")
    parser.add_argument('--version', action='version', version='%(prog)s 0.1')

    parser.add_argument("input", help="input file, use - to read from stdin")
    parser.add_argument("-n", "--number", help="number of translations",  type=int, required=True)
    parser.add_argument("-o", "--output", help="output file")

    if len(sys.argv) == 1:
        parser.print_help(sys.stderr)
        sys.exit(1)

    args = parser.parse_args()

    if args.output:
        sys.stdout = open(args.output, "w")

    if args.input != "-":
        sys.stdin = open(args.input, "r")

    return args


if __name__ == '__main__':
    main()

Example for getting max sequences per protein:

$ python translate.py -n 3 protein.fa

But could you please explain for what you need this?

fin swimmer

ADD COMMENTlink modified 7 days ago • written 7 days ago by finswimmer8.8k

Thank you, We are doing a project for python and trying to make some codes.

ADD REPLYlink written 7 days ago by macadamopetrus0

And for that reason you just make someone else do your assignment?

ADD REPLYlink written 7 days ago by WouterDeCoster35k

No, I am not going to use his code I wrote my own. What I do is break down the code to understand it. This is what we wrote.

codons_ = [CODONS[codon] for codon in record.seq] + [CODONS['Stop']]
data[record.id] = ["".join(result) for result in itertools.islice(itertools.product(*codons_, repeat=1),10)]
 print(data)

Have a nice day Wouter.

ADD REPLYlink modified 7 days ago • written 7 days ago by macadamopetrus0

I think what fin swimmer meant was what is the biological purpose of the code?

Back translation isnt something people really ever do because its computationally massive/slow. There's no code you (or anyone) can write that can make this work fast enough for a 'real' example.

ADD REPLYlink written 7 days ago by jrj.healey9.7k

Sorry for late reply. I just had a lot of work. Back translation was one of the questions in my project.

For this back translation: We made a dictionary of the possible translations and added a limit to it. Let's say you can add 10, 20.... possibilities. And then this was used to count the nucleotide from the sequence and nucleotide were plotted.

And thank u also for all your help.

ADD REPLYlink modified 6 days ago • written 6 days ago by macadamopetrus0
1

I've moved both of the primary comments in this thread to answers of their own to provide some closure to the question. If the answers helped/resolved your issues, be sure to accept at least one (you can accept as many as you like), using the tick mark at the left of the post.

ADD REPLYlink written 6 days ago by jrj.healey9.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1210 users visited in the last hour