from protein to tRNA combinations
0
0
Entering edit mode
4.0 years ago

Hello, I hope this isnt the wrong website to post this, I'm new to this and I got pretty lost

I have to write a function where there is a protein sequence that needs all possible combinations of tRNAs that it can be converted to, as long as they dont exceed 5000 combinations (e.g. for the sequence “CH” the script’s output would be “UGUCAU”, “UGUCAC”, “UGCCAU”, and “UGCCAC”, a.k.a 4 combinations). And as long as the sequence has an identifier.

I generated this code, but It doesn't seem to be working, I'm very new to python, so it probably has a lot of mistakes :

from Bio import SeqIO
from Bio import SeqUtils
import itertools

def all_combinations (file_name, sequence_ID):
   seq_records = SeqIO.parse(file_name, 'fasta')
   sequence_ID = record.id
   for record in seq_records:
       record = [record.reverse_complement(id="rc_"+record.id, description = "reverse complement")\
                for record in seq_records if len(record)<5000]
   for L in range(0, len(record)+1): 
        subset = itertools.combinations(record, L)
if record.id in seq_record:
    print (subset)  
else:
   print ('error: sequence doesnt have an indentifier')

Im sorry if this is messy or wrong, this is my first time posting anywhere.

Thank you very much for anyone who can help.

RNA-Seq rna-seq • 2.0k views
ADD COMMENT
0
Entering edit mode

The function is not indented, is it like this in the original script?

ADD REPLY
0
Entering edit mode

There is no original script, i started it on my own based on whats required, what do you mean by the function is not indented

ADD REPLY
0
Entering edit mode

I mean that in python you should indent each clause, e.g.:

def foo(fn):
    print(a)
    if (fn>8):
        print(fn)
ADD REPLY
0
Entering edit mode

what about now ? I tried to edit it a little.

ADD REPLY
1
Entering edit mode

The bottom six lines don't look correctly indented.

What Asaf was trying to tell you: in Python the indentation of code blocks is an essential part of the control flow. If you don't indent correctly, the code might work but would potentially do wrong things. In other programming languages this is often dealt with by delimiting code blocks with { and }.

ADD REPLY
0
Entering edit mode

a protein sequence that needs all possible combinations of tRNAs that it can be converted to

I assume you mean codons (or anticodons) instead of tRNAs?

Concerning your code: I'm sorry, but it doesn't seem to make sense. Can you explain what your approach to the problem is and what you think your code is doing?

ADD REPLY
0
Entering edit mode

Oh I'm really sorry, ok... we need to do a reverse function, that goes from the protein amino acids to the tRNA that composed them, therefore we are supposed to have different combinations, like in the example i wrote, so the code takes a fasta file that contains a protein sequence, and the sequence identifier, and its supposed to write all the possible RNA combinations for the sequence in the fasta file, with a condition that the total of combinations should be less than 5000, then it should give the number of the combinations without actually printing them, like for the example it should only write 4.

I'm really thankful, and sorry to pull this, I know It's probably lame, I just thought i could find some help if i can

ADD REPLY
0
Entering edit mode

EDIT: I read your question more thoroughly, and it looks like you're indeed looking for anti-codons (unless your use of reverse complement is by mistake) and you have a cap on the combinations, which is good. Just a heads up: You'll reach 5000 sequences in just about 8 AAs, assuming each AA can be coded by 3 codons.

You're looking for the codons (mRNA triplet sequences), not the "tRNA sequences". The transfer RNA has the anticodon sequence that binds to the mRNA codon sequence - each tRNA brings a unique amino acid that becomes part of the polypeptide chain being assembled.

You'd benefit from having a dictionary where for each amino acid key, you have an array of possible codons. This would mean that there would be a geometric explosion of DNA sequences for any polypeptide sequence as the number of amino acids increases. Say, each AA has 3 codons on average, a 2-aa input will give you 9 different sequences. You will end up with 3n output sequences where n is the length of the input sequence. For a 20 AA peptide, you'd have 3 billion output sequences. Exercise caution - I don't see how you can store these rapidly mutating strings in memory without abusing RAM.

ADD REPLY
0
Entering edit mode

I'm really thankful, and sorry to pull this, I know It's probably lame, I just thought i could find some help if i can

No worries, I just asked for the purpose of understanding how to better help you with this. Have a look at _r_am's answer below (even if it is struck through).

Build yourself said dictionary, e.g.

{"M": ["AUG"], "P": ["CCA", "CCC", "CCG", "CCU"], ...}

and then use this information to calculate the number of possible combinations for each protein sequence. For each protein with < 5000 combinations use the very same dictionary to reconstruct and print the respective sequences. There is no need to use itertools.combinations or similar.

ADD REPLY
0
Entering edit mode

Oh ok, I tried that, did the dictionary, then went to write a function that turns elements of the fasta file into a list, then tried to get combinations out of it based on the dictionary, but I got an error...Im' really sorry to bother you, but can you just check the code I wrote ?

def protein_fasta (protein_file):
protein_sequence = []
protein = SeqIO.parse(protein_file, format = 'fasta')
for Seqrecord in protein: 
    protein_sequence.append(Seqrecord.seq)
print (protein_sequence)

for key in protein_sequence:
     codon = [ list(RNA_codon_table[key]) for key in protein_sequence ]
print(list(itertools.product(codon)))
ADD REPLY
0
Entering edit mode

What's the error you're getting?

I'd do something like this:

codon_table = {"M": ["AUG"], "P": ["CCA", "CCC", "CCG", "CCU"], ...}

for record in protein:
    n_combinations = 0
    for aa in record.seq:
        n_combinations *= codon_table.get(aa, 0)
    if n_combinations <= 5000:
        generate_all_combinations

Think about how you could do the generate_all_combinations step. I don't think you need to use any itertools function. Write it down in pseudocode here and then we can check how that works in Python.

Edit: replaced addition with multiplication (thanks _r_am!)

ADD REPLY
0
Entering edit mode

I think combinations would multiply, not add. n_combinations *= codon_table.get(aa, 0) (if I'm understanding the get function correctly - I don't know python).

One could even map an array of the first 8 AAs to their number of codons and reduce this array with *. If that operation exceeds 5000, skip the sequence. If not, start at length=8 and go one by one. There should also be a cap - say 15 AAs - beyond which you can safely eliminate the sequence without even needing to calculate the number of combinations.

ADD REPLY
1
Entering edit mode

Ouch, of course, it should be *, good spot, thanks!

Just, fyi codon_table.get(aa, 0) returns 0 if aa is not in the table instead of erroring out.

Re your other suggestions -- yea, second those! Except if we have an M-polymer or W-polymer with length < 5001 :)

ADD REPLY
0
Entering edit mode

Ah, I see. Thanks for the explanation. I wonder why OP is looking for the anti-codon sequence and not the codon sequence though.

BTW I googled a bit and there exists JS code (from the Sequence Manipulation Suite) that does what OP wants and adds in a probability parameter based on codon usage statistics. Maybe OP would want to implement the logic in python: https://github.com/paulstothard/sequence_manipulation_suite/blob/master/docs/scripts/multi_rev_trans.js

ADD REPLY
0
Entering edit mode

Maybe it's an extra level of complexity in the task (or a misunderstanding).

ADD REPLY

Login before adding your answer.

Traffic: 2216 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6