Question: Beginner in Python- need in assistance translating DNA into protein sequence
0
gravatar for oki4
2.3 years ago by
oki410
oki410 wrote:

Homework Problem: Write a Python program that translates a DNA sequences into a protein sequences. In other words, your script should convert all codons into amino acids. For this problem, the input file is in FASTA format with at least one sequence. You may NOT assume that the sequence length is always a multiple of 3 or that sequences will have the same length. Use the codon2aa map given in the code examples above. You will first have to transcribe the DNA sequence to RNA before being able to use it.

Program so far:

def transcribe(sequence):
    for ch in seq:
        rna_seq = seq.replace('T', 'U')
        return(rna_seq)

def translate_dna(sequence):
    codon2aa = {"AAA":"K", "AAC":"N", "AAG":"K", "AAU":"N", 
                "ACA":"T", "ACC":"T", "ACG":"T", "ACU":"T", 
                "AGA":"R", "AGC":"S", "AGG":"R", "AGU":"S", 
                "AUA":"I", "AUC":"I", "AUG":"M", "AUU":"I", 

                "CAA":"Q", "CAC":"H", "CAG":"Q", "CAU":"H", 
                "CCA":"P", "CCC":"P", "CCG":"P", "CCU":"P", 
                "CGA":"R", "CGC":"R", "CGG":"R", "CGU":"R", 
                "CUA":"L", "CUC":"L", "CUG":"L", "CUU":"L", 

                "GAA":"E", "GAC":"D", "GAG":"E", "GAU":"D", 
                "GCA":"A", "GCC":"A", "GCG":"A", "GCU":"A", 
                "GGA":"G", "GGC":"G", "GGG":"G", "GGU":"G", 
                "GUA":"V", "GUC":"V", "GUG":"V", "GUU":"V", 

                "UAA":"_", "UAC":"Y", "UAG":"_", "UAU":"T", 
                "UCA":"S", "UCC":"S", "UCG":"S", "UCU":"S", 
                "UGA":"_", "UGC":"C", "UGG":"W", "UGU":"C", 
                "UUA":"L", "UUC":"F", "UUG":"L", "UUU":"F"}

    ##the find method will give you the first index position (int) of the codon you're referring to 
    start_codon = sequence.find('ATG')

    ##You create a string of the starting nucleotide (A in this case) and rest of the sequence 
    seq_start = sequence[int(start_codon):]
    ##This selects the first nucleotide in codon/starting with with zero, selects for multiples of 3; n are multiples of 3 starting with 0
    for n in (0, len(seq_start), 3):
        if seq_start[n:n+3] in codon2aa:
            protein_seq += codon2aa[seq_start[n:n+3]]
    return protein_seq


seq = ''
with open('dna.fasta.py') as f_in:
    for line in f_in.readlines():
        if line[0] == '>':
            seq += (line.strip('>') + '     ')      
        else:
            transcribe(line)
            translate_dna(line)
            seq += (line)
print(seq)

Error message I receive: Traceback (most recent call last): File "/Users/Oyinade/Documents/bnfo135/Homework2.py", line 52, in <module> translate_dna(line) File "/Users/Oyinade/Documents/bnfo135/Homework2.py", line 42, in translate_dna return protein_seq UnboundLocalError: local variable 'protein_seq' referenced before assignment

I don't what I'm doing wrong my code, can someone let me know if I'm heading in the correct decision. Thanks.

python • 12k views
ADD COMMENTlink modified 2.3 years ago by a.zielezinski8.6k • written 2.3 years ago by oki410

The last part of my program is in the wrong format, I don't why it did that but here it is correctly formatted:

seq = ""
with open('dna.fasta.py') as f_in:
    for line in f_in.readlines():
        if line[0] == '>':
            seq += (line.strip('>') + '     ')      
        else:
            transcribe(line)
            translate_dna(line)
            seq += (line)
print(seq)
ADD REPLYlink modified 2.3 years ago by WouterDeCoster37k • written 2.3 years ago by oki410
4
gravatar for a.zielezinski
2.3 years ago by
a.zielezinski8.6k
a.zielezinski8.6k wrote:

File: dna.fasta

>seq1
ATGCTGATGATAGGTATGGGTA
GATAGATGAGAGAGATGAGAAT
>seq2
ATGCGATGATAGATG
>seq3
ATGC

File: homework.py

def transcribe(sequence):
    rna_seq = sequence.replace('T', 'U')
    return(rna_seq)

def translate_rna(sequence):
    codon2aa = {"AAA":"K", "AAC":"N", "AAG":"K", "AAU":"N", 
                "ACA":"T", "ACC":"T", "ACG":"T", "ACU":"T", 
                "AGA":"R", "AGC":"S", "AGG":"R", "AGU":"S", 
                "AUA":"I", "AUC":"I", "AUG":"M", "AUU":"I", 

                "CAA":"Q", "CAC":"H", "CAG":"Q", "CAU":"H", 
                "CCA":"P", "CCC":"P", "CCG":"P", "CCU":"P", 
                "CGA":"R", "CGC":"R", "CGG":"R", "CGU":"R", 
                "CUA":"L", "CUC":"L", "CUG":"L", "CUU":"L", 

                "GAA":"E", "GAC":"D", "GAG":"E", "GAU":"D", 
                "GCA":"A", "GCC":"A", "GCG":"A", "GCU":"A", 
                "GGA":"G", "GGC":"G", "GGG":"G", "GGU":"G", 
                "GUA":"V", "GUC":"V", "GUG":"V", "GUU":"V", 

                "UAA":"_", "UAC":"Y", "UAG":"_", "UAU":"T", 
                "UCA":"S", "UCC":"S", "UCG":"S", "UCU":"S", 
                "UGA":"_", "UGC":"C", "UGG":"W", "UGU":"C", 
                "UUA":"L", "UUC":"F", "UUG":"L", "UUU":"F"}

    protein_seq = ''
    for n in range(0, len(sequence), 3):
        if sequence[n:n+3] in codon2aa:
            protein_seq += codon2aa[sequence[n:n+3]]
    return protein_seq


ids, sequences = [], []
n = -1
with open('dna.fasta') as fh:
    for line in fh:
        line = line.strip()
        if line[0] == '>':
            id = line.split()[0][1:]
            ids.append(id)
            sequences.append('')
            n += 1
        else:
            sequences[n] += line


for id, seq in zip(ids, sequences):
    print('>'+id)
    rna = transcribe(seq)
    protein = translate_rna(rna)
    print(protein)

File: homework1.py (for WouterDeCoster)

from itertools import groupby

def transcribe(sequence):
    return sequence.replace('T', 'U')

def translate_rna(s):
    codon2aa = {"AAA":"K", "AAC":"N", "AAG":"K", "AAU":"N", 
                "ACA":"T", "ACC":"T", "ACG":"T", "ACU":"T", 
                "AGA":"R", "AGC":"S", "AGG":"R", "AGU":"S", 
                "AUA":"I", "AUC":"I", "AUG":"M", "AUU":"I", 

                "CAA":"Q", "CAC":"H", "CAG":"Q", "CAU":"H", 
                "CCA":"P", "CCC":"P", "CCG":"P", "CCU":"P", 
                "CGA":"R", "CGC":"R", "CGG":"R", "CGU":"R", 
                "CUA":"L", "CUC":"L", "CUG":"L", "CUU":"L", 

                "GAA":"E", "GAC":"D", "GAG":"E", "GAU":"D", 
                "GCA":"A", "GCC":"A", "GCG":"A", "GCU":"A", 
                "GGA":"G", "GGC":"G", "GGG":"G", "GGU":"G", 
                "GUA":"V", "GUC":"V", "GUG":"V", "GUU":"V", 

                "UAA":"_", "UAC":"Y", "UAG":"_", "UAU":"T", 
                "UCA":"S", "UCC":"S", "UCG":"S", "UCU":"S", 
                "UGA":"_", "UGC":"C", "UGG":"W", "UGU":"C", 
                "UUA":"L", "UUC":"F", "UUG":"L", "UUU":"F"}

    l = [codon2aa.get(s[n:n+3], 'X') for n in range(0, len(s), 3)]
    return "".join(l)


with open('dna.fasta') as h:
    faiter = (x[1] for x in groupby(h, lambda l: l[0] == ">"))
    for header in faiter:
        header = next(header)[1:].strip()
        seq = "".join(s.strip() for s in next(faiter))
        print(header)
        print(translate_rna(transcribe(seq)))

Result:

>seq1 
MLMIGMGR_MREMRX
>seq2 
MR__M
>seq3 
MX
ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by a.zielezinski8.6k
    if sequence[n:n+3] in codon2aa:
        protein_seq += codon2aa[sequence[n:n+3]]

I would do this with a try-except block, now it's a potential silent error...

And why don't you translate the sequences while looping over the input file? No need to store the data in lists...

ADD REPLYlink written 2.3 years ago by WouterDeCoster37k

You are right. But hey, it's a homework, right? I would never write it as I did here.

ADD REPLYlink written 2.3 years ago by a.zielezinski8.6k

Good point, agreed. This code needs more biopython.

ADD REPLYlink written 2.3 years ago by WouterDeCoster37k

I edited my answer. What do you think?

ADD REPLYlink written 2.3 years ago by a.zielezinski8.6k

Looks pretty and probably gets the job done nicely. If you don't mind, some nitpicking comments just because I have 5 minutes until my RNA is ready for the next step in my protocol:

  • I would use return(protein_seq) to be python3 - compatible.
  • I would parse the fasta file using Biopython SeqIO
  • I would use Biopython translate (which also works on DNA, avoiding the transcription step)
  • I would directly print the result without assigning (unnecessary assignments take time and memory). Difference will be very very small in this case but I'm being pedantic here

This block

rna = transcribe(seq)
protein = translate_rna(rna)
print(protein)

could be

print(translate_rna(transcribe(seq)))
ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by WouterDeCoster37k
2

Thanks! Good points except the first one :). I think return value is the recommended way in Python2-3 (if you are only returning 1 value). After all, return is a language construct, not a function. Even if you want to return a tuple, there's no need for any parenthesis. Interesting thing, parsing FASTA sequences using Bio.SeqIO is slow (a lot of if statements are performed).

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by a.zielezinski8.6k
1

Right, I was wrong about return I see. But I don't like the way it looks without parenthesis ;-(

I can imagine that SeqIO is quite slow since it probably doesn't make assumptions about the expected fasta-format but instead checks everything carefully. After all, incorrect user-input is the biggest enemy of every tool. Perhaps your own fasta-parser is faster and fine, but to be 100% sure I would probably sacrifice a bit of execution speed (a comparison could be interesting).

I love how we got carried away after this homework question. Passionate about Python!

ADD REPLYlink written 2.3 years ago by WouterDeCoster37k

Is it possible to have the code to transcribe within the code to translate?

ADD REPLYlink written 3 months ago by mito_ninja0
2
gravatar for WouterDeCoster
2.3 years ago by
Belgium
WouterDeCoster37k wrote:

This is a homework question so try harder ;-) I formatted your code for readability, please use the 101010 button next time.

The error message states exactly what's wrong: local variable 'protein_seq' referenced before assignment It even gives you the line where things went wrong. Python's error messages are great, take your time to read them and you'll solve 99% of errors.

ADD COMMENTlink written 2.3 years ago by WouterDeCoster37k

Ok I created an variable (protein_seq) containing an empty string in the translate_dna function. When I run the program, it does not transcribe and translate the sequence. Am I calling the function incorrectly? Thanks.

seq = ""
with open('dna.fasta.py') as f_in:
    for line in f_in.readlines():
        if line[0] == '>':
            seq += (line.strip('>') + '     ')      
        else:
            transcribe(line)
            translate_dna(line)
            seq += (line)
print(seq)
ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by oki410

The transcribe and translate function are called correctly but you don't use their results.

seq = ""
with open('dna.fasta.py') as f_in:
    for line in f_in.readlines():
        if line[0] == '>':
            seq += (line.strip('>') + '     ')      
        else:
            rna = transcribe(line)
            protein = translate_dna(rna)
            seq += (protein)
print(seq)

You just kept using "line" but it wasn't changed since you didn't use what was returned by the functions

ADD REPLYlink written 2.3 years ago by WouterDeCoster37k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1063 users visited in the last hour