DNA to peptide translation and adding to csv file
2.6 years ago
Sara ▴ 240

I have a large CSV file of DNA sequences. Here is a small example (infile.csv):

2840,GTGGCCCGGGAGGCC
291,GCATGTCCGTAGGTTCGT
147,GCATGTCCG

I need to translate each DNA sequence into a peptide sequence and add it as a third column. Here is the expected output:

2840,GTGGCCCGGGAGGCC,VAREA
291,GCATGTCCGTAGGTTCGT,ACP*VR
147,GCATGTCCG,ACP

To do so, I wrote the following short script:

import pandas
df = pandas.read_csv('infile.csv')
seq = csv_data[1]

def translate(seq):
    table = {
        'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
        'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
        'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
        'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
        'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
        'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
        'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
        'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
        'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
        'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
        'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
        'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
        'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
        'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
        'TAC':'Y', 'TAT':'Y', 'TAA':'*', 'TAG':'*',
        'TGC':'C', 'TGT':'C', 'TGA':'*', 'TGG':'W',
    }
    protein =""
    if len(seq)%3 == 0:
        for i in range(0, len(seq), 3):
            codon = seq[i:i + 3]
            protein+= table[codon]
    return protein


peptide_seq=translate(seq)
df[peptide_seq]
df.to_csv("outfile.csv")

But it does not produce the expected output. Do you know how I can change the code to get it?

2.6 years ago
Mensur Dlakic ★ 27k

The code above has several assignment errors that I won't discuss in detail; you should be able to work them out from the corrected code below.

import pandas
# infile.csv has no header row, so read it without one and name the columns.
df = pandas.read_csv('infile.csv', header=None)
df.columns = ['index','DNA_sequence']
seq = df['DNA_sequence']

def translate(seq):
    table = {
        'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
        'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
        'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
        'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
        'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
        'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
        'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
        'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
        'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
        'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
        'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
        'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
        'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
        'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
        'TAC':'Y', 'TAT':'Y', 'TAA':'*', 'TAG':'*',
        'TGC':'C', 'TGT':'C', 'TGA':'*', 'TGG':'W',
    }
    protein = ''
    if len(seq)%3 == 0:
        for i in range(0, len(seq), 3):
            codon = seq[i:i + 3]
            protein += table[codon]
    return protein

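# Translate every sequence and store the peptides as a third column.
peptide_seq = []
for j in range(len(seq)):
    peptide_seq.append(translate(seq[j]))
df['peptide_sequence'] = peptide_seq
# header=False keeps the file header-less, matching the expected output.
df.to_csv('outfile.csv', index=False, header=False)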
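
For reference, the question's script fails for four reasons: `csv_data` is never defined, the file is read as if its first record were a header, `translate` is called once on a whole column rather than on each sequence, and `df[peptide_seq]` looks a column up instead of assigning one. The per-row loop can also be written with `Series.apply`. What follows is only a sketch under a few assumptions of mine: the sample rows are inlined through `io.StringIO` so it runs without infile.csv, the codon table is built compactly from the standard genetic code instead of being typed out, codons missing from the table (for example ones containing `N`) become `X`, and `header=False` is used so the output has no header row, like the expected output in the question.

```python
import io
from itertools import product

import pandas as pd

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq):
    """Translate a DNA string codon by codon.

    Returns '' when the length is not a multiple of 3 (same behaviour as
    the function in this answer). Unknown codons become 'X' (an assumption
    made for this sketch).
    """
    seq = seq.upper()
    if len(seq) % 3 != 0:
        return ""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq), 3)
    )


# Inline copy of the sample rows; replace with 'infile.csv' for real data.
sample = io.StringIO(
    "2840,GTGGCCCGGGAGGCC\n291,GCATGTCCGTAGGTTCGT\n147,GCATGTCCG\n"
)
df = pd.read_csv(sample, header=None, names=["index", "DNA_sequence"])

# apply() calls translate once per sequence, replacing the explicit loop.
df["peptide_sequence"] = df["DNA_sequence"].apply(translate)

# With no path given, to_csv returns the CSV text instead of writing a file.
csv_text = df.to_csv(index=False, header=False)
print(csv_text)
```

Note that a sequence whose length is not a multiple of 3 gets an empty peptide here rather than a partial translation.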
2.6 years ago
$ echo -e "2840,GTGGCCCGGGAGGCC\n291,GCATGTCCGTAGGTTCGT\n147,GCATGTCCG"  | awk -F, 'function translate(S) {L=length(S);P="";for(i=1;i+2<=L;i+=3) P=sprintf("%s%s",P,h[substr(S,i,3)]);return P;} BEGIN{h["AAA"]="K";h["AAC"]="N";h["AAG"]="K";h["AAT"]="N";h["ACA"]="T";h["ACC"]="T";h["ACG"]="T";h["ACT"]="T";h["AGA"]="R";h["AGC"]="S";h["AGG"]="R";h["AGT"]="S";h["ATA"]="I";h["ATC"]="I";h["ATG"]="M";h["ATT"]="I";h["CAA"]="Q";h["CAC"]="H";h["CAG"]="Q";h["CAT"]="H";h["CCA"]="P";h["CCC"]="P";h["CCG"]="P";h["CCT"]="P";h["CGA"]="R";h["CGC"]="R";h["CGG"]="R";h["CGT"]="R";h["CTA"]="L";h["CTC"]="L";h["CTG"]="L";h["CTT"]="L";h["GAA"]="E";h["GAC"]="D";h["GAG"]="E";h["GAT"]="D";h["GCA"]="A";h["GCC"]="A";h["GCG"]="A";h["GCT"]="A";h["GGA"]="G";h["GGC"]="G";h["GGG"]="G";h["GGT"]="G";h["GTA"]="V";h["GTC"]="V";h["GTG"]="V";h["GTT"]="V";h["TAA"]="*";h["TAC"]="Y";h["TAG"]="*";h["TAT"]="Y";h["TCA"]="S";h["TCC"]="S";h["TCG"]="S";h["TCT"]="S";h["TGA"]="*";h["TGC"]="C";h["TGG"]="W";h["TGT"]="C";h["TTA"]="L";h["TTC"]="F";h["TTG"]="L";h["TTT"]="F";} {OFS=",";print $1,$2,translate($2);}'

2840,GTGGCCCGGGAGGCC,VAREA
291,GCATGTCCGTAGGTTCGT,ACP*VR
147,GCATGTCCG,ACP
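
For readers who prefer Python without pandas, the same streaming approach can be sketched with the standard `csv` module. This is my rough equivalent of the awk command, not a line-by-line port: like the awk loop it translates only complete codons (trailing bases are dropped) and contributes nothing for codons missing from the table. The function names are hypothetical, the sample rows are inlined through `io.StringIO`, and rows with fewer than two fields are skipped, which awk would not do.

```python
import csv
import io
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_complete_codons(seq):
    """Translate every complete codon; ignore 1-2 leftover trailing bases."""
    usable = len(seq) - len(seq) % 3
    # Codons not in the table add nothing, as with an unset awk array entry.
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "") for i in range(0, usable, 3)
    )


def add_peptide_column(infile, outfile):
    """Copy id,sequence rows from infile to outfile with a peptide column."""
    writer = csv.writer(outfile, lineterminator="\n")
    for row in csv.reader(infile):
        if len(row) < 2:
            continue  # skip blank or malformed rows (a choice of this sketch)
        writer.writerow(row[:2] + [translate_complete_codons(row[1])])


# Inline copy of the sample rows; open real files in place of these buffers.
source = io.StringIO(
    "2840,GTGGCCCGGGAGGCC\n291,GCATGTCCGTAGGTTCGT\n147,GCATGTCCG\n"
)
result = io.StringIO()
add_peptide_column(source, result)
print(result.getvalue(), end="")
```

Because rows are processed one at a time, memory use stays flat even for a very large input file.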