Question

convert DNA FASTA to protein FASTA format (faa)

1

8.2 years ago

nkuo ▴ 30

Evening.

I have de novo assembled whole-genome DNA sequences and an EMBL annotation file that contains the CDS regions for multiple genes. Are there any tools that can first parse the EMBL annotation file to obtain the CDS positions, and then use those positions to convert each DNA FASTA file into a multi-FASTA file (.faa) containing the translated amino acid sequence for every CDS range in the EMBL file, for every isolate?

Thanks

sequence • 7.7k views


0

It's the kind of tool that bioinformaticians usually write themselves. I am not aware of an existing tool that does it, but in principle it is quite easy to achieve as long as you have the genetic code in the form of a dictionary / hash.

E.g., for Python:

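# Standard genetic code, keyed by RNA codons: replace T with U before looking up codons from a DNA sequence.
TRANSLATE = {"UUU":"F", "UUC":"F", "UUA":"L", "UUG":"L",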
    "UCU":"S", "UCC":"S", "UCA":"S", "UCG":"S",
    "UAU":"Y", "UAC":"Y", "UAA":"STOP", "UAG":"STOP",
    "UGU":"C", "UGC":"C", "UGA":"STOP", "UGG":"W",
    "CUU":"L", "CUC":"L", "CUA":"L", "CUG":"L",
    "CCU":"P", "CCC":"P", "CCA":"P", "CCG":"P",
    "CAU":"H", "CAC":"H", "CAA":"Q", "CAG":"Q",
    "CGU":"R", "CGC":"R", "CGA":"R", "CGG":"R",
    "AUU":"I", "AUC":"I", "AUA":"I", "AUG":"M",
    "ACU":"T", "ACC":"T", "ACA":"T", "ACG":"T",
    "AAU":"N", "AAC":"N", "AAA":"K", "AAG":"K",
    "AGU":"S", "AGC":"S", "AGA":"R", "AGG":"R",
    "GUU":"V", "GUC":"V", "GUA":"V", "GUG":"V",
    "GCU":"A", "GCC":"A", "GCA":"A", "GCG":"A",
    "GAU":"D", "GAC":"D", "GAA":"E", "GAG":"E",
    "GGU":"G", "GGC":"G", "GGA":"G", "GGG":"G",}

8.2 years ago by Matteo Schiavinato ★ 3.7k
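As a rough illustration of the dictionary approach in the reply above, here is a minimal sketch of a translation function. The names `CODON_TABLE` and `translate` are made up for this example, stops are written as `*` rather than `STOP`, and unknown codons (e.g. containing N) are assumed to map to `X`. It only handles a forward-strand CDS that is already cut out of the genome; reverse-strand features would need reverse-complementing first.

```python
from itertools import product

# Standard genetic code (NCBI table 1), built in U, C, A, G order.
# Keys are RNA codons, as in the dictionary above; "*" marks a stop codon.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, to_stop=True):
    """Translate a forward-strand DNA coding sequence into protein.

    Trailing bases that do not fill a codon are ignored. Codons that are
    not in the table (for example, ones containing N) become "X".
    """
    rna = dna.upper().replace("T", "U")
    protein = []
    for i in range(0, len(rna) - len(rna) % 3, 3):
        aa = CODON_TABLE.get(rna[i:i + 3], "X")
        if aa == "*" and to_stop:
            break  # stop at the first in-frame stop codon
        protein.append(aa)
    return "".join(protein)
```

Calling `translate(cds_sequence)` on each extracted CDS and writing the result under a `>` header line would give the .faa records.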

0

If you already have the CDS multifasta file, translating it is fairly trivial. Do you already have this or need to generate it from the EMBL file?

8.2 years ago by Joe 22k

0

No, we do not have a CDS multi-FASTA file yet; we need to generate it from the EMBL file.

8.2 years ago by nkuo ▴ 30

score 1 · Answer 1 · 2017-04-18

Something like this? See the thread "Extracting All Cds From A Embl File".

Biopython and other parsers should be able to handle the EMBL format. Once you parse the file (probably with SeqIO), all the CDS features are stored in a SeqRecord. You can then loop over those CDS features and have it write out the protein sequence for each one.
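A minimal sketch of that idea with Biopython might look like the following. The function names `cds_proteins` and `embl_to_faa` are hypothetical, the file paths are placeholders, and it assumes the EMBL file contains the sequence as well as the annotation. The default `table=1` (standard code) is also an assumption; bacterial genomes would normally use `table=11`.

```python
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord


def cds_proteins(record, table=1):
    """Yield one protein SeqRecord per CDS feature of an annotated record."""
    cds_features = (f for f in record.features if f.type == "CDS")
    for n, feature in enumerate(cds_features, start=1):
        # extract() follows the feature location, so reverse-strand and
        # joined (multi-part) CDS features are handled by Biopython.
        nucleotides = feature.extract(record.seq)
        protein = nucleotides.translate(table=table, to_stop=True)
        # Use the locus_tag as the FASTA ID when the annotation has one.
        fallback = f"{record.id}_cds{n}"
        name = feature.qualifiers.get("locus_tag", [fallback])[0]
        yield SeqRecord(protein, id=name, description="")


def embl_to_faa(embl_path, faa_path, table=1):
    """Write every translated CDS in an EMBL file to a protein FASTA file."""
    records = SeqIO.parse(embl_path, "embl")
    proteins = (p for rec in records for p in cds_proteins(rec, table))
    return SeqIO.write(proteins, faa_path, "fasta")  # number of records written
```

Usage would be something like `embl_to_faa("isolate1.embl", "isolate1.faa")`, repeated per isolate. If the annotation and the assembled sequence live in separate files, the same `feature.extract(...)` call could be applied to the sequence read from the FASTA file instead, provided the coordinates match that assembly.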

I answered basically the same task before here: A: How do can I use Biopython and SeqIO to parse out multiple genes from several NC (see the # genbank2fasta.py script). You could even consider first converting your EMBL file to GenBank format, and then all the scripts in that second thread would work.