Question: convert DNA FASTA to protein FASTA format (faa)
3.3 years ago by
nkuo20 wrote:


I have de novo assembled whole-genome DNA sequences and an EMBL annotation file that contains CDS regions for multiple genes. Is there a tool that can first parse the EMBL annotation file to obtain the CDS positions, and then use those genome positions to convert each DNA FASTA file into a multi-FASTA file containing the translated amino acid sequence (.faa) for every CDS range given in the EMBL file, for every isolate?


sequence • 3.2k views

It's the kind of tool that bioinformaticians usually brew themselves. I am not aware of an existing tool that does it, but in principle it's quite easy to achieve as long as you have the genetic code in the form of a dictionary / hash.

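E.g. for Python (note that the keys are RNA codons, so replace T with U in a DNA sequence before looking codons up):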

TRANSLATE = {"UUU":"F", "UUC":"F", "UUA":"L", "UUG":"L",
    "UCU":"S", "UCC":"S", "UCA":"S", "UCG":"S",
    "UAU":"Y", "UAC":"Y", "UAA":"STOP", "UAG":"STOP",
    "UGU":"C", "UGC":"C", "UGA":"STOP", "UGG":"W",
    "CUU":"L", "CUC":"L", "CUA":"L", "CUG":"L",
    "CCU":"P", "CCC":"P", "CCA":"P", "CCG":"P",
    "CAU":"H", "CAC":"H", "CAA":"Q", "CAG":"Q",
    "CGU":"R", "CGC":"R", "CGA":"R", "CGG":"R",
    "AUU":"I", "AUC":"I", "AUA":"I", "AUG":"M",
    "ACU":"T", "ACC":"T", "ACA":"T", "ACG":"T",
    "AAU":"N", "AAC":"N", "AAA":"K", "AAG":"K",
    "AGU":"S", "AGC":"S", "AGA":"R", "AGG":"R",
    "GUU":"V", "GUC":"V", "GUA":"V", "GUG":"V",
    "GCU":"A", "GCC":"A", "GCA":"A", "GCG":"A",
    "GAU":"D", "GAC":"D", "GAA":"E", "GAG":"E",
    "GGU":"G", "GGC":"G", "GGA":"G", "GGG":"G",}
written 3.3 years ago by Macspider (3.1k), modified 3.3 years ago
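The codon dictionary in the comment above could be applied to DNA input roughly as follows. This is only a sketch: it assumes the standard genetic code and a sequence that is already in frame and on the coding strand. The names `CODON_TABLE` and `translate_dna` are made up for this illustration, the table is keyed by DNA codons (T instead of U), and `*` stands for a stop codon.

```python
from itertools import product

# Standard genetic code, built from the usual TCAG ordering of the three
# codon positions. Keys are DNA codons; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_dna(dna, to_stop=True):
    """Translate an in-frame coding sequence into a protein string.

    Unknown codons (e.g. containing N) become "X"; trailing bases that do
    not fill a whole codon are ignored.
    """
    dna = dna.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        if aa == "*" and to_stop:
            break  # stop translating at the first stop codon
        protein.append(aa)
    return "".join(protein)
```

Genomes that use a different genetic code would need a different amino acid string, and CDS features on the reverse strand have to be reverse-complemented before translation.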

If you already have the CDS multifasta file, translating it is fairly trivial. Do you already have this or need to generate it from the EMBL file?

written 3.3 years ago by Joe (17k)

No, we do not have a CDS multi-FASTA file yet; we need to generate it from the EMBL file.

written 3.3 years ago by nkuo (20)
3.3 years ago by
United Kingdom
Joe17k wrote:

Something like this? Extracting All Cds From A Embl File
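To give a feel for what extracting CDS positions involves, here is a rough standard-library sketch. It assumes the usual EMBL feature-table layout (lines starting with `FT`, then the feature key, then the location) and only handles simple `start..end` and `complement(start..end)` locations; `join(...)` locations are skipped. The function names `cds_locations` and `extract_cds` are hypothetical, and a real parser such as Biopython is a safer choice for production use.

```python
import re

# Matches e.g. "FT   CDS             1..1200" or
#              "FT   CDS             complement(<1300..2500)"
CDS_LINE = re.compile(
    r"^FT\s{3}CDS\s+(complement\()?<?(\d+)\.\.>?(\d+)\)?\s*$"
)


def cds_locations(embl_text):
    """Return (start, end, strand) tuples for simple CDS locations.

    Coordinates are 1-based and inclusive, as written in the EMBL file;
    strand is 1 for forward and -1 for complement.
    """
    locations = []
    for line in embl_text.splitlines():
        match = CDS_LINE.match(line)
        if match:
            strand = -1 if match.group(1) else 1
            locations.append((int(match.group(2)), int(match.group(3)), strand))
    return locations


def extract_cds(genome, start, end, strand):
    """Slice a CDS out of a genome string, reverse-complementing if needed."""
    seq = genome[start - 1:end].upper()  # convert to 0-based, end-exclusive
    if strand == -1:
        seq = seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]
    return seq
```

Each extracted nucleotide sequence can then be translated and written out as one record of the .faa file for that isolate.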

Biopython and other parsers should be able to handle EMBL format. Once you parse the file (probably with SeqIO), each record is a SeqRecord that stores all the CDS features. You can then loop over those features and write out the protein sequence for each CDS.
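A sketch of what that could look like with Biopython is below. Several things here are assumptions rather than anything stated in the thread: the EMBL file is assumed to contain the nucleotide sequence as well as the annotation, translation table 11 (bacterial) is only a guess at the organism type, records are named by their `locus_tag` qualifier when one exists, and the function names `fasta_record` and `embl_to_faa` are invented for this example.

```python
import textwrap


def fasta_record(header, sequence, width=60):
    """Format one FASTA record with the sequence wrapped at `width`."""
    return ">" + header + "\n" + "\n".join(textwrap.wrap(sequence, width)) + "\n"


def embl_to_faa(embl_path, faa_path, table=11):
    """Write one protein FASTA record per CDS feature; return the count."""
    # Biopython is a third-party package, so it is imported only when the
    # function is actually called.
    from Bio import SeqIO

    count = 0
    with open(faa_path, "w") as out:
        for record in SeqIO.parse(embl_path, "embl"):
            for feature in record.features:
                if feature.type != "CDS":
                    continue
                # Use the annotated /translation qualifier when present.
                protein = feature.qualifiers.get("translation", [None])[0]
                if protein is None:
                    # Otherwise translate the extracted nucleotides; extract()
                    # handles strand and joined locations.
                    nucleotides = feature.extract(record.seq)
                    protein = str(nucleotides.translate(table=table, to_stop=True))
                default_name = "cds_%d" % (count + 1)
                name = feature.qualifiers.get("locus_tag", [default_name])[0]
                out.write(fasta_record("%s %s" % (name, record.id), protein))
                count += 1
    return count
```

If the annotation and the assembled sequences live in separate files, the CDS locations would instead have to be applied to the sequence read from each isolate's FASTA file.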

I answered basically this task before in "A: How do can I use Biopython and SeqIO to parse out multiple genes from several NC" (see the # script). You could maybe even consider first converting your EMBL file to GenBank format, and then all the scripts in that second thread would work.
