Convert DNA FASTA to protein FASTA format (.faa)
7.0 years ago
nkuo ▴ 30

Evening.

I have de novo assembled whole-genome DNA sequences and an EMBL annotation file that contains CDS regions for multiple genes. Is there a tool that can first parse the EMBL annotation file to obtain the CDS positions, and then use those positions to convert each DNA FASTA file into a multi-FASTA file containing the translated amino acid sequence (.faa) for every CDS range given in the EMBL file, for every isolate?

Thanks

sequence • 6.5k views

It's the kind of tool that bioinformaticians usually brew themselves. I am not aware of a tool that does it, but in principle it's quite easy to achieve as long as you have the genetic code in the form of a dictionary / hash.

E.g. in Python (note that the keys are RNA codons, so convert T to U in a DNA sequence before looking codons up):

TRANSLATE = {"UUU":"F", "UUC":"F", "UUA":"L", "UUG":"L",
    "UCU":"S", "UCC":"S", "UCA":"S", "UCG":"S",
    "UAU":"Y", "UAC":"Y", "UAA":"STOP", "UAG":"STOP",
    "UGU":"C", "UGC":"C", "UGA":"STOP", "UGG":"W",
    "CUU":"L", "CUC":"L", "CUA":"L", "CUG":"L",
    "CCU":"P", "CCC":"P", "CCA":"P", "CCG":"P",
    "CAU":"H", "CAC":"H", "CAA":"Q", "CAG":"Q",
    "CGU":"R", "CGC":"R", "CGA":"R", "CGG":"R",
    "AUU":"I", "AUC":"I", "AUA":"I", "AUG":"M",
    "ACU":"T", "ACC":"T", "ACA":"T", "ACG":"T",
    "AAU":"N", "AAC":"N", "AAA":"K", "AAG":"K",
    "AGU":"S", "AGC":"S", "AGA":"R", "AGG":"R",
    "GUU":"V", "GUC":"V", "GUA":"V", "GUG":"V",
    "GCU":"A", "GCC":"A", "GCA":"A", "GCG":"A",
    "GAU":"D", "GAC":"D", "GAA":"E", "GAG":"E",
    "GGU":"G", "GGC":"G", "GGA":"G", "GGG":"G",}
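
To show how such a dictionary would be used, here is a minimal sketch. It rebuilds the same standard genetic code compactly (RNA codons, "STOP" for stop codons) so that the snippet runs on its own. The function name translate_dna and the "X" placeholder for unrecognised codons are my own choices for illustration, and it only handles the forward strand in frame 0.

```python
from itertools import product

# Rebuild the standard genetic code with RNA codons as keys and "STOP"
# for stop codons. The amino-acid string lists residues in UCAG codon order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
TRANSLATE = {
    "".join(codon): ("STOP" if aa == "*" else aa)
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_dna(dna):
    """Translate a DNA coding sequence (forward strand, frame 0) to protein."""
    # The table is keyed by RNA codons, so convert T to U first.
    rna = dna.upper().replace("T", "U")
    protein = []
    # Walk complete codons only; a trailing 1-2 bases are ignored.
    for i in range(0, len(rna) - len(rna) % 3, 3):
        # "X" is a placeholder for codons not in the table (e.g. containing N).
        residue = TRANSLATE.get(rna[i:i + 3], "X")
        if residue == "STOP":
            break
        protein.append(residue)
    return "".join(protein)


print(translate_dna("ATGGCCTAA"))
```

For a CDS on the reverse strand you would reverse-complement the extracted region before passing it in.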

If you already have the CDS multifasta file, translating it is fairly trivial. Do you already have this or need to generate it from the EMBL file?
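
For the case where the CDS multifasta does exist, a rough standard-library sketch of the translation step could look like the following. The helper names (read_fasta, translate, translate_fasta) are made up for this example, and it assumes the standard genetic code and that every entry is already in frame on the coding strand.

```python
from itertools import product

# Standard genetic code keyed by DNA codons (TCAG order); "*" marks a stop codon.
CODON_TABLE = dict(zip(
    ("".join(codon) for codon in product("TCAG", repeat=3)),
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
))


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def translate(cds):
    """Translate frame 0 of a CDS, stopping at the first stop codon."""
    cds = cds.upper()
    residues = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # "X" for unrecognised codons
        if aa == "*":
            break
        residues.append(aa)
    return "".join(residues)


def translate_fasta(lines):
    """Yield (header, protein) for every entry in a CDS multi-FASTA."""
    for header, cds in read_fasta(lines):
        yield header, translate(cds)
```

With hypothetical file names, writing the .faa would then be something like: open "cds.fasta" for reading and "proteins.faa" for writing, and for each (header, protein) pair from translate_fasta(infile) write ">" + header and the protein on separate lines.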


No, we do not have a CDS multifasta file yet; we need to generate it from the EMBL file.

7.0 years ago
Joe 21k

Something like this? Extracting All Cds From A Embl File

Biopython and other parsers should be able to handle EMBL format. Once you parse the file (probably with SeqIO), all the CDS features are stored in a SeqRecord, and you can then have it write out the protein sequence for each CDS feature.

I answered basically this task before here: A: How do can I use Biopython and SeqIO to parse out multiple genes from several NC (see the # genbank2fasta.py script). You could maybe even consider first converting your EMBL file to GenBank format, and then all the scripts in that second thread would work.
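
As a rough sketch of the Biopython route (not the script from the linked thread): the code below takes CDS locations from the EMBL annotation and cuts them out of a separately supplied sequence. The function names, the choice of translation table 11 (bacterial) and the pairing of EMBL records with FASTA contigs by order are all assumptions to adapt to your data.

```python
from Bio import SeqIO


def cds_proteins(annotation, sequence, table=11):
    """Yield (name, protein) for each CDS feature of `annotation`, cut from `sequence`."""
    for index, feature in enumerate(annotation.features, start=1):
        if feature.type != "CDS":
            continue
        qualifiers = feature.qualifiers
        # Prefer locus_tag, then gene; otherwise fall back to a positional name.
        names = qualifiers.get("locus_tag") or qualifiers.get("gene")
        name = names[0] if names else f"{annotation.id}_cds{index}"
        # extract() follows the feature location, including complement() and join().
        nucleotides = feature.extract(sequence)
        # table=11 is an assumption (bacterial code); to_stop drops the stop codon.
        yield name, str(nucleotides.translate(table=table, to_stop=True))


def embl_to_faa(embl_path, fasta_path, faa_path, table=11):
    """Write one protein FASTA entry per annotated CDS.

    Assumption: the EMBL records and the FASTA contigs are in the same order
    and share coordinates, so they can be paired with zip().
    """
    pairs = zip(SeqIO.parse(embl_path, "embl"), SeqIO.parse(fasta_path, "fasta"))
    with open(faa_path, "w") as out:
        for annotation, contig in pairs:
            for name, protein in cds_proteins(annotation, contig.seq, table):
                out.write(f">{name}\n{protein}\n")
```

With hypothetical file names, a call would be embl_to_faa("annotation.embl", "isolate1.fasta", "isolate1.faa"), repeated per isolate. If the EMBL file already carries the sequence, passing annotation.seq to cds_proteins should work too. For the EMBL to GenBank conversion, SeqIO.convert("annotation.embl", "embl", "annotation.gbk", "genbank") is the usual one-liner.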
