Question

How to convert a multiFASTA file with DNA (nucleotide) sequences into a multiFASTA with protein (amino acid) sequences?

6.2 years ago

Alec Watanabe ▴ 60

Hello guys,

I have a multiFASTA file with DNA (nucleotide) sequences and I want to convert these sequences to protein (AA) sequences. I need to do this conversion because MEDpipe (a webtool) apparently only accepts FASTA files with AA sequences (I presume, because I keep getting errors when I submit the file). Since this multiFASTA file has over 7000 sequences, it would be really time-consuming to translate one sequence at a time (most webtools I've found only accept one sequence at a time). Is there any way to do this? Here is an example of the file I have:

>Proteus_mirabilis_ARLG2970_2781
ATGGAGACAGGTACAGTAAAGTGGTTCAATAATGCTAAGGGCTTTGGTTTTATTACCCCAGCAAACGGTG
GCGAAGATATTTTTGCCCACTATTCAACAATTAGAATGGAAGGCTACCGCACACTTAAAGCGGGGCAGAA
AGTTAATTATAGCACGATAAAAGGGCCTAAAGGTGACCATACTGACCTTATCATTCCTATCATTGAATAG
>Proteus_mirabilis_ARLG2970_0131
ATGTCTGACAAAATGAAAGGTCAAGTTAAGTGGTTCAACGAGTCTAAAGGCTTTGGTTTTATTACTCCAG
CAGACGGAAGCAAAGACGTATTCGTTCACTTTTCTGCCATTCAAGGTAACGGTTTCAAAACTCTGGCTGA
AGGTCAGAACGTAGAATTCACAATTGAAAACGGTGCAAAAGGTCCAGCAGCAGCTAACGTAACAGCTCTG
TAA
>Proteus_penneri_ATCC35198_1543
TTACAGAGCAGTTACGTTAGCAGCTGCTGGACCTTTTGCACCGTTTTCAATTGTGAATTCTACGTTCTGA
CCTTCAGCCAGAGTTTTGAAACCGTTACCTTGAATGGCAGAAAAGTGAACGAATACGTCTTTGCTTCCGT
CTGCTGGAGTAATAAAACCAAAGCCTTTAGACTCGTTGAACCACTTAACTTGACCTTTCATTTTGTCAGA
CAT
>Proteus_vulgaris_FDAARGOS366_2819
TTAGAGAGCCACCACGTTGCCTGCTGCTGGGCCTTTCATACCATTTTCCATGGTGAATGAAACTTGTTGC
CCTTCAGCTAATGTTTTGAAGCTATCACTTTGGATTGCAGAGAAATGTACGAATACATCTTTGCTGCCAT
CAGCTGGAGTAATAAAACCAAAACCTTTACCTTCATCGAACCATTTTACTGTACCAGTCATTGTATTAGA
CAT
>Proteus_mirabilis_ARLG2970_2695
TTACAGAGCGATTACGTTCGCTGCTGCAGGGCCTTTAGCGCCATTTTCAATAGAAAATGAAACTTCTTGG
CCTTCTTTCAGTGACTTGAAGCTTTCACTTTGGATCGCTGAAAAGTGTACGAATACGTCTTTGCTACCGT
CTTTAGGAGTGATAAAACCGAAGCCTTTATCATCGTTAAACCATTTTACTGTACCAGTCATTGTATTAGA
CAT

protein DNA multiFASTA • 5.2k views

ADD COMMENT • link updated 9 months ago by Ram 43k • written 6.2 years ago by Alec Watanabe ▴ 60

For future reference: highlight the text you want to format as code and then click the 101 button in the editor window to format it correctly.

ADD REPLY • link 6.2 years ago by GenoMax 141k

Thank you for the tip, I'm new to the community so I'm still learning, but I'll apply this knowledge now in my future posts!

ADD REPLY • link 6.2 years ago by Alec Watanabe ▴ 60

score 0 · Answer 1 · 2018-02-05

6.2 years ago

GenoMax 141k

You can use EMBOSS transeq via web interface or by downloading/installing EMBOSS.

I got the following using your sequences via web interface in one pass:

>Proteus_mirabilis_ARLG2970_2781_1
METGTVKWFNNAKGFGFITPANGGEDIFAHYSTIRMEGYRTLKAGQKVNYSTIKGPKGDH
TDLIIPIIE*
>Proteus_mirabilis_ARLG2970_0131_1
MSDKMKGQVKWFNESKGFGFITPADGSKDVFVHFSAIQGNGFKTLAEGQNVEFTIENGAK
GPAAANVTAL*
>Proteus_penneri_ATCC35198_1543_1
LQSSYVSSCWTFCTVFNCEFYVLTFSQSFETVTLNGRKVNEYVFASVCWSNKTKAFRLVE
PLNLTFHFVRH
>Proteus_vulgaris_FDAARGOS366_2819_1
LESHHVACCWAFHTIFHGE*NLLPFS*CFEAITLDCREMYEYIFAAISWSNKTKTFTFIE
PFYCTSHCIRH
>Proteus_mirabilis_ARLG2970_2695_1
LQSDYVRCCRAFSAIFNRK*NFLAFFQ*LEAFTLDR*KVYEYVFATVFRSDKTEAFIIVK
PFYCTSHCIRH
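
For anyone who prefers a script over the web form, here is a minimal Python sketch of the same kind of single-frame translation. It is not transeq itself: the function names (`read_fasta`, `translate`, `translate_fasta`) are made up for this example, it only reads frame 1 of the forward strand with the standard genetic code, it writes `X` for any codon containing something other than A, C, G or T, and it leaves the headers unchanged.

```python
from itertools import product

# Standard genetic code in NCBI order (first base T, C, A, G; third base varies fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def translate(dna):
    """Translate frame 1 of a DNA string; '*' marks a stop codon, 'X' an unknown codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def translate_fasta(lines, width=60):
    """Return a protein multiFASTA string for the DNA multiFASTA given as lines."""
    out = []
    for header, seq in read_fasta(lines):
        protein = translate(seq)
        out.append(">" + header)
        out.extend(protein[i:i + width] for i in range(0, len(protein), width))
    return "\n".join(out)
```

To run it on a file, pass an open file handle, e.g. `print(translate_fasta(open("genes.fasta")))`, where `genes.fasta` is a placeholder name.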

ADD COMMENT • link 6.2 years ago by GenoMax 141k

Dear genomax,

Thank you for your response. Do you know if downloading/installing EMBOSS will let me use any file size? The complete multiFASTA file I intend to use is around 8 MB. Also, do these '*' symbols mean that the nucleotide sequence could not be converted? Some webtools use the '*' symbol to indicate that the sequence has a region with a frameshift.

ADD REPLY • link 6.2 years ago by Alec Watanabe ▴ 60

Yes, a local copy of EMBOSS should allow you to use large files. Alternatively, you could break your file up for multiple web submissions.

The above was a simple single-frame translation. The * characters should be stop codons.
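
If you go the route of splitting the file for several web submissions, something like the following Python sketch could do the splitting. The function name `split_fasta` and the default of 1000 records per chunk are assumptions made for illustration; I don't know the actual size limit of the web form, so adjust the number to whatever it accepts.

```python
def split_fasta(lines, records_per_chunk=1000):
    """Group FASTA lines into chunks holding at most records_per_chunk records each.

    Returns a list of chunks; each chunk is a list of lines without line endings.
    """
    chunks, current, count = [], [], 0
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts here; close the current chunk if it is full.
            if count == records_per_chunk:
                chunks.append(current)
                current, count = [], 0
            count += 1
        current.append(line)
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be written to its own file with `"\n".join(chunk)` and submitted separately.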

ADD REPLY • link 6.2 years ago by GenoMax 141k

Hi again! This is just an update.

I've tested EMBOSS transeq and I've concluded that it doesn't work efficiently for this specific purpose. When we translate the sequence, the output contains too many - (dash) symbols. I don't know if there is any way to correctly translate a multiFASTA file, but apparently it's best to use an EMBL file and convert it to multiFASTA (exporting sequences as AA) using any software that allows visualization as previously discussed in another post of mine.

ADD REPLY • link 6.2 years ago by Alec Watanabe ▴ 60

I think you should not be getting any - in your output. See the transeq help page here. Something is not right.

To reiterate my comment from your other thread: if these are standard reference genomes from the Human Microbiome Project, then why are you not getting the protein FASTA files from NCBI? Those will be full protein sequences without these stop codons.

ADD REPLY • link 6.2 years ago by GenoMax 141k

Indeed, you should not.

Are there any IUPAC ambiguity codes in your sequences? Or perhaps some 'hidden' characters (did you open/manipulate the sequences in a DOS/Windows environment)?
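
A quick way to check for both is to count every character in the sequence lines that is not A, C, G or T. Below is a minimal Python sketch; the function name `unexpected_characters` is made up for this example. It deliberately keeps carriage returns (`\r`) so that Windows line endings show up in the report.

```python
from collections import Counter


def unexpected_characters(lines):
    """Count characters other than A, C, G, T in the sequence lines of a FASTA file."""
    counts = Counter()
    for line in lines:
        if line.startswith(">"):
            continue  # header lines may contain anything
        body = line.rstrip("\n")  # strip only \n so a stray \r is still counted
        counts.update(ch for ch in body.upper() if ch not in "ACGT")
    return counts
```

When reading from a file, open it with `open(path, newline="")`; otherwise Python converts `\r\n` to `\n` on reading and the carriage returns will not be visible. An empty result means the sequence lines contain only A, C, G and T.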

ADD REPLY • link 6.2 years ago by lieven.sterck 15k

I just want to add that, to get a meaningful conversion of this DNA to protein, it is probably key to first identify the correct translation frame. It does not really make sense to translate DNA into protein without taking the frame into account.

MEDpipe indeed requires a protein file as input.
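
To illustrate the frame issue: the last three sequences in the question start with TTA and end with CAT, which is what a complete coding sequence looks like when it is read from the opposite strand (TTA is the reverse complement of TAA, CAT of ATG). Here is a minimal Python sketch that picks the strand before translating. The function names are made up for this example, and the check is a simplifying assumption: it only accepts ATG as a start codon, whereas bacterial genes can also start with alternative codons such as GTG or TTG.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def looks_like_cds(dna):
    """True if dna starts with ATG, ends with a stop codon and is a multiple of 3 long."""
    dna = dna.upper()
    return len(dna) % 3 == 0 and dna.startswith("ATG") and dna[-3:] in STOP_CODONS


def orient_cds(dna):
    """Return dna or its reverse complement, whichever looks like a complete CDS.

    Returns None when neither strand passes the check.
    """
    if looks_like_cds(dna):
        return dna.upper()
    rc = reverse_complement(dna)
    if looks_like_cds(rc):
        return rc.upper()
    return None
```

Sequences for which `orient_cds` returns `None` need a closer look (partial genes, alternative start codons, or a real frameshift) rather than a blind frame 1 translation.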

ADD REPLY • link 6.2 years ago by lieven.sterck 15k

Indeed. I was thinking about changing from MEDpipe to inmembrane. From what I understand, inmembrane uses the same SurfG+ software and other dependencies, it also has a gram-negative protocol, and apparently it works with a DNA multiFASTA file. Both tools generate a .csv file with similar output.

ADD REPLY • link 6.2 years ago by Alec Watanabe ▴ 60