Question: How to convert a multiFASTA file with DNA (nucleotide) sequences into a multiFASTA with protein (amino acid) sequences?
13 months ago, Alec Watanabe wrote:

Hello guys,

I have a multiFASTA file with DNA (nucleotide) sequences, and I want to convert these sequences to protein (AA) sequences. I need this conversion because MEDpipe (a web tool) apparently only accepts FASTA files with AA sequences (I presume so, because I keep getting errors when I submit the file). Since this multiFASTA file has over 7000 sequences, translating one sequence at a time would be really time-consuming (most web tools I've found accept only one sequence at a time). Is there any way to do this for the whole file at once? Here is an example of the file I have:

>Proteus_mirabilis_ARLG2970_2781
ATGGAGACAGGTACAGTAAAGTGGTTCAATAATGCTAAGGGCTTTGGTTTTATTACCCCAGCAAACGGTG
GCGAAGATATTTTTGCCCACTATTCAACAATTAGAATGGAAGGCTACCGCACACTTAAAGCGGGGCAGAA
AGTTAATTATAGCACGATAAAAGGGCCTAAAGGTGACCATACTGACCTTATCATTCCTATCATTGAATAG
>Proteus_mirabilis_ARLG2970_0131
ATGTCTGACAAAATGAAAGGTCAAGTTAAGTGGTTCAACGAGTCTAAAGGCTTTGGTTTTATTACTCCAG
CAGACGGAAGCAAAGACGTATTCGTTCACTTTTCTGCCATTCAAGGTAACGGTTTCAAAACTCTGGCTGA
AGGTCAGAACGTAGAATTCACAATTGAAAACGGTGCAAAAGGTCCAGCAGCAGCTAACGTAACAGCTCTG
TAA
>Proteus_penneri_ATCC35198_1543
TTACAGAGCAGTTACGTTAGCAGCTGCTGGACCTTTTGCACCGTTTTCAATTGTGAATTCTACGTTCTGA
CCTTCAGCCAGAGTTTTGAAACCGTTACCTTGAATGGCAGAAAAGTGAACGAATACGTCTTTGCTTCCGT
CTGCTGGAGTAATAAAACCAAAGCCTTTAGACTCGTTGAACCACTTAACTTGACCTTTCATTTTGTCAGA
CAT
>Proteus_vulgaris_FDAARGOS366_2819
TTAGAGAGCCACCACGTTGCCTGCTGCTGGGCCTTTCATACCATTTTCCATGGTGAATGAAACTTGTTGC
CCTTCAGCTAATGTTTTGAAGCTATCACTTTGGATTGCAGAGAAATGTACGAATACATCTTTGCTGCCAT
CAGCTGGAGTAATAAAACCAAAACCTTTACCTTCATCGAACCATTTTACTGTACCAGTCATTGTATTAGA
CAT
>Proteus_mirabilis_ARLG2970_2695
TTACAGAGCGATTACGTTCGCTGCTGCAGGGCCTTTAGCGCCATTTTCAATAGAAAATGAAACTTCTTGG
CCTTCTTTCAGTGACTTGAAGCTTTCACTTTGGATCGCTGAAAAGTGTACGAATACGTCTTTGCTACCGT
CTTTAGGAGTGATAAAACCGAAGCCTTTATCATCGTTAAACCATTTTACTGTACCAGTCATTGTATTAGA
CAT
modified 13 months ago by genomax • written 13 months ago by Alec Watanabe

For future reference: highlight the text you want to format as code and then click the 101 button in the editor window to format it correctly.

written 13 months ago by genomax

Thank you for the tip, I'm new to the community so I'm still learning, but I'll apply this knowledge now in my future posts!

written 13 months ago by Alec Watanabe
13 months ago, genomax (United States) wrote:

You can use EMBOSS transeq, either via its web interface or by downloading and installing EMBOSS locally.

I got the following from your sequences through the web interface in one pass:

>Proteus_mirabilis_ARLG2970_2781_1
METGTVKWFNNAKGFGFITPANGGEDIFAHYSTIRMEGYRTLKAGQKVNYSTIKGPKGDH
TDLIIPIIE*
>Proteus_mirabilis_ARLG2970_0131_1
MSDKMKGQVKWFNESKGFGFITPADGSKDVFVHFSAIQGNGFKTLAEGQNVEFTIENGAK
GPAAANVTAL*
>Proteus_penneri_ATCC35198_1543_1
LQSSYVSSCWTFCTVFNCEFYVLTFSQSFETVTLNGRKVNEYVFASVCWSNKTKAFRLVE
PLNLTFHFVRH
>Proteus_vulgaris_FDAARGOS366_2819_1
LESHHVACCWAFHTIFHGE*NLLPFS*CFEAITLDCREMYEYIFAAISWSNKTKTFTFIE
PFYCTSHCIRH
>Proteus_mirabilis_ARLG2970_2695_1
LQSDYVRCCRAFSAIFNRK*NFLAFFQ*LEAFTLDR*KVYEYVFATVFRSDKTEAFIIVK
PFYCTSHCIRH
modified 13 months ago • written 13 months ago by genomax
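For readers who prefer a script, a single-frame translation like the one above can be sketched in plain Python. This is a minimal illustration, not transeq itself: it assumes the standard genetic code (NCBI translation table 1), translates frame 1 only, and writes any codon it does not recognise as X. The function names (`read_fasta`, `translate`, `translate_fasta`) and the `_1` header suffix are choices made for this example.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def translate(dna):
    """Translate frame 1 of a DNA string; unrecognised codons become 'X'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3)
    )


def translate_fasta(lines):
    """Yield (header, protein) pairs for every record in a multiFASTA."""
    for header, seq in read_fasta(lines):
        yield header + "_1", translate(seq)


# Small in-memory example; for a real file, pass an open file handle instead.
example = [">demo", "ATGGAGACAGGT", "ACAGTATAG"]
for name, protein in translate_fasta(example):
    print(">" + name)
    print(protein)
```

Because `read_fasta` streams the input record by record, the same functions should handle a file with thousands of sequences when given an open file handle.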

Dear genomax,

Thank you for your response. Do you know whether a downloaded/installed copy of EMBOSS will let me use any file size? The complete multiFASTA file I intend to use is around 8 MB. Also, do these '*' symbols mean that the nucleotide sequence could not be converted? Some web tools use the '*' symbol to indicate that the sequence has a region with a frameshift.

written 13 months ago by Alec Watanabe

Yes, a local copy of EMBOSS should allow you to use large files. Alternatively, you could break your file up for multiple web submissions.

The output above is a simple single-frame translation. The * characters should be stop codons.

written 13 months ago by genomax
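If you go the route of breaking the file up for several web submissions, the splitting can be sketched as below. The chunk size is a placeholder: the thread does not state what limit the web interface imposes, so `records_per_chunk` is an assumption you would need to adjust, and `split_fasta_records` is a name invented for this example.

```python
def split_fasta_records(lines, records_per_chunk):
    """Yield lists of FASTA lines, each holding at most records_per_chunk records."""
    if records_per_chunk < 1:
        raise ValueError("records_per_chunk must be at least 1")
    chunk, count = [], 0
    for line in lines:
        if line.startswith(">"):
            # A new record begins; flush the current chunk first if it is full.
            if count == records_per_chunk:
                yield chunk
                chunk, count = [], 0
            count += 1
        chunk.append(line)
    if chunk:
        yield chunk


# Small in-memory example with three records split into chunks of two.
demo = [">a", "ACGT", ">b", "GGCC", ">c", "TTAA"]
for number, part in enumerate(split_fasta_records(demo, 2), start=1):
    print("chunk", number, part)
```

Each yielded list can be written to its own file and submitted separately. Records are never split across chunks, so every chunk is itself a valid multiFASTA.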

Hi again! This is just an update.

I've tested EMBOSS transeq and concluded that it doesn't work well for this specific purpose. When we translate the sequences, the output contains too many - (dash) symbols. I don't know whether there is any way to correctly translate a multiFASTA file, but apparently it's best to use an EMBL file and convert it to multiFASTA (exporting sequences as AA) using any software that allows visualization, as previously discussed in another post of mine.

written 13 months ago by Alec Watanabe

I think you should not be getting any - in your output. See the transeq help page. Something is not right.

To reiterate my comment from your other thread: if these are standard reference genomes from the Human Microbiome Project, then why are you not getting the protein FASTA files from NCBI? Those sequences will be full protein sequences without these stop codons.

modified 13 months ago • written 13 months ago by genomax

Indeed, you should not.

Are there any IUPAC ambiguity codes in your sequences? Or perhaps some 'hidden' characters (did you open or manipulate the sequences in a DOS/Windows environment)?

written 13 months ago by lieven.sterck
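One way to check for ambiguity codes or hidden characters is to scan the file before translating it. The sketch below is only an illustration: `scan_fasta_text` is a name made up for this example, it treats A, C, G and T as the only expected sequence characters, and it reports IUPAC ambiguity codes, carriage returns (typical of files saved on Windows) and anything else separately.

```python
from collections import Counter

# IUPAC nucleotide ambiguity codes (everything except plain A, C, G, T).
IUPAC_AMBIGUITY = set("RYSWKMBDHVN")


def scan_fasta_text(text):
    """Count ambiguity codes, carriage returns and other unexpected characters."""
    report = {"ambiguity": Counter(), "carriage_returns": 0, "other": Counter()}
    for line in text.split("\n"):
        if line.startswith(">"):
            # Header lines may contain anything; only look for stray carriage returns.
            report["carriage_returns"] += line.count("\r")
            continue
        for ch in line:
            upper = ch.upper()
            if ch == "\r":
                report["carriage_returns"] += 1
            elif upper in ("A", "C", "G", "T"):
                continue
            elif upper in IUPAC_AMBIGUITY:
                report["ambiguity"][upper] += 1
            else:
                report["other"][ch] += 1
    return report


# Small in-memory example with Windows line endings, an N, a dash and a space.
sample = ">seq1\r\nACGTN\r\nAC-GT \n"
print(scan_fasta_text(sample))
```

When reading a real file, open it with `open(path, newline="")` so that Python does not silently convert Windows line endings before the scan sees them.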

I just want to add that, to get a meaningful conversion of this DNA to protein, it is probably key to first identify the correct translation frame. It does not really make sense to translate DNA into protein without taking the frame into account.

MEDpipe indeed requires a protein file as input.

written 13 months ago by lieven.sterck
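On the frame question: in the example file, the last three records start with TTA and end with CAT, which is what a coding sequence running from ATG to TAA looks like when written as its reverse complement. That would explain the internal * characters in their frame-1 translations. A rough way to put such records in coding orientation before translating is sketched below. It is a heuristic only: it assumes complete coding sequences that begin with ATG and end with a standard stop codon, so genes with alternative start codons (GTG or TTG, for example) will be reported as undetermined. The names `reverse_complement`, `looks_like_cds` and `orient_cds` are invented for this example.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.translate(COMPLEMENT)[::-1]


def looks_like_cds(dna):
    """True if dna is a whole number of codons, starts with ATG and ends with a stop."""
    dna = dna.upper()
    return len(dna) % 3 == 0 and dna.startswith("ATG") and dna[-3:] in STOP_CODONS


def orient_cds(dna):
    """Return (sequence, strand): '+' as given, '-' if reverse-complemented, '?' if unclear."""
    if looks_like_cds(dna):
        return dna, "+"
    flipped = reverse_complement(dna)
    if looks_like_cds(flipped):
        return flipped, "-"
    return dna, "?"


# Small in-memory example: one forward record, one reverse-strand record.
for seq in ("ATGAAATAA", "TTATTTCAT"):
    print(seq, orient_cds(seq))
```

Records that come back with `?` would need a closer look (for instance with a gene-prediction tool) rather than a blind frame-1 translation.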

Indeed. I was thinking about switching from MEDpipe to inmembrane. From what I understand, inmembrane uses the same SurfG+ software and other dependencies, has a gram-negative protocol, and apparently works with a DNA multiFASTA file. Both tools generate a .csv file with similar output.

written 13 months ago by Alec Watanabe