Question

Preventing EMBOSS transeq from renaming sequences

2

9.6 years ago

Eric Normandeau 11k

I'm using the EMBOSS transeq tool to translate the first ORF of 26,000 sequences. Since the tool is pretty slow (~1 minute per short sequence), I parallelized the work by splitting the fasta file into smaller files (down to one sequence per file) and running transeq on each file. However, the sequences are renamed in the following format: EMBOSS_001_1.

How could I prevent transeq from renaming these sequences?

If nothing else works, I'll create a script that manages the translation of each individual sequence and makes sure to rename it after it has been translated.

emboss translation transeq • 7.5k views

ADD COMMENT • link updated 2.3 years ago by Ram 43k • written 9.6 years ago by Eric Normandeau 11k

0

From your code I am guessing that the issue is that you don't want the reading frame suffix (e.g. "_1") to be added to the sequence identifier by EMBOSS transeq?

Assuming that that is the case, it might be worth asking if there is a way to suppress the suffix addition on the EMBOSS mailing lists (see http://emboss.open-bio.org/html/use/ch03s04.html), so the developers can have a look.

ADD REPLY • link updated 2.3 years ago by Ram 43k • written 9.6 years ago by hpmcwill ★ 1.2k

0

Sorry, that is not what I am looking for. My sequences have names and those are completely removed and replaced by EMBOSS_001_1. I would like to retain the original names.

ADD REPLY • link 9.6 years ago by Eric Normandeau 11k

0

Is it the FASTA sequence ID that is being renamed to EMBOSS_001_1? I have not seen that happen before. For example, my sequence:

>seq1

becomes

>seq1_1

What do your IDs look like?

ADD REPLY • link 9.6 years ago by Neilfws 49k

0

My sequences were named after the info from annotating them with Maker2. Here are some examples:

>maker-scaffold2802|size94912-snap-gene-0.6-mRNA-1
>maker-scaffold28042|size8541-snap-gene-0.3-mRNA-1
>maker-scaffold28049|size7796-snap-gene-0.2-mRNA-1
>maker-scaffold2804|size94792-snap-gene-0.13-mRNA-1

ADD REPLY • link 9.6 years ago by Eric Normandeau 11k

0

That's interesting; on my machine, using those IDs causes transeq to hang with a high CPU load. Removing the pipe symbol from the ID fixes that issue. So I guess the pipes are the problem and maybe you have a newer version of EMBOSS which deals with this by renaming.

ADD REPLY • link 9.6 years ago by Neilfws 49k

1

The presence of the '|' triggers identifier parsing to extract the database name, accession, entry name, etc. In the EMBOSS 6.6.0 release there was a bug that meant this parsing behaved strangely when using a two field pipe separated identifier (two fields using colon separation is fine as is three or more pipe separated fields). This was fixed after release, and should be available in the post release patches (see ftp://emboss.open-bio.org/pub/EMBOSS/fixes/).
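
To check whether a file contains identifiers of the affected shape, something like the following could be used. This is only a sketch: the function name two_field_ids is made up for illustration, and it assumes the identifier is the first whitespace-delimited token of each header line.

```shell
# List FASTA identifiers made of exactly two pipe-separated fields (e.g. db|seqId),
# the shape that the EMBOSS 6.6.0 identifier-parsing bug is reported to affect.
# Usage: two_field_ids FILE
# (two_field_ids is a hypothetical helper name, not part of EMBOSS.)
two_field_ids() {
    awk '/^>/ {
        id = substr($1, 2)              # first token of the header, without ">"
        pipes = gsub(/[|]/, "|", id)    # gsub returns how many pipes it replaced
        if (pipes == 1) print id        # one pipe means exactly two fields
    }' "$1"
}
```

Running two_field_ids on the input file prints one identifier per line; empty output suggests that no header has the two-field form.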

ADD REPLY • link updated 2.3 years ago by Ram 43k • written 9.6 years ago by hpmcwill ★ 1.2k

0

In that case, since you are using fasta formatted input sequences and want to preserve the identifiers as provided in the headers, you will need to use the 'pearson' format explicitly rather than using 'fasta' or format auto-detection. The 'pearson' format treats the identifier as the first non-whitespace token on the header line, and does not attempt to parse structured identifiers. So add '-sformat pearson' to your command-line, and your output should have the expected identifiers with the addition of the '_1' frame suffix.
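
One way to confirm that the names survived the translation is sketched below. It assumes a single frame was requested, that the output keeps the input order, and that the suffix is "_" followed by the frame number as described above. The helper names (expected_ids, actual_ids, names_retained) are hypothetical.

```shell
# Print the identifiers expected in the transeq output for a single frame:
# the first whitespace-delimited token of each input header plus "_FRAME".
# Usage: expected_ids INPUT_FASTA FRAME
expected_ids() {
    awk -v frame="$2" '/^>/ { print substr($1, 2) "_" frame }' "$1"
}

# Print the identifiers actually present in a translated fasta file.
# Usage: actual_ids TRANSLATED_FASTA
actual_ids() {
    awk '/^>/ { print substr($1, 2) }' "$1"
}

# Succeed only if the translated file kept the original names (same order assumed).
# Usage: names_retained INPUT_FASTA TRANSLATED_FASTA FRAME
names_retained() {
    [ "$(expected_ids "$1" "$3")" = "$(actual_ids "$2")" ]
}
```

For example, `names_retained in.fasta out.trans 1 && echo "names kept"` prints the message only when every output identifier matches its input identifier plus the frame suffix.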

ADD REPLY • link 9.6 years ago by hpmcwill ★ 1.2k

0

Hi hpmcwill,

Could you please post this as an answer so that I can accept it? Thanks to you and Neilfws for helping find the bug. It both makes the process 1000 times faster (I was wondering how it could take one minute to translate a short sequence...) and retains the sequence name.

ADD REPLY • link 9.6 years ago by Eric Normandeau 11k

0

Consider it done :-)

ADD REPLY • link 9.6 years ago by hpmcwill ★ 1.2k

1

9.6 years ago

Eric Normandeau 11k

I'm still interested to see a proper way to do it, but here is the bash script I have implemented to do it:

#!/bin/bash
# Translate one sequence in a fasta file with transeq and rename the translated result
#
# Usage:
#   ./translate_and_rename.sh INPUTFILE FRAME

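INPUTFILE="$1"
FRAME="$2"
# Derive the output names with parameter expansion, so that paths with spaces work
# and the temp file can never have the same name as the input file
TEMPFILE="${INPUTFILE%.fasta}.temp"
OUTPUTFILE="${INPUTFILE%.fasta}.trans"
# Keep the original header line (including the leading ">")
SEQNAME=$(head -n 1 "$INPUTFILE")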

# Translate sequence
transeq -sequence "$INPUTFILE" -outseq "$TEMPFILE" -frame "$FRAME"

# Rename sequence
perl -sape 's/^>.*$/$o/' -- -o="$SEQNAME" "$TEMPFILE" > "$OUTPUTFILE"

# Remove temp file
rm "$TEMPFILE"

I then launch it with parallel:

find . -name "*.fasta" | parallel ../01_scripts/translate_and_rename.sh {} 1

ADD COMMENT • link updated 2.3 years ago by Ram 43k • written 9.6 years ago by Eric Normandeau 11k

0

6.5 years ago

peteladrien • 0

I recently came across this exact problem and couldn't find a solution (why isn't the -sformat option displayed in the help?), so I created an equivalent program in Go. It's available here: gotranseq

It works exactly like EMBOSS. To run it, use:

gotranseq --sequence inputfile.fna --outseq out.faa --frame 6

ADD COMMENT • link 6.5 years ago by peteladrien • 0

Ram · Accepted Answer · 2014-09-09

Assuming that your input sequences are in fasta sequence format, one option which might help is to disable the parsing of the sequence identifier. By default EMBOSS attempts to identify the appropriate sequence format from the initial lines of the input. If the input is identified as being in a fasta sequence format variant (i.e. 'fasta' format), then identifier parsing will be used to extract information commonly encoded in structured identifiers, such as the database name, accession and entry name. For a list of possible fasta sequence format variants supported by EMBOSS see the "EMBOSS Users Guide" appendix "A.1. Supported Sequence Formats". The additional processing can be disabled by specifying the generic 'pearson' format for the input, which uses the first non-whitespace token on the header line as the sequence identifier. The input sequence format is specified using the -sformat option, so your command-line will look something like:

transeq -frame 1 -sformat pearson -sequence inSeqFile.fa -outseq outSeqFile.fa

Please note that the EMBOSS 6.6.0 release contains a bug which makes the parsing of sequence identifiers consisting of two pipe separated fields (e.g. db|seqId) very slow, which might explain the poor performance you are seeing. A fix for this problem should be available in the post release patches (see ftp://emboss.open-bio.org/pub/EMBOSS/fixes/), and the issue did not appear in the previous release (EMBOSS 6.5.7).
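
If the input has already been split into one file per sequence, the same options can be applied to each file in turn. The following is a minimal sketch, not a tested pipeline: the function name translate_dir and the .trans output extension are assumptions made for illustration, and it uses only the transeq options shown in the command above.

```shell
# Translate every *.fasta file in a directory with identifier parsing disabled.
# Writes NAME.trans next to each NAME.fasta.
# Usage: translate_dir DIRECTORY FRAME
# (translate_dir and the .trans extension are illustrative choices.)
translate_dir() {
    for fasta in "$1"/*.fasta; do
        [ -e "$fasta" ] || continue     # skip the unexpanded pattern if nothing matches
        transeq -frame "$2" -sformat pearson \
            -sequence "$fasta" -outseq "${fasta%.fasta}.trans"
    done
}
```

A call such as `translate_dir ./split_sequences 1` would translate frame 1 of every file in that directory. A plain loop is used here on the assumption that, once identifier parsing is bypassed, each translation is fast enough that running the files in parallel is no longer needed.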