Question

Need Help: output of Mafft aligner

4.9 years ago

shiv ▴ 10

Hi,

I am using the MAFFT aligner for multiple sequence alignment from the command line. I am writing the output in ClustalW format, but my sequence identifiers are longer than 14-15 characters (and I have to keep them as they are), and MAFFT keeps only the first 14 characters of each ID. I want the complete identifiers in the output file. Is there an option to get the full ID in MAFFT output, or something I missed in the tutorial?

Thanks in advance!

alignment • 3.2k views

Use a different output format. A number of formats have a hard limit on ID length. Clustal format is similar in this respect to strict PHYLIP, which limits names to 10 characters.

I’d advise switching to aligned FASTA, which is MAFFT's default output. Almost every tool can accept it.

Reply, 4.9 years ago by Joe 21k

Hi, thanks for the reply!

I need the output in Clustal format only. What should I do?

Reply, 4.9 years ago by shiv ▴ 10

Can you not simply edit the output file, which should be plain text? You could use a regex to alter the sequence identifiers with, for example, sed. Anchor the pattern to the beginning of the line (^) so that it only matches the identifier column.

Reply, 4.9 years ago by Kevin Blighe 88k
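
A rough sketch of the anchored sed idea, in case it helps. The function name `restore_id` and the example IDs are made up for illustration, and it assumes the IDs contain no characters that are special to sed (such as `/` or `.`). Note that it only swaps the name; it does not re-pad the columns, so the sequence on that row will shift.

```shell
# Hypothetical helper: replace a truncated ID at the start of each line.
# $1 = truncated ID, $2 = full ID, $3 = Clustal file (result goes to stdout)
restore_id() {
    # ^ anchors the match to the start of the line; the captured whitespace
    # ensures a full-token match, so sequence text is never touched
    sed "s/^$1\([[:space:]]\)/$2\1/" "$3"
}
```

Something like `restore_id sample_long_ide sample_long_identifier_01 aligned.clw > fixed.clw` would then write the edited alignment to a new file (file names are placeholders).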

Hi, thanks for your suggestion. I used ClustalO and it solved my issue.

Thank you so much!

Reply, 4.9 years ago by shiv ▴ 10

score 2 · Answer 1 · 2019-07-06

What tool are you using that is that restrictive? The only thing you can really do in this case is create a set of new identifiers and a mapping between new and old, then use a text replacement approach.

e.g.

$ cat mapfile.csv

Old_long_Identifier_Alpha,IDAlpha
Old_long_Identifier_Beta,IDBeta
....

Which mappings you use obviously depends on how uniquely identifiable your headers are to start with.

You could achieve this with something like:

while IFS=, read -r old new ; do sed -i.bak "s/${old}/${new}/g" clustalfile.clw ; done < mapfile.csv

Be aware that you will also need to pad the whitespace after each identifier so that the sequence columns stay aligned at their original positions. If you can avoid the need for a Clustal file, though, I strongly advise you to take a different approach. Manually editing strict formats can be very tedious and error-prone.

Otherwise, consider using something like ClustalO (Clustal Omega, the newer version of Clustal), which outputs Clustal files without the need for short IDs.
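
For the padding caveat above, here is a minimal awk sketch that swaps the IDs and re-pads the name column in one go. The function name `remap_clustal_ids` is hypothetical, and it assumes a non-empty two-column CSV (old,new), a header line starting with `CLUSTAL`, and a name column of equal width on every sequence row. The new column width (longest ID plus one space) is my own choice, not something the format mandates.

```shell
# Hypothetical helper: swap IDs via a CSV map and keep columns aligned.
# $1 = map CSV (old,new), $2 = Clustal file (result goes to stdout)
remap_clustal_ids() {
    awk '
        FNR == 1 { nfile++ }                 # count which input we are reading
        nfile == 1 {                         # input 1: build old -> new lookup
            split($0, kv, ",")
            if (kv[1] != "") map[kv[1]] = kv[2]
            next
        }
        nfile == 2 {                         # input 2: measure column widths
            if ($0 !~ /^CLUSTAL/ && match($0, /^[^ \t]+[ \t]+/)) {
                oldw = RLENGTH               # old width of ID plus padding
                id = ($1 in map) ? map[$1] : $1
                if (length(id) > maxlen) maxlen = length(id)
            }
            next
        }
        # input 3: the alignment again, now rewritten line by line
        /^CLUSTAL/ || /^[ \t]*$/ { print; next }   # header and blank lines
        /^[ \t]/ {                           # conservation line: re-indent
            printf "%s%s\n", sprintf("%" (maxlen + 1) "s", ""), substr($0, oldw + 1)
            next
        }
        {                                    # sequence line: swap ID, re-pad
            match($0, /^[^ \t]+[ \t]+/)
            id = ($1 in map) ? map[$1] : $1
            fmt = "%-" (maxlen + 1) "s%s\n"
            printf fmt, id, substr($0, RLENGTH + 1)
        }
    ' "$1" "$2" "$2"
}
```

Usage would be along the lines of `remap_clustal_ids mapfile.csv clustalfile.clw > remapped.clw`. Because the first field is compared as a whole, this avoids the partial-match problem a plain global sed substitution can have when one ID is a prefix of another.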