Question

Making seq chimeras and concatenation in protein MSAs

6.5 years ago

roussine ▴ 10

Hello folks,

Can you think of any existing tool that takes a set of sequences from a protein FASTA file that ARE already ALIGNED and simply merges them into a sort of chimera/concatenate? The pieces might not overlap, and mismatches would produce X's or something similar, as in this simple example:

>..       AAAAA---ABA------BAAAA
>..       ---AAAA-------BBBAB---
>output   AAAAAAA-ABA---BBBXXAAA

The need seems obvious, but all I ever find among the available tools is "concatenate only non-overlapping pieces", "restrict to the longest containing sequence", etc. The sequences are protein, and the tool needs to be suitable for piping, so no GUI is needed. Can anyone suggest something? Thanks in advance.

alignment concatenation consensus consambig emboss • 1.5k views

score 1 · Answer 1 · 2017-11-01

Hi,

Convert to FASTA and use consambig (an EMBOSS tool).

Example: consambig test test_out

Input: test

>1
AAAAA---ABA------BAAAA
>2
---AAAA-------BBBAB---

Output: test_out

>EMBOSS_001
aaaAAaanabannnbbbNNaaa

Use exact FASTA format, with each header on its own line.

Note that 'N' is written instead of 'X', and "-" is replaced with "n".

You can find more here: http://emboss.sourceforge.net/apps/cvs/emboss/apps/consambig.html
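If the gaps have to stay as "-" and mismatches have to become "X", exactly as in the example from the question, a few lines of Python may be enough. This is only a minimal sketch, not an existing tool: the names merge_aligned and read_fasta are made up for illustration, and it assumes that all sequences have the same aligned length and that "-" is the only gap character.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_aligned(seqs, gap="-", ambiguous="X"):
    """Merge equal-length aligned sequences column by column.

    A column with only gaps stays a gap, a column with one distinct
    residue keeps that residue, and a column with conflicting residues
    becomes the ambiguity character.
    """
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("all sequences must have the same aligned length")
    merged = []
    for column in zip(*seqs):
        residues = {c.upper() for c in column if c != gap}
        if not residues:
            merged.append(gap)
        elif len(residues) == 1:
            merged.append(residues.pop())
        else:
            merged.append(ambiguous)
    return "".join(merged)


example = """>1
AAAAA---ABA------BAAAA
>2
---AAAA-------BBBAB---
"""
records = list(read_fasta(example.splitlines()))
print(merge_aligned([seq for _, seq in records]))
# prints AAAAAAA-ABA---BBBXXAAA
```

For use in a pipe, the same two functions could be fed from sys.stdin instead of the example string, with the merged sequence written to sys.stdout.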

score 0 · Answer 2 · 2017-11-02

Thank you, a gorgeous and simple tool; I had not come across it before. May I ask a follow-up question? It uses the IUPAC table to make substitutions for the consensus, and the system table is:

# IUB codes for proteins
# Substitution is for OR'd A=1, C=2, D=4 etc.
#
A   1         A                      alanine
B   2052      DN                     aspartate/asparagine
C   2         C                      cysteine
D   4         D                      aspartate
E   8         E                      glutamate
F   16        F                      phenylalanine
G   32        G                      glycine
H   64        H                      histidine
I   128       I                      isoleucine
J   640       IL                     leucine/isoleucine
K   256       K                      lysine
L   512       L                      leucine
M   1024      M                      methionine
N   2048      N                      asparagine
O   2097152   O                      pyrrolysine
P   4096      P                      proline
Q   8192      Q                      glutamine
R   16384     R                      arginine
S   32768     S                      serine
T   65536     T                      threonine
U   1048576   U                      selenocysteine
V   131072    V                      valine
W   262144    W                      tryptophan
X   1048575   ACDEFGHIKLMNPQRSTVWY   unknown
Y   524288    Y                      tyrosine
Z   8200      EQ                     glutamate/glutamine
-   0         -                      gap

What does the second column mean? Frequencies, penalties? Why are the values all divisible by 2? I did not find this in the EMBOSS docs. Now, if I want to preserve gaps in the consensus and try to introduce a new state, "gap" (the last line of the list, my addition), it just produces meaningless output, whichever values I try. Can you suggest anything? The program does the job anyway, thank you.
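A note on the second column, going only by the header comment in the table ("OR'd A=1, C=2, D=4 etc."): the values look like bit flags rather than frequencies or penalties. Each standard residue gets its own power of two, and an ambiguity code is the bitwise OR of the residues it covers. The sketch below checks that reading against the numbers in the table; the names FLAGS, AMBIGUOUS, decode and combine are made up for illustration, and nothing here is a statement about how consambig works internally.

```python
# Single-residue flags, copied from the second column of the table above.
FLAGS = {
    "A": 1, "C": 2, "D": 4, "E": 8, "F": 16, "G": 32, "H": 64,
    "I": 128, "K": 256, "L": 512, "M": 1024, "N": 2048, "P": 4096,
    "Q": 8192, "R": 16384, "S": 32768, "T": 65536, "V": 131072,
    "W": 262144, "Y": 524288,
}

# Ambiguity codes and their values, also copied from the table.
AMBIGUOUS = {"B": 2052, "J": 640, "Z": 8200, "X": 1048575}


def combine(residues):
    """OR together the flags of the given residues."""
    mask = 0
    for residue in residues:
        mask |= FLAGS[residue]
    return mask


def decode(mask):
    """List the residues whose flag bit is set in the mask."""
    return "".join(aa for aa, bit in sorted(FLAGS.items()) if mask & bit)


for code, value in AMBIGUOUS.items():
    print(code, value, decode(value))

# The third column of the table can be rebuilt from the second one.
assert combine("DN") == AMBIGUOUS["B"]
assert combine("IL") == AMBIGUOUS["J"]
assert combine("EQ") == AMBIGUOUS["Z"]
assert decode(AMBIGUOUS["X"]) == "ACDEFGHIKLMNPQRSTVWY"
```

Under this reading, the gap row's value of 0 sets no bits, so OR-ing it with any residue leaves that residue's value unchanged. That would be consistent with gaps not surviving in the consensus, though whether it explains the output seen after editing the table is an assumption.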