Making seq chimeras and concatenation in protein MSAs
6.5 years ago
roussine ▴ 10

Hello folks,

Can you think of an existing tool that takes a set of sequences in a protein FASTA file that are already aligned and simply merges them into a kind of chimera/concatenation? The pieces might not overlap, and mismatches would produce X's or something similar, as in this simple example:

>..       AAAAA---ABA------BAAAA
>..       ---AAAA-------BBBAB---
>output   AAAAAAA-ABA---BBBXXAAA

The need seems obvious, but the tools I come across only offer options such as "concatenate only non-overlapping pieces" or "restrict to the longest containing sequence". The sequences are protein, and the tool needs to be suitable for piping, so no GUI is needed. Can anyone suggest something? Thanks in advance.
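For illustration, the merge rule described above can be sketched in a few lines of Python. This is not an existing tool, only a minimal sketch under assumptions: the function name `merge_aligned` is made up here, all input sequences are assumed to have the same aligned length, `-` is assumed to be the only gap character, and any column with two or more different residues becomes `X`.

```python
from typing import Iterable


def merge_aligned(seqs: Iterable[str], gap: str = "-", mismatch: str = "X") -> str:
    """Merge pre-aligned sequences column by column (hypothetical helper).

    Per column: all gaps -> gap, one distinct residue -> that residue,
    two or more distinct residues -> the mismatch character.
    """
    seqs = list(seqs)
    if not seqs:
        raise ValueError("need at least one sequence")
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("aligned sequences must all have the same length")

    merged = []
    for column in zip(*seqs):
        # Collect the distinct non-gap residues seen in this column.
        residues = {c.upper() for c in column if c != gap}
        if not residues:
            merged.append(gap)
        elif len(residues) == 1:
            merged.append(residues.pop())
        else:
            merged.append(mismatch)
    return "".join(merged)


# The two sequences from the example in the question.
print(merge_aligned(["AAAAA---ABA------BAAAA",
                     "---AAAA-------BBBAB---"]))
# prints AAAAAAA-ABA---BBBXXAAA
```

Reading the FASTA records from standard input and writing the merged record to standard output would make this usable in a pipe; that part is left out to keep the sketch short.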

alignment concatenation consensus consambig emboss • 1.5k views
6.5 years ago

Hi,

Convert to fasta and use consambig (EMBOSS tool).

Example: consambig test test_out

Input: test

>1
AAAAA---ABA------BAAAA
>2
---AAAA-------BBBAB---

Output: test_out

>EMBOSS_001
aaaAAaanabannnbbbNNaaa

Note that 'N' is written instead of 'X', and "-" is replaced with "n".

You can find more here: http://emboss.sourceforge.net/apps/cvs/emboss/apps/consambig.html

6.5 years ago
roussine ▴ 10

Thank you, a gorgeous and simple tool; I had not come across it before. May I ask a follow-up question? It uses the IUPAC table to make substitutions for the consensus, and the system table is:

# IUB codes for proteins
# Substitution is for OR'd A=1, C=2, D=4 etc.
#
A   1   A   alanine
B   2052    DN  aspartate/asparagine
C   2   C   cysteine
D   4   D   aspartate
E   8   E   glutamate
F   16  F   phenylalanine
G   32  G   glycine
H   64  H   histidine
I   128 I   isoleucine
J   640 IL  leucine/isoleucine
K   256 K   lysine
L   512 L   leucine
M   1024    M   methionine
N   2048    N   asparagine
O   2097152 O   pyrrolysine
P   4096    P   proline
Q   8192    Q   glutamine
R   16384   R   arginine
S   32768   S   serine
T   65536   T   threonine
U   1048576 U   selenocysteine
V   131072  V   valine
W   262144  W   tryptophan
X   1048575 ACDEFGHIKLMNPQRSTVWY    unknown
Y   524288  Y   tyrosine
Z   8200    EQ  glutamate/glutamine
-   0   -   gap

What does the second column mean? Frequencies, penalties? Why are the values all multiples of two? I did not find this in the EMBOSS docs. Now, if I want to preserve gaps in the consensus and try to introduce a new "gap" state (the last line of the list, my addition), it just produces meaningless output, whichever values I try. Can you suggest anything? The program does the job anyway, thank you.
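One possible reading of the second column, based only on the table's own comment ("Substitution is for OR'd A=1, C=2, D=4 etc."), is that each of the 20 standard residues gets its own bit and each ambiguity code is the bitwise OR of its members. The sketch below just checks that arithmetic against the values in the table; the names `FLAG` and `combine` are made up for illustration, and how consambig uses the table internally is not confirmed here.

```python
# Assumption: bits are assigned in the alphabetical order shown in the table.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
FLAG = {aa: 1 << i for i, aa in enumerate(AMINO_ACIDS)}  # A=1, C=2, D=4, ...


def combine(residues: str) -> int:
    """Bitwise OR of the flags of the given residues (hypothetical helper)."""
    value = 0
    for aa in residues:
        value |= FLAG[aa]
    return value


print(combine("DN"))         # 2052    -> B in the table
print(combine("IL"))         # 640     -> J in the table
print(combine("EQ"))         # 8200    -> Z in the table
print(combine(AMINO_ACIDS))  # 1048575 -> X in the table
```

If this reading is right, a state with value 0 contributes no bit when OR'd with anything, so it could not be told apart from "nothing seen"; that may be related to the trouble with the added gap line, but it is only a guess.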


Hi,

Sorry, I did not test it or go into its details. If you want "-" instead of "n", simply replace it. Good that it worked for you.
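A blind replacement of "n" could be risky for protein data, where a lowercase "n" might also stand for asparagine. One alternative, sketched here under assumptions, is to put gaps back by looking at the original alignment: wherever every input sequence has a gap, the consensus gets a gap too, whatever character the tool wrote there. The function name `restore_gaps` is hypothetical, and the consensus is assumed to have the same length as the aligned inputs.

```python
from typing import Sequence


def restore_gaps(consensus: str, aligned_seqs: Sequence[str], gap: str = "-") -> str:
    """Put gaps back into a consensus at all-gap alignment columns (sketch)."""
    if not aligned_seqs:
        raise ValueError("need at least one aligned sequence")
    if any(len(s) != len(consensus) for s in aligned_seqs):
        raise ValueError("consensus and aligned sequences must have equal length")

    restored = []
    for cons_char, column in zip(consensus, zip(*aligned_seqs)):
        # Overwrite the consensus character only where no input has a residue.
        restored.append(gap if all(c == gap for c in column) else cons_char)
    return "".join(restored)


alignment = ["AAAAA---ABA------BAAAA",
             "---AAAA-------BBBAB---"]
print(restore_gaps("aaaAAaanabannnbbbNNaaa", alignment))
# prints aaaAAaa-aba---bbbNNaaa
```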


Hi Puli, thanks for the reply. To begin with, the program outputs a binary representation (\00) of a character (-) that it does not find in the conversion table. I can convert the binary output into text, but a neater way would be to introduce a new character in the table, so that the consensus contains the original "-" (there are thousands of files). When I tested this, it did not work. Please let me know if you think the table cannot accept a new character state. Thanks again!
