Making seq chimeras and concatenation in protein MSAs
5.1 years ago
roussine ▴ 10

Hello folks,

Can you think of an existing tool that takes a set of protein sequences in FASTA format that are ALREADY ALIGNED and merges them into a single chimera/concatenate? The pieces may or may not overlap, and mismatches should produce X's or something similar, as in this simple example.

>..       AAAAA---ABA------BAAAA
>..       ---AAAA-------BBBAB---
>output   AAAAAAA-ABA---BBBXXAAA


The need seems obvious, but the tools I have come across only "concatenate non-overlapping pieces", "restrict to the longest containing sequence", etc. The sequences are protein, and the tool needs to be suitable for piping, so no GUI is needed. Can anyone suggest something? Thanks in advance.

alignment concatenation consensus consambig emboss • 1.2k views
5.1 years ago

Hi,

Convert the alignment to FASTA and use consambig (an EMBOSS tool).

Example: consambig test test_out

Input: test

>1
AAAAA---ABA------BAAAA
>2
---AAAA-------BBBAB---

Output: test_out

>EMBOSS_001
aaaAAaanabannnbbbNNaaa

Use exact FASTA format for the input.

Instead of 'X', 'N' is written and "-" is replaced with "n".

You can find more here: http://emboss.sourceforge.net/apps/cvs/emboss/apps/consambig.html

5.1 years ago
roussine ▴ 10

Thank you, a gorgeous and simple tool; I had not come across it before. May I ask a follow-up question? It uses the IUPAC table to make substitutions for the consensus, and the system table is:

# IUB codes for proteins
# Substitution is for OR'd A=1, C=2, D=4 etc.
#
A   1   A   alanine
B   2052    DN  aspartate/asparagine
C   2   C   cysteine
D   4   D   aspartate
E   8   E   glutamate
F   16  F   phenylalanine
G   32  G   glycine
H   64  H   histidine
I   128 I   isoleucine
J   640 IL  leucine/isoleucine
K   256 K   lysine
L   512 L   leucine
M   1024    M   methionine
N   2048    N   asparagine
O   2097152 O   pyrrolysine
P   4096    P   proline
Q   8192    Q   glutamine
R   16384   R   arginine
S   32768   S   serine
T   65536   T   threonine
U   1048576 U   selenocysteine
V   131072  V   valine
W   262144  W   tryptophan
X   1048575 ACDEFGHIKLMNPQRSTVWY    unknown
Y   524288  Y   tyrosine
Z   8200    EQ  glutamate/glutamine
-   0   -   gap


What does the second column mean? Frequencies, penalties? Why are the values multiples of 2? I did not find this in the EMBOSS docs. Now, if I want to preserve gaps in the consensus and introduce a new state "gap" (the last line of the list, my addition), it just produces meaningless output, whichever values I try. Can you suggest anything? The program does the job anyway, thank you.
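
Going by the header comment in the table itself ("Substitution is for OR'd A=1, C=2, D=4 etc."), the second column looks like a bitmask rather than a frequency or penalty: each plain residue gets its own power of two, and an ambiguity code is the bitwise OR of its members. A short Python check of that reading, using only values copied from the table above (the dictionary name `BITS` is just for illustration):

```python
# Bit values copied from the table above (subset only)
BITS = {"D": 4, "N": 2048, "I": 128, "L": 512, "E": 8, "Q": 8192}

b_code = BITS["D"] | BITS["N"]  # B = aspartate or asparagine
j_code = BITS["I"] | BITS["L"]  # J = isoleucine or leucine
z_code = BITS["E"] | BITS["Q"]  # Z = glutamate or glutamine

# X covers the 20 standard residues: bits 0..19 all set
x_code = (1 << 20) - 1

print(b_code, j_code, z_code, x_code)  # 2052 640 8200 1048575
```

These match the B, J, Z and X rows of the table. If that reading is right, a value of 0 sets no bit at all, so OR-ing it with any residue just returns that residue.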


Hi,

Sorry, I did not test it or go into its details. If you want "-" instead of "n", simply replace it. Good that it worked for you.


Hi Puli, thanks for the reply. To begin with, the program outputs a binary representation (\00) for a character (-) that it does not find in the conversion table. I can convert that binary output back into text, but a neater way would be to introduce a new character in the table, so that the consensus keeps the original "-" (there are thousands of files). When I tested this, it did not work. Please let me know if you think the table cannot accept a new character state. Thanks again!
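
As a stopgap, the NUL bytes can be turned back into gaps in a pipe. A minimal Python sketch, assuming the unrecognised "-" really is written as a single `\x00` byte per gap column as described above; the names `nul_to_gap` and `main` are made up for illustration:

```python
import sys


def nul_to_gap(data: bytes, gap: bytes = b"-") -> bytes:
    """Replace every NUL byte with the gap character."""
    return data.replace(b"\x00", gap)


def main() -> None:
    # Binary streams are used so the NUL bytes are read unchanged
    sys.stdout.buffer.write(nul_to_gap(sys.stdin.buffer.read()))
```

With a call to `main()` added at the end of a script, the consensus file can be fed in on stdin and the cleaned text collected from stdout, which makes it easy to loop over many files.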
