Question

How Can I Profile A Multiple Alignment Result To Get A Logo Sequence To Represent All Aligned Sequences?

5

12.8 years ago

Ct586 ▴ 620

I have several similar protein sequences, like

>a
GKGGGIGGGIGKGGG
>b
GKGGGIGGGIGKGGGIGGG
>c
GKGGGIGGGIGKGGGIGGGI
>d
GKGGGVGGGIGKGGG

Then I aligned them using Clustal and got results like

CLUSTAL 2.1 multiple sequence alignment


b               GKGGGIGGGIGKGGGIGGG-
c               GKGGGIGGGIGKGGGIGGGI
a               GKGGGIGGGIGKGGG-----
d               GKGGGVGGGIGKGGG-----
                *****:*********

I wonder how I can use one sequence to represent these four sequences with no, or the least possible, loss of information.

Before, I tried HMMER, which uses a hidden Markov model to profile sequences. The results are contained in a matrix model.

And when I wrote the title, the Biostar system recommended the question "Score protein variants based on frequency of AA in multiple sequence alignment", whose solution is similar to HMMER.

WebLogo can also give me a picture showing the motif, but I think it causes loss of information, and a picture is not suitable for batch processing.

In the paper "BH3-only proteins in apoptosis and beyond: an overview", I saw the picture below.

[Figure: consensus display of BH3 motifs from the paper]

It uses special characters to represent similar amino acids.

Until I find a more suitable representation, I think this kind of result is what I want.

So will you guys recommend some tools to solve this?

Thank you!

motif sequence • 4.6k views

ADD COMMENT • link updated 12.8 years ago by Lyco ★ 2.3k • written 12.8 years ago by Ct586 ▴ 620

0

I like the representation of motifs as regular expressions: http://elm.eu.org/help.html#nomenclature (but don't know of an automated conversion tool). I find special characters to be very annoying as a reader of a paper because I have to go back and forth between the legend and the sequence.
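
As a rough illustration of what an automated conversion could look like, here is a minimal sketch that turns alignment columns into a PROSITE/ELM-style regular expression. The function name `alignment_to_regex` and the rule "a column containing a gap becomes optional" are assumptions made for this example, not an existing tool. The input is the Clustal alignment from the question.

```python
import re

def alignment_to_regex(aligned):
    """Build a simple regular expression from equal-length aligned sequences.

    Assumption for this sketch: every column becomes one token, and a column
    that contains a gap ('-') in any sequence is made optional with '?'.
    """
    parts = []
    for column in zip(*aligned):
        residues = sorted(set(column) - {"-"})
        if not residues:
            continue  # column consists of gaps only
        # One residue -> literal; several residues -> character class.
        token = residues[0] if len(residues) == 1 else "[" + "".join(residues) + "]"
        if "-" in column:
            token += "?"
        parts.append(token)
    return "".join(parts)

# The four aligned sequences from the question (order as in the Clustal output).
alignment = [
    "GKGGGIGGGIGKGGGIGGG-",
    "GKGGGIGGGIGKGGGIGGGI",
    "GKGGGIGGGIGKGGG-----",
    "GKGGGVGGGIGKGGG-----",
]

pattern = alignment_to_regex(alignment)
print(pattern)

# Every input sequence (gaps removed) is matched by the generated pattern.
for seq in alignment:
    assert re.fullmatch(pattern, seq.replace("-", ""))
```

Because each gapped column is made optional independently, a pattern built this way also accepts strings that never occur in the alignment, so it is inclusive rather than strict.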

ADD REPLY • link updated 4.6 years ago by Ram 43k • written 12.8 years ago by Michael Kuhn 5.0k

0

Michael, regular expressions for motifs work only sometimes. They form the basis of motif databases such as PROSITE and ELM. However, many real motifs have small variations of the consensus that would violate a regular expression, unless you formulate it very inclusively to catch all instances, and that in turn means lots of false positives. Unless motifs have a very strict consensus requirement, they are difficult to treat with regular expressions.

ADD REPLY • link 12.8 years ago by Lyco ★ 2.3k

0

Thank you! I have thought about regular expressions; they work well sometimes. But it is hard to construct many regular expressions, and the information loss is also serious. I agree with what Lyco said. Thanks for your hint.

ADD REPLY • link 12.8 years ago by Ct586 ▴ 620

Ram · Answer 1 · 2011-07-11

4

12.8 years ago

Lyco ★ 2.3k

The kind of consensus display used in the BH3 paper is associated with a substantial loss of information, as the fancy Greek symbols only represent 'majority votes', neglecting minority observations. Moreover, the symbols are not standardized - the only special symbol most authors agree on is the uppercase Phi for hydrophobics.

In fact, sequence logos lose much less information than consensus sequences, and they can be generated in batch mode, too: there is a program called seqlogo which can be downloaded from the WebLogo pages. The major disadvantage is that the logos are images and cannot be used as simple text items. If this isn't a problem for you, I would recommend sequence logos over consensus sequence display.
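
To make the "information" in a logo concrete, here is a minimal sketch of the per-column information content (in bits) that determines the stack height in a sequence logo: log2(20) minus the Shannon entropy of the column. It is only an illustration under stated assumptions: gaps are ignored, the small-sample correction that logo programs usually apply is omitted, and the function name `column_information` is made up for this example.

```python
import math
from collections import Counter

def column_information(aligned, alphabet_size=20):
    """Information content per alignment column, in bits.

    Assumptions for this sketch: gap characters are ignored, and no
    small-sample correction is applied.
    """
    max_bits = math.log2(alphabet_size)
    info = []
    for column in zip(*aligned):
        residues = [r for r in column if r != "-"]
        if not residues:
            info.append(0.0)  # gap-only column carries no information
            continue
        total = len(residues)
        entropy = -sum(
            (n / total) * math.log2(n / total)
            for n in Counter(residues).values()
        )
        info.append(max_bits - entropy)
    return info

# The four aligned sequences from the question.
alignment = [
    "GKGGGIGGGIGKGGGIGGG-",
    "GKGGGIGGGIGKGGGIGGGI",
    "GKGGGIGGGIGKGGG-----",
    "GKGGGVGGGIGKGGG-----",
]

info = column_information(alignment)
for position, bits in enumerate(info, start=1):
    print(f"{position:>2}  {bits:.3f}")
```

Fully conserved columns reach the maximum of log2(20) bits, while the I/V column (position 6) scores lower because it is not fully conserved.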

The least information loss is suffered when using frequency tables, basically two-dimensional matrices showing which residue is observed how often at what position. On the downside, they are cumbersome to display in papers.
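
Such a frequency table takes only a few lines to compute. The following is a minimal sketch using the alignment from the question; the function name `frequency_table` is an assumption for this example, and gaps are counted as their own symbol so that nothing is discarded.

```python
from collections import Counter

def frequency_table(aligned):
    """Return one Counter per alignment column (gaps counted as '-')."""
    return [Counter(column) for column in zip(*aligned)]

# The four aligned sequences from the question.
alignment = [
    "GKGGGIGGGIGKGGGIGGG-",
    "GKGGGIGGGIGKGGGIGGGI",
    "GKGGGIGGGIGKGGG-----",
    "GKGGGVGGGIGKGGG-----",
]

table = frequency_table(alignment)

# Print one row per position, e.g. position 6 lists both I and V.
for position, counts in enumerate(table, start=1):
    row = ", ".join(f"{residue}:{n}" for residue, n in sorted(counts.items()))
    print(f"{position:>2}  {row}")
```

Unlike a consensus string, the table keeps the minority observation at position 6 (one V next to three I).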

By the way, don't trust the BH3 consensus provided in the paper. At least a third of the sequences shown are not genuine BH3 motifs.

ADD COMMENT • link 12.8 years ago by Lyco ★ 2.3k

1

Agreed. If you want to visually capture the information to show to an audience in a paper, a sequence logo retains the most information. If you want to actually DO something computationally, you're better off with a PSSM, a Markov model (HMMER), or similar.
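
For the computational route, here is a minimal sketch of a log-odds PSSM built from the gap-free part of the alignment in the question (the first 15 columns) and used to score candidate sequences. The uniform background frequency of 1/20, the pseudocount of 1, and the names `build_pssm` and `score` are assumptions for this example; a real search tool such as HMMER uses a more elaborate model.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned, pseudocount=1.0):
    """Log-odds (base 2) position-specific scoring matrix.

    Assumptions for this sketch: uniform background (1/20 per residue)
    and the same pseudocount added to every residue in every column.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*aligned):
        residues = [r for r in column if r in AMINO_ACIDS]
        total = len(residues) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((residues.count(aa) + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def score(pssm, sequence):
    """Sum of per-position log-odds scores for a sequence of PSSM length."""
    return sum(column[aa] for column, aa in zip(pssm, sequence))

# Gap-free core (first 15 columns) of the alignment from the question.
alignment = [
    "GKGGGIGGGIGKGGGIGGG-",
    "GKGGGIGGGIGKGGGIGGGI",
    "GKGGGIGGGIGKGGG-----",
    "GKGGGVGGGIGKGGG-----",
]
core = [seq[:15] for seq in alignment]

pssm = build_pssm(core)
for candidate in ("GKGGGIGGGIGKGGG", "GKGGGVGGGIGKGGG", "AAAAAAAAAAAAAAA"):
    print(candidate, round(score(pssm, candidate), 2))
```

The majority variant (I at position 6) scores highest, the minority variant (V) slightly lower, and an unrelated sequence scores negative, which is the graded behaviour a plain consensus string cannot give.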

ADD REPLY • link 12.8 years ago by DG 7.3k

0

Thank you! I think this is the best strategy here. I will use an HMM profile, which contains the most information, as the motif for the searching part, and WebLogo to represent the consensus visually.

ADD REPLY • link 12.8 years ago by Ct586 ▴ 620

0

Hello. My answer is maybe too late, but indeed HMMER + WebLogo is a good combo to (1) capture the information contained in a batch of related sequences and (2) represent the amino acid characteristics of these proteins. We did this last year in this paper: http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0009990

ADD REPLY • link updated 4.6 years ago by Ram 43k • written 12.8 years ago by Pierre Poulain ▴ 440