Question

Analysis Of A Conserved Motif Across A Phylum

12.1 years ago

Brian ▴ 30

Hello, I am a recent college grad doing my first bioinformatics work as part of a larger research project in my lab (not a bioinformatics-focused lab). I am trying to identify the conserved portions of a protein degradation domain that is covalently attached to proteins via tmRNA. The tag is encoded in the tmRNA by the normal RNA->AA rules, and I have been able to identify the tag sequence in most of the 65 sequenced genomes of the phylum of interest.

My main difficulty is how to interpret this data to conclude which positions in the protein tag are the most highly conserved. My first inclination would be to simply find the distribution of amino acids at each position in the tag, counted relative to the stop codon at the end of the tag. I see two major faults in this method, though.

1. It does not account for insertions/deletions, which can shift residues that are still conserved so that the method above no longer identifies them as such.
2. It is biased towards the motifs of organisms that have had multiple strains sequenced.
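For reference, the position-by-position count described above can be sketched in a few lines of Python. This is only a minimal illustration of the naive method, not a recommended analysis: the function name `c_terminal_distributions` and the example peptides are made up here, and the code shares both faults listed above (no indel handling, no correction for oversampled strains).

```python
from collections import Counter


def c_terminal_distributions(tags):
    """Residue frequencies at each position counted back from the C-terminus.

    Position 1 is the last residue before the stop codon, position 2 the one
    before it, and so on. Tags shorter than a given position are skipped there.
    """
    longest = max(len(tag) for tag in tags)
    distributions = []
    for offset in range(1, longest + 1):
        # tag[-offset] is the residue `offset` positions from the end.
        residues = [tag[-offset] for tag in tags if len(tag) >= offset]
        counts = Counter(residues)
        total = sum(counts.values())
        distributions.append({aa: n / total for aa, n in counts.items()})
    return distributions


# Toy peptides invented for illustration only.
example_tags = ["AKDEYALAA", "AKDEYALAA", "AKDQYALAA", "GKDYALAA"]
for position, dist in enumerate(c_terminal_distributions(example_tags), start=1):
    top_residue, top_freq = max(dist.items(), key=lambda item: item[1])
    print(f"-{position}: {top_residue} ({top_freq:.2f})")
```

In the toy input, the last peptide is one residue shorter than the others, so everything upstream of the deletion is compared against the wrong neighbours. That is fault 1 in miniature.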

So my question is: how do I identify conserved elements within a motif? Is there a standard method or program that is used for this?

My only current thought for an improvement is to address point 2 by weighting the sequences by how phylogenetically dissimilar they are to the others. A slightly more comprehensive graphic than the one in panel b of this figure (paper) is similar to what I am looking for. Thanks!
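One tree-free way to approximate this kind of weighting is the position-based scheme of Henikoff and Henikoff (1994), which down-weights sequences that look like many others in the alignment. The sketch below swaps that scheme in for a true phylogenetic weighting, so treat it as a stand-in rather than the method proposed above. It assumes the tags are already aligned to equal length, and the function name `henikoff_weights` is just an illustrative choice.

```python
from collections import Counter


def henikoff_weights(alignment):
    """Position-based sequence weights (Henikoff & Henikoff, 1994).

    `alignment` is a list of equal-length aligned strings. In each column a
    sequence earns 1 / (k * n), where k is the number of distinct characters
    in the column and n is how many sequences share this sequence's character.
    Gaps are treated as an ordinary character, which is a simplification.
    The returned weights are normalised to sum to 1.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    weights = [0.0] * len(alignment)
    for col in range(length):
        column = [seq[col] for seq in alignment]
        counts = Counter(column)
        k = len(counts)
        for i, residue in enumerate(column):
            weights[i] += 1.0 / (k * counts[residue])
    total = sum(weights)
    return [w / total for w in weights]


# Three identical "strains" plus one divergent sequence (toy data).
toy_alignment = ["AAA", "AAA", "AAA", "CCC"]
print(henikoff_weights(toy_alignment))
```

With the toy input, the three identical sequences together carry the same total weight as the single divergent one, which is the behaviour wanted for genomes with many sequenced strains.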

motif protein protein • 2.4k views

updated 12.1 years ago by SES 8.6k • written 12.1 years ago by Brian ▴ 30

score 2 · Answer 1 · 2012-03-21

Make a multiple sequence alignment to address point (1). You can address point (2) by treating things phylogenetically. Use aaml in PAML to find the optimal gamma-distributed rates-across-sites model for your alignment and a precomputed tree. Each column in the alignment will have a posterior rate given in the "rates" output file. This rate represents the relative rate of change at that site. Because it is estimated on the tree, it should not be biased by, for example, the inclusion of many closely related sequences.

score 1 · Answer 2 · 2012-03-21

Have you tried a multiple sequence alignment (MSA)? The most popular MSA programs are ClustalW, T-Coffee, MAFFT, and MUSCLE, and you can access them at the EBI site. There are others as well, for example ProbCons. If you really want to know which one best suits your needs, search PubMed.

The output of an MSA is exactly the graphic you want. You might need to do some preprocessing, though (e.g. extract just the tag sequence you are interested in, along with some bp upstream and downstream).

score 1 · Answer 3 · 2012-03-21

There are two methods called Phylogenetic Shadowing and Phylogenetic Footprinting that are commonly used for investigating conservation of elements across a phylogeny. There is a really good demonstration of both techniques in a Plant Cell paper where these methods were applied to identify the regulatory elements of floral genes in Arabidopsis.

To take a more genomics-oriented approach, this would be a good job for HMMER. The approach would be to build a profile HMM from your alignment and search the genomes of interest with it. Then, for each species, you could construct a sequence logo to identify conserved sites that may be functionally important. This is similar in spirit to the other methods. Note that conservation alone does not tell you anything about function. That Plant Cell paper is a fine example where they developed computational predictions and then systematically created knockouts and constructs to identify the role of the conserved motifs.
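For a rough idea of what a sequence logo summarises, the total letter height in each column is that column's information content in bits. The sketch below computes it directly from an alignment. It is a simplified illustration, not what HMMER or any logo program actually runs: it assumes a uniform background over 20 amino acids, ignores gaps, and applies no small-sample correction. The function name `column_information` is hypothetical.

```python
import math
from collections import Counter


def column_information(alignment, alphabet_size=20):
    """Information content (bits) per alignment column.

    Computed as log2(alphabet_size) minus the Shannon entropy of the residues
    observed in the column. Gap characters ("-") are ignored, and an all-gap
    column scores 0. No small-sample correction is applied.
    """
    max_bits = math.log2(alphabet_size)
    info = []
    for col in range(len(alignment[0])):
        residues = [seq[col] for seq in alignment if seq[col] != "-"]
        if not residues:
            info.append(0.0)
            continue
        counts = Counter(residues)
        total = len(residues)
        entropy = -sum(
            (n / total) * math.log2(n / total) for n in counts.values()
        )
        info.append(max_bits - entropy)
    return info


# Toy alignment invented for illustration: column 2 varies, the rest do not.
toy_alignment = ["AKDE", "ARDE", "AQDE", "AKDE"]
for position, bits in enumerate(column_information(toy_alignment), start=1):
    print(f"column {position}: {bits:.2f} bits")
```

Columns that score close to log2(20) bits are fully conserved in the input. As noted above, a high score is a reason to test a site experimentally, not evidence of function in itself.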