Question: Verifying A Quality Improvement For Multiple Sequence Alignments (MSA)

Aleksandr Levchuk (United States) wrote (8.5 years ago):

Background

Earlier I asked a question on how to measure the "quality" of a multiple sequence alignment: Ab initio methods for inferring quality of Multiple Sequence Alignments

...there is also a duplicate: Multiple sequence alignment score

...and a different question about MSA similarity score: Similarity score of Multiple Sequence Alignment

So I have some tools to measure how "good" the multiple alignments are.

I can also make the MSAs "better" by using the scoring tools in the following way:

  1. Measure the score of the initial MSA.
  2. Remove the next untried sequence from the MSA, re-align, and re-measure the score.
  3. If the score got worse, put the sequence back.
  4. If the removal of every sequence has been tried, STOP; otherwise go to step 2.
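
A minimal sketch of the loop above, assuming the aligner and the scoring tool are passed in as plain functions. The names `greedy_filter`, `toy_align` and `toy_score` are made up for illustration; the two toy functions only stand in for a real aligner and a real quality score.

```python
from typing import Callable, List, Sequence

Alignment = List[str]  # aligned rows, all of the same length


def greedy_filter(
    seqs: Sequence[str],
    align: Callable[[Sequence[str]], Alignment],
    score: Callable[[Alignment], float],
) -> List[str]:
    """Try removing each sequence once; undo the removal only if the score drops."""
    kept = list(seqs)
    best = score(align(kept))              # step 1
    for candidate in list(seqs):           # steps 2-4: every sequence is tried once
        if len(kept) <= 2:
            break                          # an alignment needs at least two rows
        trial = list(kept)
        trial.remove(candidate)
        trial_score = score(align(trial))  # re-align and re-measure
        if trial_score >= best:            # not worse: the removal stands
            kept, best = trial, trial_score
        # otherwise the sequence is "put back" by simply not adopting `trial`
    return kept


def toy_align(seqs: Sequence[str]) -> Alignment:
    """Placeholder aligner (assumption): right-pad with gaps to a common length."""
    width = max(len(s) for s in seqs)
    return [s.ljust(width, "-") for s in seqs]


def toy_score(alignment: Alignment) -> float:
    """Placeholder score (assumption): fraction of columns where all rows agree."""
    cols = list(zip(*alignment))
    return sum(len(set(c)) == 1 for c in cols) / len(cols)


result = greedy_filter(["TTTT", "ACGT", "ACGT", "ACGT"], toy_align, toy_score)
print(result)
```

Following step 3 literally, a tie counts as "not worse", so in this toy run one of the identical sequences is dropped along with the outlier.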

This should work because of the garbage-in-garbage-out nature of MSA. Hopefully, if I filter the input so that it is no longer garbage, I should get a "good" output even for distantly related genes.

The Question

How can I verify that the MSAs actually did get better?

I'm interested in both closely and distantly related groups of proteins.

Things that I tried

...tried to think about.

  1. Verifying against the tiny fraction of protein groups that are known to be related to each other based on evidence from 3D structure superposition.
  2. Building a phylogenetic tree from the MSA and verifying the tree against the known taxonomy of the species, using a simple rule (assumption) that most genes, unless they were horizontally transferred, should have the same species ancestry as the whole organism.
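
One way the tree check in item 2 could be sketched is to count the clades on which a gene tree and the species taxonomy disagree (a rooted variant of the Robinson-Foulds distance). This assumes both trees are rooted, written as nested tuples, and carry exactly one leaf per species; trees inferred from an MSA are often unrooted, so this is only a starting point. The names `clades` and `clade_distance` and the example trees are illustrative.

```python
from typing import FrozenSet, Set, Tuple, Union

Tree = Union[str, Tuple["Tree", ...]]  # a leaf name or a tuple of subtrees


def clades(tree: Tree) -> Set[FrozenSet[str]]:
    """Return the set of leaves found under every internal node of a rooted tree."""
    found: Set[FrozenSet[str]] = set()

    def leaves(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    leaves(tree)
    return found


def clade_distance(gene_tree: Tree, species_tree: Tree) -> int:
    """Count clades that appear in one tree but not in the other."""
    return len(clades(gene_tree) ^ clades(species_tree))


species = ((("human", "chimp"), "mouse"), "yeast")
gene_ok = ((("chimp", "human"), "mouse"), "yeast")   # same topology
gene_odd = ((("human", "mouse"), "chimp"), "yeast")  # human grouped with mouse

print(clade_distance(gene_ok, species))   # 0
print(clade_distance(gene_odd, species))  # 2
```

A lower distance after filtering would suggest that the tree built from the MSA moved closer to the taxonomy, under the stated assumption that the gene was not horizontally transferred.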

Ari wrote (8.5 years ago):

"Goodness" of an MSA depends on the analysis/application you plan to use it for. Although they correlate strongly, evolutionary homology and structural homology are not the same: an alignment that is good for structural studies may give biased results in evolutionary studies and vice versa.

If you infer alignments for structural studies, you should aim to score well in a structural benchmark such as BAliBASE (there may be other, less biased benchmarks available). If you use your alignments for evolutionary analyses, you may want to test your method with simulated phylogenetic data: in my opinion, the best simulation program currently available is INDELible.

Even when you know the true solution (= the simulated alignment), measuring the "goodness" of a test alignment is challenging. A widely used solution is to compare the proportion of correct columns or of correct residue pairs ("sum-of-pairs"). However, the former is far too strict for more than a few sequences, and the latter is problematic because the score depends heavily on how gaps are counted (if at all) and on whether one measures the number of correct pairs, incorrect pairs, or a combination of these. Furthermore, both are meaningless unless they work as proxies for the "goodness" in the actual analysis you want to use your alignments for.
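
To make the two measures concrete, here is a rough sketch under one particular convention: gaps are ignored, only pairs that are aligned in the reference are counted, and rows are assumed to appear in the same order in both alignments. Other conventions will give different numbers, which is exactly the problem described above. The function names are illustrative.

```python
from itertools import combinations
from typing import List, Optional, Set, Tuple

Column = Tuple[Optional[int], ...]


def columns(alignment: List[str]) -> List[Column]:
    """Describe each column by the residue index each row contributes (None = gap)."""
    counters = [0] * len(alignment)
    result = []
    for col in zip(*alignment):
        entry = []
        for row, char in enumerate(col):
            if char == "-":
                entry.append(None)
            else:
                entry.append(counters[row])
                counters[row] += 1
        result.append(tuple(entry))
    return result


def residue_pairs(alignment: List[str]) -> Set[Tuple[int, int, int, int]]:
    """All (row1, index1, row2, index2) residue pairs that share a column."""
    pairs = set()
    for col in columns(alignment):
        for (r1, i1), (r2, i2) in combinations(enumerate(col), 2):
            if i1 is not None and i2 is not None:
                pairs.add((r1, i1, r2, i2))
    return pairs


def sum_of_pairs_score(test: List[str], reference: List[str]) -> float:
    """Fraction of reference residue pairs that are also aligned in the test."""
    ref = residue_pairs(reference)
    return len(ref & residue_pairs(test)) / len(ref) if ref else 0.0


def column_score(test: List[str], reference: List[str]) -> float:
    """Fraction of reference columns (with >= 2 residues) reproduced exactly."""
    ref_cols = [c for c in columns(reference) if sum(i is not None for i in c) >= 2]
    test_cols = set(columns(test))
    return sum(c in test_cols for c in ref_cols) / len(ref_cols) if ref_cols else 0.0


reference = ["AC-GT", "ACTGT", "AC-GT"]
test = ["ACG-T", "ACTGT", "AC-GT"]
sp = sum_of_pairs_score(test, reference)
cs = column_score(test, reference)
print(sp, cs)
```

In this small case one misplaced residue breaks a whole column (3 of 4 columns survive) but only 2 of the 12 reference pairs, which illustrates why the column measure becomes strict as the number of sequences grows.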

If you have a specific application for your alignments and you can simulate realistic sequence data, you can run the full analysis pipeline and see which alignment method/approach gives the most accurate results. Fletcher & Yang (MBE, 2010) did that for inference of positive selection in protein-coding genes and found that the methods performing best in "traditional" MSA benchmarks did surprisingly poorly. To me this indicates that there is no MSA method and no measure of MSA goodness that suits every analysis.

Chris Evelo (Maastricht, The Netherlands) wrote (8.5 years ago):

Could it be that what you really want to do is a PSI-BLAST search, or at least that PSI-BLAST would be helpful? That would mean that you start with a number of related sequences and then use an (automatically generated) alignment matrix based on what they have in common to find other related sequences (usually in other species).

Aleksandr Levchuk replied (8.5 years ago):

We use an HMMER3-based method to update the groups of sequences (protein families). The search space is all of UniProt (15 million sequences as of April). Many different species get picked up. I thought of taking advantage of that for verification purposes.

Bilouweb (Saclay, France) wrote (8.5 years ago):

A multiple alignment is based on a scoring function (often related to conservation). When you construct an MSA, you try to optimize the conservation of columns.

You can get a "better" alignment - by removing, adding or simply moving sequences - because MSA algorithms don't give an exact solution. In fact, you don't obtain a "better" alignment, you obtain an alignment which is better optimized for the scoring function.

So, when you ask "How can I verify that the MSAs actually did get better?", I suppose you want to know if the MSA actually better fits the reality and not just a mathematical function.

That's why you want to test on a small set from BAliBASE (reality is given by 3D structures) or against a phylogenetic tree (reality is given by evolution).

But both are derived from a "mathematical function". Phylogenetic trees can be obtained by different algorithms (with different scores) and 3D structures can be compared with different metrics.

My idea is: an MSA is constructed by a scoring function, so the quality can only be measured by the scoring function. If the function fits the reality, then your MSA fits the reality. We know that it is not so easy: different functions capture different properties.

So, when I measure the conservation of an alignment, I use a few different scoring functions to see if I improve as many of them as possible. If so, I can suppose that my multiple alignment is "better".
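
A small sketch of that check, assuming three simple conservation measures that I made up for illustration (they are not the scoring functions of any particular tool): the share of the most common residue per column, the negative column entropy, and the fraction of identical residue pairs.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Callable, Dict, List

Alignment = List[str]  # aligned rows, all of the same length


def majority_fraction(alignment: Alignment) -> float:
    """Mean share of rows carrying the most common non-gap residue per column."""
    total = 0.0
    for col in zip(*alignment):
        residues = [c for c in col if c != "-"]
        if residues:
            total += Counter(residues).most_common(1)[0][1] / len(col)
    return total / len(alignment[0])


def negative_entropy(alignment: Alignment) -> float:
    """Minus the mean Shannon entropy per column (gap counted as a symbol)."""
    total = 0.0
    for col in zip(*alignment):
        n = len(col)
        total += sum((k / n) * math.log2(k / n) for k in Counter(col).values())
    return total / len(alignment[0])  # 0.0 for a fully conserved alignment


def pair_identity(alignment: Alignment) -> float:
    """Fraction of row pairs, over all columns, showing the same non-gap residue."""
    same = total = 0
    for col in zip(*alignment):
        for a, b in combinations(col, 2):
            total += 1
            same += (a == b and a != "-")
    return same / total if total else 0.0


SCORERS: Dict[str, Callable[[Alignment], float]] = {
    "majority": majority_fraction,
    "entropy": negative_entropy,
    "pairs": pair_identity,
}


def improved(before: Alignment, after: Alignment) -> Dict[str, bool]:
    """Report, per scoring function, whether `after` scores higher than `before`."""
    return {name: fn(after) > fn(before) for name, fn in SCORERS.items()}


report = improved(["ACGT", "ACGT", "TTTT"], ["ACGT", "ACGT"])
print(report)
```

If most entries in the report are `True`, the alignment is better optimized for several functions at once, which is the kind of evidence described above.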

Aleksandr Levchuk replied (8.5 years ago):

+1 about better fitting the reality. I also think trying different scoring functions is a really good idea. The one function that stands out would also reveal what type of relationship the sequences have: physicochemical, close evolutionary relatives, distant, or similar structure (e.g. transmembrane domains).

Cjt wrote (8.5 years ago):

For validation I would apply an in-silico study: take an arbitrary input sequence and generate a second one from it by mutation. Then, iteratively, choose any existing sequence and generate a child sequence until you have reached the desired number of candidates. The immense advantage of going this way is that you have a reliable phylogeny that your method needs to find. I think you should test all the mutation principles you can think of (ins/del/translocations/...) to get an estimate of the gain of your method.
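
A rough sketch of such a simulation, assuming a uniform per-site mutation rate and only substitutions, single-residue insertions and deletions (translocations are left out). The function names, the rate and the example root sequence are illustrative; a dedicated simulator would be more realistic.

```python
import random
from typing import List, Optional, Tuple

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def mutate(seq: str, rng: random.Random, rate: float = 0.1) -> str:
    """Mutate each site with probability `rate` (substitution, insertion or deletion)."""
    out = []
    for residue in seq:
        if rng.random() >= rate:
            out.append(residue)                # site left unchanged
            continue
        kind = rng.choice(("sub", "ins", "del"))
        if kind == "sub":
            out.append(rng.choice(ALPHABET))   # may redraw the same residue
        elif kind == "ins":
            out.extend((residue, rng.choice(ALPHABET)))
        # "del": the residue is skipped
    return "".join(out)


def simulate_family(
    root: str, n: int, seed: int = 0
) -> Tuple[List[str], List[Optional[int]]]:
    """Grow a family by repeatedly picking any existing sequence and mutating it."""
    rng = random.Random(seed)
    seqs: List[str] = [root]
    parent: List[Optional[int]] = [None]       # parent[i] = index of i's ancestor
    while len(seqs) < n:
        p = rng.randrange(len(seqs))
        seqs.append(mutate(seqs[p], rng))
        parent.append(p)
    return seqs, parent


seqs, parent = simulate_family("MKTAYIAKQRQISFVKSHFSRQ", 8, seed=1)
for i, (s, p) in enumerate(zip(seqs, parent)):
    print(i, p, s)
```

The `parent` list records the true phylogeny, so a tree inferred from the MSA before and after filtering can be compared against it.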

By the way: I'm not convinced that your method will result in "better" alignments. There might be conflict cases where sequences indicate a different way of evolution. To detect these, you will have to remove several sequences from the alignment at the same time. And when you check all combinations of removable sequences, you are right back to the tree reconstruction methods. And of course you will see all their problems, such as NP-completeness, etc.

Bilouweb replied (8.5 years ago):

I agree with this. The complexity arises quickly when you try to align sequences manually.

Jan Kosinski wrote (8.5 years ago):

I would focus on improving the alignment of the outliers rather than removing them to "improve" the MSA. The outliers may be the most interesting members of your sequence family: proteins with unusual properties, enzymes with interesting specificities, evolutionary intermediates to more distant sequence families.

With your methodology you don't make the MSA better; you make it tell you less about your protein family.

A direct answer to your current question is perhaps here: Issues in bioinformatics benchmarking: the case study of multiple sequence alignment

Perhaps also good to read: A Comprehensive Benchmark Study of Multiple Sequence Alignment Methods: Current Challenges and Future Perspectives

Michael Kuhn (EMBL Heidelberg) wrote (8.5 years ago):

There are methods that compute a "quality score" for an alignment, e.g. norMD. See also this related question, which sounds a lot like yours.

Aleksandr Levchuk replied (8.5 years ago):

+1 My question is about using these methods to purify the MSAs. So your suggestion is to use the same scoring methods to see if the new MSAs improved?