Question

Effective Scoring Function To Compare Multiple Sequence Alignments Or Phylogenetic Trees

11.3 years ago

Pappu ★ 2.1k

I am wondering if there is any good scoring function to compare MSAs obtained from ClustalW, Muscle, T-coffee, Kalign etc. I read that the differences are not statistically significant. Is it true? I have the same question in case of phylogenetic trees.

msa python • 4.2k views

updated 11.3 years ago by aidan-budd 1.9k • written 11.3 years ago by Pappu ★ 2.1k

score 5 · Answer 1 · 2013-01-09

What constitutes a "good scoring function" depends on the question you're trying to address by building the MSA. So it's hard/impossible to give a definitive answer to this question.

The ideal scoring function would always give the best score to the "true" MSA, i.e. the alignment in which all residues placed in the same column share the property you want them to share (greatest structural-context similarity, "homology" etc.), and in which each column contains every residue in the alignment that belongs in it.

There is a range of different (and, if you're geeky about these things like me, fascinating) issues associated with trying to develop an effective/"good" scoring scheme. For example, to develop such a scoring scheme, we presumably need some kind of benchmark of "true" alignments to test it against. But how do we identify "true" alignments (or at least columns within alignments which we believe have the properties we want them to have) without using a scoring scheme? This issue is very interestingly discussed in a recent paper by Iantorno et al.: "Who Watches the Watchmen? An Appraisal of Benchmarks for Multiple Sequence Alignment", 2012, Stefano Iantorno, Kevin Gori, Nick Goldman, Manuel Gil, Christophe Dessimoz, arXiv:1211.2160 [q-bio.QM] http://pubget.com/paper/pgtmp_12112160/Who_Watches_the_Watchmen__An_Appraisal_of_Benchmarks_for_Multiple___Sequence_Alignment

Here's a recent description of some different ways of scoring alignments, published last year by Blackburne and Simon Whelan http://bioinformatics.oxfordjournals.org/content/28/4/495.long
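To make "comparing two alignments" concrete, here is a minimal Python sketch of one simple, commonly used measure: the fraction of residue pairs aligned in a reference alignment that are also aligned in a test alignment (a sum-of-pairs style score). It is not necessarily one of the measures from the paper linked above, and the function names and toy sequences are made up for illustration. It assumes both alignments contain the same sequences under the same names and that "-" is the only gap character.

```python
from itertools import combinations


def residue_pairs(alignment):
    """Return the set of residue pairs that an alignment places in the same column.

    `alignment` maps sequence name -> gapped string (all rows the same length).
    A residue is identified as (name, index in the ungapped sequence).
    """
    names = sorted(alignment)
    lengths = {len(alignment[name]) for name in names}
    if len(lengths) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    n_cols = lengths.pop()

    next_index = {name: 0 for name in names}  # residues seen so far per sequence
    pairs = set()
    for col in range(n_cols):
        present = []
        for name in names:
            if alignment[name][col] != "-":
                present.append((name, next_index[name]))
                next_index[name] += 1
        # Every pair of residues sharing this column counts as "aligned".
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_score(test, reference):
    """Fraction of the reference's aligned residue pairs recovered by `test`."""
    ref_pairs = residue_pairs(reference)
    if not ref_pairs:
        raise ValueError("reference alignment has no aligned residue pairs")
    return len(residue_pairs(test) & ref_pairs) / len(ref_pairs)


# Toy example (hypothetical sequences): the two alignments differ only in
# where the gap in s2 is placed.
reference = {"s1": "ACG-T", "s2": "AC-GT", "s3": "ACGGT"}
test = {"s1": "ACG-T", "s2": "ACG-T", "s3": "ACGGT"}
print(sum_of_pairs_score(test, reference))
```

Note that a score like this only tells you how similar the test alignment is to the reference; whether the reference is "true" is exactly the benchmark problem discussed above.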

To be pragmatic, Thompson et al. describe a way of scoring alignments that you can (or could last time I checked) get via a webserver. Article: http://bioinformatics.oxfordjournals.org/content/28/4/495.long Server: http://bips.u-strasbg.fr/PipeAlign/

To be even more pragmatic: in my (reasonable amount of) experience working with, and teaching, MSAs, I've rarely (never?) used scores to discriminate between alignments like this. That's not to say there aren't contexts in which I could imagine wanting to do it. It's just that if I'm/we're not sure which is the best of the alignments we have to choose from (or whether the alignment we're looking at is "good enough"), we tend to look at the alignments ourselves in alignment viewers (JalView, ClustalX, SEAVIEW etc.). Obvious "errors" (likely "wrong" sequence, large insertions/deletions, clearly very-difficult-to-align regions) we can spot by eye. Other regions are often almost certainly correct (there's a set of interesting assumptions lying behind that assessment of certainty...), and these regions most tools will align correctly. If the reason we're building the MSA requires only that these regions are correctly aligned, and not the others, then the alignment is good enough, and I go on to worry about something else instead.

Hm, I suspect I need to learn about structuring answers to these kinds of questions concisely. Friendly, constructive criticism appreciated! :)

score 2 · Answer 2 · 2013-01-10

For trees, there may well be a "scoring scheme" associated with your trees already i.e....

If you used maximum likelihood (ML) to estimate the "best" phylogeny to explain your data, then this would be the likelihood score for the tree. (As you might expect, ML methods propose as the "best" tree the one that makes the observed data, i.e. the multiple sequence alignment used to estimate the tree, most probable.)

If you used maximum parsimony, this would be the parsimony score for the tree etc.
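As an illustration of what such a built-in score looks like, here is a small Python sketch that computes a parsimony score with Fitch's algorithm (the minimum number of state changes a tree needs to explain each alignment column, summed over columns). The tree representation (nested 2-tuples of leaf names), the function names and the toy data are all my own choices for the example. It assumes a rooted binary tree and unit cost for every change, and it naively treats a gap as just another character state, which real programs may handle differently.

```python
def fitch_column(tree, states):
    """Fitch's algorithm for one alignment column.

    `tree` is either a leaf name (str) or a 2-tuple (left_subtree, right_subtree).
    `states` maps leaf name -> character observed in this column.
    Returns (possible states at this node, minimum number of changes below it).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_column(left, states)
    right_set, right_cost = fitch_column(right, states)
    shared = left_set & right_set
    if shared:
        # The children can agree on a state: no extra change needed here.
        return shared, left_cost + right_cost
    # The children disagree: one extra change is needed on this part of the tree.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the Fitch cost over all columns of `alignment` (name -> gapped string)."""
    n_cols = len(next(iter(alignment.values())))
    total = 0
    for col in range(n_cols):
        column = {name: seq[col] for name, seq in alignment.items()}
        total += fitch_column(tree, column)[1]
    return total


# Toy data (hypothetical): two candidate topologies for the same alignment.
alignment = {"A": "AAG", "B": "AAG", "C": "CAT", "D": "CAT"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(parsimony_score(tree_1, alignment))  # 2 changes
print(parsimony_score(tree_2, alignment))  # 4 changes
```

Under parsimony the first topology would be preferred here because it needs fewer changes. Note that both scores are conditional on the alignment, so a different MSA of the same sequences can rank the trees differently.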

However, I guess that rather than asking "what is the best tree?" you're more interested in asking "which trees are not 'significantly' worse than my best tree?" (which is what Whetting, in another answer here, is helping you with).

As Whetting indicates, there are tests out there that allow you to do this kind of thing: ILD (I think only applicable in a parsimony framework...?), SH, and others.

These tests need to be used with considerable caution, though! This article, written by a bunch of phylogeneticists who understand the issues far better than I do, describes some of the issues:

Likelihood-based tests of topologies in phylogenetics. 2000. Syst Biol 49(4):652-670. Goldman, N, Anderson, JP, and Rodrigo, AG http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=12116432

One important message to come from this is "in almost all cases do not use the Kishino-Hasegawa (KH) test!" which was one of the first tests of this kind to be described; the test itself is not "bad" but the specific question being addressed by the test is very very rarely the question the user wants to ask, leading to the results of the test being very commonly badly misinterpreted.

This highlights one of the things that makes using these tests tricky - understanding well what is actually being tested.
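To show what "what is actually being tested" means for the KH test, here is a rough Python sketch of its normal-approximation form as I understand it. It takes per-site log-likelihoods for two trees and asks whether the total difference is distinguishable from zero. The function name and the numbers are made up; in practice the per-site log-likelihoods would come from your phylogenetics software. The key assumption, and the reason the test is so often misused, is that both trees were chosen a priori, without looking at the data. If one of them is the ML tree estimated from the same alignment, this p-value does not mean what you want it to mean.

```python
import math


def kh_test(site_lnl_a, site_lnl_b):
    """Normal-approximation KH-style comparison of two a priori trees.

    Arguments are per-site log-likelihoods of the same alignment under
    tree A and tree B. Returns (delta, z, two_sided_p), where delta is the
    total log-likelihood difference (A minus B).
    """
    if len(site_lnl_a) != len(site_lnl_b):
        raise ValueError("both trees must be scored on the same sites")
    n = len(site_lnl_a)
    if n < 2:
        raise ValueError("need at least two sites to estimate a variance")

    diffs = [a - b for a, b in zip(site_lnl_a, site_lnl_b)]
    delta = sum(diffs)
    mean = delta / n
    # Variance of the summed difference, estimated from site-to-site variation.
    var_delta = n / (n - 1) * sum((d - mean) ** 2 for d in diffs)
    if var_delta == 0:
        raise ValueError("per-site differences do not vary; test is undefined")

    z = delta / math.sqrt(var_delta)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value under N(0, 1)
    return delta, z, p


# Made-up per-site log-likelihoods for two hypothetical trees.
lnl_tree_a = [-1.0, -2.0, -3.0, -4.0]
lnl_tree_b = [-1.5, -1.5, -3.5, -3.5]
delta, z, p = kh_test(lnl_tree_a, lnl_tree_b)
print(delta, z, p)
```

Tests such as SH differ mainly in how they account for the fact that the trees being compared were selected using the data, so treat this sketch as an aid to understanding the question being asked, not as something to run on your own trees.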

To be pragmatic - I am fairly sure that the SH test (and probably others) is implemented in the PAML package, and that PUZZLE (or TREE-PUZZLE, forget what it's called these days) does this too.

score 2 · Answer 3 · 2013-01-10

A final "answer" to your question from me.

Your question nicely signposts/highlights the deep interconnectedness of the "problems" of (i) multiple sequence alignment and (ii) phylogenetic tree estimation.

Perhaps we're interested in somehow combining assessments of alignment and tree uncertainty.

In that case, we could try to carry out joint estimation of alignment and tree, ranking near-optimal pairs of (alignment, tree) using a probabilistic model. Such an approach is implemented, for example, in the BAli-Phy software:

BAli-Phy: simultaneous Bayesian inference of alignment and phylogeny. 2006. Bioinformatics 22(16):2047-8. Suchard, MA and Redelings, BD

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=16679334

This approach is conceptually attractive. At some level, we can think of the tree and alignment estimation problems as being the "same" problem, and what we are really looking for is often more-or-less exactly what BAli-Phy gives us.

Unfortunately, however, the software is really slow and takes a long time to run even for smallish datasets.

Still, it might be worth having a play with it. At the very least, reading the papers gives some interesting insights into the nature of the problem that can be illuminating.

score 0 · Answer 4 · 2013-01-09

I agree with aidan-budd. I think it will be very hard to create a scoring algorithm for alignments. However, statistical tests for phylogenies do exist. These are not exactly scoring functions, but so-called hypothesis testing methods are described in the literature. The most basic test is probably the incongruence length difference (ILD) test in PAUP*. While this test was developed to test incongruence between domains, at its core it compares phylogenetic trees. The Shimodaira-Hasegawa test is an alternative...