Ab Initio Unsupervised Methods For Inferring Quality Of Multiple Sequence Alignments
10.9 years ago

MSA programs make mistakes. Even if no mistakes are made and the output MSA is "optimal", we could still be looking at something biologically meaningless (e.g. the input contains divergent sequences: the evolutionary signal between the input sequences has been completely wiped out by mutations because those genes were under different evolutionary pressures).

There seem to be 10 or so publications on automated MSA quality scoring.

EDIT (6 months later): Looks like I was wrong; there are not ~10 but ~4 unsupervised methods out there: norMD, Gblocks, HoT, and GUIDANCE. There is also PSAR, which claims to be better than GUIDANCE, but it was only tested on DNA sequences. And only a few MSA aligners report site-specific confidence: SOAP, T-COFFEE, FSA.

Does anyone have experience with how these methods perform in practice?

multiple alignment quality • 3.1k views

A simpler version of this question: Which Msa Scoring Methods Did You Use?

10.9 years ago
lh3 32k

Sorry for my bad English, but I am not sure what you are asking. Are you asking "how can I compare MSA programs without knowing the reference alignment?", or asking "what MSA program should I use in practice?", or asking "how many methods are available to evaluate MSA?" To the third question, everyone is using pretty much the same set of measurements. To the second question, this page may help a little.

To the first question, if it were me, I would evaluate the resulting phylogenetic tree instead of the alignment. Given a set of genes from species with a known phylogeny, we can perform MSA with different programs, reconstruct trees with ML/phyml and then evaluate the accuracy of gene trees by counting the minimum number of duplication events that explain the gene evolution. This measurement is powerful, although some argue that the best MSA does not necessarily lead to the best tree.
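As a rough illustration of the duplication-counting idea above, here is a minimal Python sketch using the standard LCA mapping between a gene tree and a species tree. The nested-tuple tree format, the function names, and the `species_gene` leaf naming convention are assumptions made for this example, not part of any particular tool.

```python
def species_clades(tree):
    """Collect the leaf set of every clade in a nested-tuple species tree."""
    clades = []

    def walk(node):
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*(walk(child) for child in node))
        clades.append(leaves)
        return leaves

    walk(tree)
    return clades


def lca(clades, species):
    """Smallest species-tree clade that contains all the given species."""
    return min((c for c in clades if species <= c), key=len)


def count_duplications(gene_tree, species_tree,
                       species_of=lambda gene: gene.split("_")[0]):
    """Count gene-tree nodes inferred as duplications by LCA reconciliation.

    species_of maps a gene leaf name to its species; the default assumes
    hypothetical names of the form "species_geneid".
    """
    clades = species_clades(species_tree)
    duplications = 0

    def walk(node):
        nonlocal duplications
        if isinstance(node, str):
            return lca(clades, frozenset([species_of(node)]))
        child_maps = [walk(child) for child in node]
        here = lca(clades, frozenset().union(*child_maps))
        # A node mapping to the same species clade as one of its children
        # cannot be explained by speciation alone, so call it a duplication.
        if any(here == m for m in child_maps):
            duplications += 1
        return here

    walk(gene_tree)
    return duplications


species_tree = (("human", "mouse"), "chicken")
congruent = (("human_g1", "mouse_g1"), "chicken_g1")
incongruent = (("human_g1", "chicken_g1"), "mouse_g1")
print(count_duplications(congruent, species_tree),
      count_duplications(incongruent, species_tree))
```

Under this scheme, gene trees built from different MSA programs would be compared by their total duplication counts over many gene families, with fewer inferred duplications taken as a sign of a better tree.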

EDIT: as I am now clearer about your question...

I know MSA is used for homology searches and other purposes, but my experience mainly comes from building gene trees (not species trees). I think gene trees are more informative than MSA. On a gene tree, a lot of weirdness (e.g. pseudogenes and distant paralogs) stands out which you cannot easily see from MSA.

BTW, for my purpose (gene tree building) I do not care too much about the theoretical accuracy of these MSA programs. I only choose conservative columns and sometimes try a clustalx heuristic to screen out obviously misaligned segments.
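For readers who want to try something like the "conservative columns" filter, here is a minimal Python sketch. The two thresholds (`max_gap_fraction`, `min_identity`) and the function names are assumptions chosen for illustration; the post does not say which criteria were actually used.

```python
from collections import Counter


def conserved_columns(alignment, max_gap_fraction=0.2, min_identity=0.7):
    """Return indices of columns with few gaps and a dominant residue.

    alignment is a list of equal-length aligned strings with "-" for gaps.
    Both thresholds are illustrative defaults, not recommended values.
    """
    n_seqs = len(alignment)
    keep = []
    for i, column in enumerate(zip(*alignment)):
        residues = [c for c in column if c != "-"]
        if not residues:
            continue  # all-gap column
        gap_fraction = 1 - len(residues) / n_seqs
        top_count = Counter(residues).most_common(1)[0][1]
        identity = top_count / len(residues)
        if gap_fraction <= max_gap_fraction and identity >= min_identity:
            keep.append(i)
    return keep


def extract_columns(alignment, indices):
    """Build a trimmed alignment from the selected column indices."""
    return ["".join(seq[i] for i in indices) for seq in alignment]


msa = ["MKV-LA", "MKI-LA", "MRVGLS"]
kept = conserved_columns(msa)
print(kept, extract_columns(msa, kept))
```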


+1 Your English is perfect. Those are all important questions, but I'm wondering about a more general question of quality. I added an intro paragraph to my question; hopefully it will better describe what I mean.

10.9 years ago

Landan and Graur (2008) propose a simple reliability measure that you might want to consider, based on the discrepancy between the alignments produced from a set of sequences in the forward and reverse directions:

"The proposed methodology is based upon the a priori expectation that sequence alignment results should be independent of the orientation of the input sequences. Thus, for totally unambiguous cases, reversing residue order prior to alignment should yield an exact reversed alignment of that obtained by using the unreversed sequences. Such “ideal” alignments, however, are the exception in real life settings, and the two alignments, which we term the heads and tails alignments, are usually different to a greater or lesser degree. The degree of agreement or discrepancy between these two alignments may be used to assess the reliability of the sequence alignment. Furthermore, any alignment dependent sequence analysis protocol can be carried out separately for each of the two alignments, and the two sets of results may be compared with each other, providing us with valuable information regarding the robustness of the whole analytical process. The heads-or-tails (HoT) methodology can be easily implemented for any choice of alignment method and for any subsequent analytical protocol."

Some further thoughts on the HoT method can be found on Thomas Mailund's blog.
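The heads-versus-tails comparison described in the quote can be sketched in a few lines of Python. This is only an illustration of the idea, not the authors' implementation: the aligner itself is left out, the function names are made up for this example, and both alignments are assumed to list the same sequences in the same row order.

```python
def reverse_rows(rows):
    """Reverse every sequence (or aligned row) end to end."""
    return [row[::-1] for row in rows]


def column_signatures(alignment):
    """Describe each column by the residue index it holds in each row.

    Gaps ("-") are recorded as None, so two columns are equal only if they
    align exactly the same residues of exactly the same sequences.
    """
    counters = [0] * len(alignment)
    signatures = []
    for column in zip(*alignment):
        signature = []
        for row, char in enumerate(column):
            if char == "-":
                signature.append(None)
            else:
                signature.append(counters[row])
                counters[row] += 1
        signatures.append(tuple(signature))
    return signatures


def hot_column_support(heads, tails):
    """Flag each heads column that is reproduced in the tails alignment.

    heads: alignment of the original sequences.
    tails: alignment of the reversed sequences, exactly as the aligner
           produced it; it is flipped back here before comparison.
    """
    tails_columns = set(column_signatures(reverse_rows(tails)))
    return [sig in tails_columns for sig in column_signatures(heads)]


heads = ["AC-GT", "ACAGT"]   # alignment of ACGT and ACAGT
tails = ["TGC-A", "TGACA"]   # hypothetical alignment of TGCA and TGACA
support = hot_column_support(heads, tails)
print(support, sum(support) / len(support))
```

Columns flagged `False` are the ones whose placement depends on sequence orientation, which is where the alignment is least trustworthy under this measure.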


+1 Sounds very interesting.


+1 It's elegant but does not go very deep into disturbing the MSA program. The new method GUIDANCE was found to be much more effective than HoT. The authors describe it in a new paper (An Alignment Confidence Score Capturing Robustness to Guide Tree Uncertainty), where they perturb the guide tree to reveal even more of the MSA program's uncertainty. They also developed a web service for this (http://guidance.tau.ac.il).

10.9 years ago
Paulo Nuin ★ 3.7k

I would use Qscore, from the developer of Muscle. I used some quality metrics in my 2006 paper, and back then we used some lab-developed software to calculate the scores. But I guess Qscore uses a similar metric.


Not exactly, but how do you measure quality without the "correct" alignment?


+1 This is interesting...


+1 I will try Qscore, but I'm not sure if we have good "reference" alignments.


That's one problem with any quality score for alignments. Depending on the type of sequences you are aligning, you might need to rely on the best software available and then check by eye.


Qscore, SP and TC are pretty much what everyone is using. But are these "ab initio" methods, given that they all require a reference alignment?
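To make the dependence on a reference concrete, here is a minimal Python sketch of the usual definitions: SP as the fraction of reference residue pairs that are also aligned in the test alignment, and TC as the fraction of reference columns reproduced exactly. The function names are made up for this example, and real tools may differ in details such as how gap-only or single-residue columns are treated.

```python
from itertools import combinations


def residue_columns(alignment):
    """Turn each column into a set of (row, residue index) members."""
    positions = [0] * len(alignment)
    columns = []
    for column in zip(*alignment):
        members = []
        for row, char in enumerate(column):
            if char != "-":
                members.append((row, positions[row]))
                positions[row] += 1
        columns.append(frozenset(members))
    return columns


def sp_and_tc(test, reference):
    """Score a test alignment against a reference of the same sequences.

    Reference columns with fewer than two residues are skipped, since they
    contain no aligned pair. Raises ZeroDivisionError if nothing remains.
    """
    test_cols = residue_columns(test)
    ref_cols = [c for c in residue_columns(reference) if len(c) > 1]
    test_pairs = {frozenset(p) for c in test_cols for p in combinations(c, 2)}
    ref_pairs = {frozenset(p) for c in ref_cols for p in combinations(c, 2)}
    test_col_set = set(test_cols)
    sp = len(ref_pairs & test_pairs) / len(ref_pairs)
    tc = sum(c in test_col_set for c in ref_cols) / len(ref_cols)
    return sp, tc


reference = ["AC-GT", "ACAGT", "AC-GT"]
test = ["A-CGT", "ACAGT", "AC-GT"]
print(sp_and_tc(test, reference))
```

Both numbers are defined only relative to `reference`, which is why scores of this kind cannot be called ab initio.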


That is my answer below. We can evaluate trees instead of alignments. After all, MSA of protein sequences is mainly used for building trees. In addition, a few papers use secondary structure or protein domain overlapping to evaluate MSA, although I do not quite like it.


And what is the "correct" tree? And alignments are not used only for trees.


First, I did not say MSA is "only" used for trees. I know MSA can also be used for homology searches and so on. Second, you should read my answer below. I made it clear how to evaluate a gene tree without knowing the "correct" tree, although this method is more useful when applied to many gene trees. Why do we have to know the "correct" answer, which itself is questionable? There are ways to evaluate without knowing the truth.


I read your answer below, and there's no "known phylogeny"; phylogenies are just estimates. And even if you measure the "minimum number of duplication events", there's no guarantee that you have the "correct" or "known phylogeny". Trees and alignments are just estimates. You can measure the alignment quality, or whether the tree makes sense in the light of statistics, but you will never be 100% sure in any case.


I am a little surprised that you criticize my approach for not being "100%" accurate; you cannot guarantee the correctness of MSA, either. If it were me, I would say my method does not have enough power. If you criticized it in that way, I would say having an independent measurement is necessary when every MSA program is optimized on the same few benchmark data sets.

10.9 years ago
Tom Walsh ▴ 550

If this is for protein MSA you could try the following structure-based methods that don't require a reference alignment (but obviously work only on data sets for which you have structures).

The Oxbench package (http://www.compbio.dundee.ac.uk/Software/Oxbench/oxbench.html) also includes a structure-based metric. Full disclosure: I work in the group that developed Oxbench.

T-Coffee (www.tcoffee.org) includes the iRMSD metric for assessing alignment quality.

APDB (http://www.ncbi.nlm.nih.gov/pubmed/17032685)

10.9 years ago
Rm 8.1k

If you have structures for the protein sequences, or if they are homologous to any PDB structures, you can try: "QUASAR—scoring and ranking of sequence–structure alignments"

PROMALS web server for accurate multiple protein sequence alignments.


+1 Thanks for the links
