Question

Ab Initio Unsupervised Methods For Inferring Quality Of Multiple Sequence Alignments

6

13.4 years ago

Aleksandr Levchuk 3.2k

MSA programs make mistakes. Even if no mistakes are made and the output MSA is "optimal", we could still be looking at something biologically meaningless (e.g. the input contains divergent sequences: the evolutionary signal between the input sequences has been completely wiped out by mutations because those genes were under different evolutionary pressures).

~~There seems to be 10 or so publications on automated MSA quality scoring.~~

EDIT (6 months later): Looks like I was wrong, it's not ~10 but ~4 unsupervised methods out there: norMD, Gblocks, HoT, and GUIDANCE. There is also PSAR, which claims to be better than GUIDANCE, but it was only tested on DNA sequences. And only a few MSA aligners report site-specific confidence: SOAP, T-COFFEE, FSA.

Does anyone have experience on how these methods perform in practice?

multiple alignment quality • 4.7k views

ADD COMMENT • link updated 12.6 years ago by Casey Bergman 18k • written 13.4 years ago by Aleksandr Levchuk 3.2k

0

A simpler version of this question: Which Msa Scoring Methods Did You Use?

ADD REPLY • link updated 4.6 years ago by Ram 43k • written 12.6 years ago by Aleksandr Levchuk 3.2k

0

A norMD question: Anyone Use Normd As A Quality Control For Msas

ADD REPLY • link updated 4.6 years ago by Ram 43k • written 12.6 years ago by Aleksandr Levchuk 3.2k

score 4 · Answer 1 · 2010-11-16

Sorry for my bad English, but I am not sure what you are asking. Are you asking "how can I compare MSA programs without knowing the reference alignment?", or asking "what MSA program should I use in practice?", or asking "how many methods are available to evaluate MSA?" To the third question, everyone is using pretty much the same set of measurements. To the second question, this page may help a little.

To the first question, if it were me, I would evaluate the resulting phylogenetic tree instead of the alignment. Given a set of genes from species with a known phylogeny, we can perform MSA with different programs, reconstruct trees with ML/phyml and then evaluate the accuracy of gene trees by counting the minimum number of duplication events that explain the gene evolution. This measurement is powerful, although some argue that the best MSA does not necessarily lead to the best tree.
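The duplication-counting step can be sketched as follows. This is a minimal illustration, not the answerer's actual pipeline: it assumes trees are written as nested Python tuples, that every gene-tree leaf is labelled directly with its species name, and that duplications are inferred with the usual last-common-ancestor (LCA) mapping onto the species tree. The function names are hypothetical, and the alignment and ML tree-building steps are not shown.

```python
def species_clades(tree):
    """Collect the leaf set (clade) of every node in the species tree."""
    clades = []

    def walk(node):
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*(walk(child) for child in node))
        clades.append(leaves)
        return leaves

    walk(tree)
    return clades


def lca(clades, species):
    """Smallest species-tree clade that contains all the given species."""
    return min((c for c in clades if species <= c), key=len)


def count_duplications(gene_tree, species_tree):
    """Count gene-tree nodes that map to the same species-tree node as a child."""
    clades = species_clades(species_tree)
    dups = 0

    def walk(node):
        nonlocal dups
        if isinstance(node, str):
            return lca(clades, frozenset([node]))
        child_maps = [walk(child) for child in node]
        here = lca(clades, frozenset().union(*child_maps))
        if here in child_maps:  # LCA-mapping criterion for a duplication
            dups += 1
        return here

    walk(gene_tree)
    return dups


species = (("human", "chimp"), "mouse")
congruent = (("human", "chimp"), "mouse")
conflicting = (("human", "mouse"), ("chimp", "mouse"))
print(count_duplications(congruent, species))    # 0
print(count_duplications(conflicting, species))  # 1
```

Under this scheme, a gene tree built from a better alignment would be expected to need fewer inferred duplications, summed over many gene families.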

EDIT: as I am now clearer about your question...

I know MSA is used for homology searches and other purposes, but my experience mainly comes from building gene trees (not species trees). I think gene trees are more informative than MSAs. On a gene tree, a lot of weirdness (e.g. pseudogenes and distant paralogs) stands out that you cannot easily see in an MSA.

BTW, for my purpose (gene tree building) I do not care too much about the theoretical accuracy of these MSA programs. I only choose conserved columns and sometimes try a clustalx heuristic to screen out obviously misaligned segments.
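Keeping only conserved columns could look roughly like the sketch below. The thresholds (`min_identity`, `max_gap_fraction`) and function names are assumptions chosen for illustration, since the answer does not say which criteria were used, and the clustalx heuristic is not reproduced here.

```python
from collections import Counter


def conserved_columns(alignment, min_identity=0.8, max_gap_fraction=0.2, gap="-"):
    """Return indices of columns whose most common residue is frequent enough."""
    n_rows = len(alignment)
    keep = []
    for i, column in enumerate(zip(*alignment)):
        residues = [c for c in column if c != gap]
        gap_fraction = 1 - len(residues) / n_rows
        if not residues or gap_fraction > max_gap_fraction:
            continue  # too gappy to trust
        top_count = Counter(residues).most_common(1)[0][1]
        if top_count / n_rows >= min_identity:
            keep.append(i)
    return keep


def select_columns(alignment, columns):
    """Build a trimmed alignment containing only the chosen columns."""
    return ["".join(row[i] for i in columns) for row in alignment]


msa = ["ACDE-", "ACDF-", "ACEEK", "ACDEK"]
kept = conserved_columns(msa)
print(kept)                       # [0, 1]
print(select_columns(msa, kept))  # ['AC', 'AC', 'AC', 'AC']
```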

score 2 · Answer 2 · 2010-11-16

Landan and Graur (2008) propose a simple reliability measure that you might want to consider. It is based on the discrepancy between the alignments produced from the same set of sequences in the forward and reverse directions:

"The proposed methodology is based upon the a priori expectation that sequence alignment results should be independent of the orientation of the input sequences. Thus, for totally unambiguous cases, reversing residue order prior to alignment should yield an exact reversed alignment of that obtained by using the unreversed sequences. Such “ideal” alignments, however, are the exception in real life settings, and the two alignments, which we term the heads and tails alignments, are usually different to a greater or lesser degree. The degree of agreement or discrepancy between these two alignments may be used to assess the reliability of the sequence alignment. Furthermore, any alignment dependent sequence analysis protocol can be carried out separately for each of the two alignments, and the two sets of results may be compared with each other, providing us with valuable information regarding the robustness of the whole analytical process. The heads-or-tails (HoT) methodology can be easily implemented for any choice of alignment method and for any subsequent analytical protocol."
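The heads-or-tails idea quoted above can be sketched in a few lines. This is an illustration under assumptions, not the published implementation: `align` stands for any function that takes a list of sequences and returns equal-length gapped rows, `pad_right` is a deliberately naive toy stand-in for a real aligner, and the column-agreement score is just one simple way to quantify the discrepancy.

```python
def reverse_rows(rows):
    """Reverse every sequence (or alignment row) end to end."""
    return [row[::-1] for row in rows]


def columns_as_indices(alignment, gap="-"):
    """Describe each column by the residue index it holds per sequence (None = gap)."""
    counters = [0] * len(alignment)
    columns = set()
    for column in zip(*alignment):
        key = []
        for row, ch in enumerate(column):
            if ch == gap:
                key.append(None)
            else:
                key.append(counters[row])
                counters[row] += 1
        columns.add(tuple(key))
    return columns


def hot_column_score(sequences, align):
    """Fraction of heads-alignment columns that also appear in the tails alignment."""
    heads = align(sequences)
    # Align the reversed sequences, then flip the result back to the original orientation.
    tails = reverse_rows(align(reverse_rows(sequences)))
    heads_cols = columns_as_indices(heads)
    if not heads_cols:
        raise ValueError("empty alignment")
    return len(heads_cols & columns_as_indices(tails)) / len(heads_cols)


def pad_right(seqs):
    """Toy direction-dependent 'aligner': pads shorter sequences with trailing gaps."""
    width = max(len(s) for s in seqs)
    return [s.ljust(width, "-") for s in seqs]


print(hot_column_score(["ACGT", "ACGA"], pad_right))  # 1.0
print(hot_column_score(["ACGT", "ACG"], pad_right))   # 0.0
```

In practice `align` would wrap a call to a real MSA program. A score near 1 means the heads and tails alignments largely agree, while low scores flag alignments whose columns depend on the orientation of the input.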

Some further thoughts on the HoT method can be found on Thomas Mailund's blog.

score 1 · Answer 3 · 2010-11-15

1

13.4 years ago

Paulo Nuin ★ 3.7k

I would use Qscore, from the developer of MUSCLE. I used some quality metrics in my 2006 paper, and at the time we used lab-developed software to calculate the scores. But I guess Qscore uses a similar metric.

ADD COMMENT • link 13.4 years ago by Paulo Nuin ★ 3.7k

1

Not exactly, but how do you measure quality without the "correct" alignment?

ADD REPLY • link 13.4 years ago by Paulo Nuin ★ 3.7k

0

+1 This is interesting...

ADD REPLY • link 13.4 years ago by Aleksandr Levchuk 3.2k

0

+1 I will try Qscore, but not sure if we have good "reference" alignments

ADD REPLY • link 13.4 years ago by Aleksandr Levchuk 3.2k

0

That's one problem of any quality score for alignments. Depending on the type of sequences you are aligning you might need to rely on the best software available and then check by eye.

ADD REPLY • link 13.4 years ago by Paulo Nuin ★ 3.7k

0

Qscore, SP and TC are pretty much what everyone is using. But are these "ab initio" methods, given that they all require a reference alignment?

ADD REPLY • link 13.4 years ago by lh3 33k

0

That is my answer below. We can evaluate trees instead of alignments. After all, MSA of protein sequences is mainly used for building trees. In addition, a few papers use secondary structure or protein domain overlapping to evaluate MSA, although I do not quite like it.

ADD REPLY • link 13.4 years ago by lh3 33k

0

And what is the "correct" tree? And alignments are not used only for trees.

ADD REPLY • link 13.4 years ago by Paulo Nuin ★ 3.7k

0

First, I did not say MSA is "only" used for trees. I know MSA can also be used for homology searches and so on. Second, you should read my answer below. I made it clear how to evaluate a gene tree without knowing the "correct" tree, although this method is more useful when applied to many gene trees. Why do we have to know the "correct" answer, which itself is questionable? There are ways to evaluate without knowing the truth.

ADD REPLY • link 13.4 years ago by lh3 33k

0

I read your answer below, and there's no "known phylogeny"; phylogenies are just estimations. And even if you measure the "minimum number of duplication events", there's no guarantee that you have the "correct" or "known phylogeny". Trees and alignments are just estimations. You can measure the alignment quality, or whether the tree makes sense in the light of statistics, but you will never be 100% sure in any case.

ADD REPLY • link 13.4 years ago by Paulo Nuin ★ 3.7k

0

I am a little surprised that you criticize my approach for not being "100%" accurate -- you cannot guarantee the correctness of an MSA, either. If it were me, I would say my method does not have enough power. If you had criticized it in that way, I would say that having an independent measurement is necessary when every MSA program is optimized on the same few benchmark data sets.

ADD REPLY • link 13.4 years ago by lh3 33k

score 1 · Answer 4 · 2010-11-16

If this is for protein MSA you could try the following structure-based methods that don't require a reference alignment (but obviously work only on data sets for which you have structures).

The Oxbench package (http://www.compbio.dundee.ac.uk/Software/Oxbench/oxbench.html) also includes a structure-based metric. Full disclosure: I work in the group that developed Oxbench.

T-Coffee (www.tcoffee.org) includes the iRMSD metric for assessing alignment quality.

APDB (http://www.ncbi.nlm.nih.gov/pubmed/17032685)

score 1 · Answer 5 · 2010-11-16

1

13.4 years ago

Rm 8.3k

If you have structures for the protein sequences, or if they are homologous to any PDB structures, you can try: "QUASAR—scoring and ranking of sequence–structure alignments"

PROMALS web server for accurate multiple protein sequence alignments.

ADD COMMENT • link 13.4 years ago by Rm 8.3k

0

+1 Thanks for the links

ADD REPLY • link 13.4 years ago by Aleksandr Levchuk 3.2k