Question: Ab Initio Unsupervised Methods For Inferring Quality Of Multiple Sequence Alignments
6
Aleksandr Levchuk (3.2k), United States, wrote 9.1 years ago:

MSA programs make mistakes. Even if no mistakes are made and the output MSA is "optimal", we could still be looking at something biologically meaningless (e.g. the input contains divergent sequences: the evolutionary relationship between the input sequences has been completely wiped out by mutations because those genes were under different evolutionary pressures).

There seem to be 10 or so publications on automated MSA quality scoring.

EDIT (6 months later): Looks like I was wrong: there are not ~10 but ~4 unsupervised methods out there: norMD, Gblocks, HoT, and GUIDANCE. There is also PSAR, which claims to be better than GUIDANCE, but its authors only tested it on DNA sequences. And only a few MSA aligners report site-specific confidence: SOAP, T-COFFEE, FSA.

Does anyone have experience on how these methods perform in practice?

quality alignment multiple • 2.7k views
modified 8.3 years ago by Casey Bergman (18k) • written 9.1 years ago by Aleksandr Levchuk (3.2k)

A simpler version of this question: Which Msa Scoring Methods Did You Use?

modified 3 months ago by RamRS (25k) • written 8.3 years ago by Aleksandr Levchuk (3.2k)

A norMD question: Anyone Use Normd As A Quality Control For Msas

modified 3 months ago by RamRS (25k) • written 8.3 years ago by Aleksandr Levchuk (3.2k)
4
lh3 (31k), United States, wrote 9.1 years ago:

Sorry for my bad English, but I am not sure what you are asking. Are you asking "how can I compare MSA programs without knowing the reference alignment?", or asking "what MSA program should I use in practice?", or asking "how many methods are available to evaluate MSA?" To the third question, everyone is using pretty much the same set of measurements. To the second question, this page may help a little.

To the first question, if it were me, I would evaluate the resulting phylogenetic tree instead of the alignment. Given a set of genes from species with a known phylogeny, we can perform MSA with different programs, reconstruct trees with ML/phyml and then evaluate the accuracy of gene trees by counting the minimum number of duplication events that explain the gene evolution. This measurement is powerful, although some argue that the best MSA does not necessarily lead to the best tree.
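One way to make the duplication count concrete is LCA reconciliation: map every gene-tree node to the lowest species-tree node that covers its species, and call the node a duplication when it maps to the same place as one of its children. The sketch below is only an illustration under assumptions not stated in this answer: trees are nested Python tuples, gene-tree leaves are labelled with species names, and the function names are made up for the example. It is not necessarily the procedure used in practice here.

```python
def species_clusters(tree):
    """Return the leaf set (cluster) of every node in a nested-tuple species tree."""
    clusters = []

    def walk(node):
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*(walk(child) for child in node))
        clusters.append(leaves)
        return leaves

    walk(tree)
    return clusters


def lca(clusters, species):
    """Smallest species-tree cluster that contains all the given species."""
    return min((c for c in clusters if species <= c), key=len)


def count_duplications(gene_tree, species_tree):
    """Count gene-tree nodes inferred as duplications by LCA mapping.

    Assumption: gene-tree leaves are species names (several genes per
    species are allowed), and every species occurs in the species tree.
    """
    clusters = species_clusters(species_tree)
    dups = 0

    def walk(node):
        nonlocal dups
        if isinstance(node, str):
            leaf = frozenset([node])
            return leaf, leaf  # (species below this node, species-tree mapping)
        results = [walk(child) for child in node]
        species = frozenset().union(*(s for s, _ in results))
        mapped = lca(clusters, species)
        # A node mapping to the same species-tree node as a child is a duplication.
        if any(child_map == mapped for _, child_map in results):
            dups += 1
        return species, mapped

    walk(gene_tree)
    return dups


species_tree = (("human", "mouse"), "chicken")
print(count_duplications((("human", "mouse"), "chicken"), species_tree))  # 0
print(count_duplications((("human", "chicken"), "mouse"), species_tree))  # 1
```

A gene tree that agrees with the species tree needs no duplications, while a misplaced leaf forces one, so across many gene families a lower total suggests alignments that yield more plausible trees.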

EDIT: as I am now clearer about your question...

I know MSA is used for homology searches and other purposes, but my experience mainly comes from building gene trees (not species trees). I think gene trees are more informative than MSA. On a gene tree, a lot of weirdness (e.g. pseudogenes and distant paralogs) stands out which you cannot easily see from MSA.

BTW, for my purpose (gene tree building) I do not care too much about the theoretical accuracy of these MSA programs. I only choose conservative columns and sometimes try a clustalx heuristic to screen out obviously misaligned segments.
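As a rough illustration of picking out well-conserved columns, here is a minimal sketch. The thresholds (`min_identity`, `max_gap_fraction`) and function names are hypothetical choices for the example, not values from this answer, and it does not reproduce the clustalx heuristic.

```python
from collections import Counter


def conserved_columns(alignment, min_identity=0.5, max_gap_fraction=0.2):
    """Return indices of columns passing simple conservation thresholds.

    `alignment` is a list of equal-length strings with '-' for gaps.
    Both thresholds are illustrative assumptions.
    """
    n_seqs = len(alignment)
    keep = []
    for i, column in enumerate(zip(*alignment)):
        residues = [c for c in column if c != "-"]
        gap_fraction = 1 - len(residues) / n_seqs
        if not residues or gap_fraction > max_gap_fraction:
            continue
        # Frequency of the most common residue, relative to all sequences.
        top_count = Counter(residues).most_common(1)[0][1]
        if top_count / n_seqs >= min_identity:
            keep.append(i)
    return keep


def filter_alignment(alignment, columns):
    """Keep only the selected columns in every row."""
    return ["".join(row[i] for i in columns) for row in alignment]


msa = ["MKV-LA", "MKVALA", "MRV-LG", "MKI-LA"]
cols = conserved_columns(msa)
print(cols)                         # [0, 1, 2, 4, 5]
print(filter_alignment(msa, cols))  # ['MKVLA', 'MKVLA', 'MRVLG', 'MKILA']
```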

modified 9.1 years ago • written 9.1 years ago by lh3 (31k)

+1 Your English is perfect. Those are all important questions, but I'm wondering about a more general question of quality. I added an intro paragraph to my question; hopefully it will better describe what I mean.

written 9.1 years ago by Aleksandr Levchuk (3.2k)
2
Casey Bergman (18k), Athens, GA, USA, wrote 9.1 years ago:

Landan and Graur (2008) propose a simple reliability measure, based on the discrepancy between the alignments produced from a set of sequences in the forward and reverse directions, that you might want to consider:

"The proposed methodology is based upon the a priori expectation that sequence alignment results should be independent of the orientation of the input sequences. Thus, for totally unambiguous cases, reversing residue order prior to alignment should yield an exact reversed alignment of that obtained by using the unreversed sequences. Such “ideal” alignments, however, are the exception in real life settings, and the two alignments, which we term the heads and tails alignments, are usually different to a greater or lesser degree. The degree of agreement or discrepancy between these two alignments may be used to assess the reliability of the sequence alignment. Furthermore, any alignment dependent sequence analysis protocol can be carried out separately for each of the two alignments, and the two sets of results may be compared with each other, providing us with valuable information regarding the robustness of the whole analytical process. The heads-or-tails (HoT) methodology can be easily implemented for any choice of alignment method and for any subsequent analytical protocol."

Some further thoughts on the HoT method can be found on Thomas Mailund's blog.
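To give a feel for the comparison step, here is a minimal sketch that scores agreement between a heads and a tails alignment as the fraction of aligned residue pairs they share. The pair-based measure and the function names are assumptions made for illustration, not necessarily the measures defined in the paper. The tails alignment itself has to come from running your aligner on the reversed sequences.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Set of residue pairs placed in the same column.

    Each residue is identified by (sequence index, ungapped position).
    """
    counters = [0] * len(alignment)
    pairs = set()
    for column in zip(*alignment):
        residues = []
        for seq_idx, char in enumerate(column):
            if char != "-":
                residues.append((seq_idx, counters[seq_idx]))
                counters[seq_idx] += 1
        pairs.update(combinations(residues, 2))
    return pairs


def hot_agreement(heads, tails):
    """Fraction of heads residue pairs also present in the tails alignment.

    `tails` is the aligner output for the reversed sequences; each row is
    reversed back here so residue positions are comparable.
    """
    restored = [row[::-1] for row in tails]
    heads_pairs = aligned_pairs(heads)
    if not heads_pairs:
        return 1.0
    return len(heads_pairs & aligned_pairs(restored)) / len(heads_pairs)


heads = ["AC-G", "ACTG"]
print(hot_agreement(heads, ["G-CA", "GTCA"]))  # 1.0 (exact mirror image)
print(hot_agreement(heads, ["GC-A", "GTCA"]))  # 2 of 3 pairs shared
```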

written 9.1 years ago by Casey Bergman (18k)

+1 Sounds very interesting.

written 9.1 years ago by lh3 (31k)

+1 It's elegant but does not go very deep into disturbing the MSA program. The new method GUIDANCE was found to be much more effective than HoT. The authors describe it in a new paper (An Alignment Confidence Score Capturing Robustness to Guide Tree Uncertainty) where they perturb the guide tree to reveal the MSA program's uncertainty even more. They also developed a web service for this (http://guidance.tau.ac.il).

written 9.1 years ago by Aleksandr Levchuk (3.2k)
1
Paulo Nuin (3.7k), Canada, wrote 9.1 years ago:

I would use Qscore, from the developer of Muscle. I used some quality metrics in my 2006 paper, where we used some lab-developed software to calculate the scores. But I guess Qscore uses a similar metric.

written 9.1 years ago by Paulo Nuin (3.7k)
1

Not exactly, but how do you measure quality without the "correct" alignment?

written 9.1 years ago by Paulo Nuin (3.7k)

+1 This is interesting...

written 9.1 years ago by Aleksandr Levchuk (3.2k)

+1 I will try Qscore, but not sure if we have good "reference" alignments

written 9.1 years ago by Aleksandr Levchuk (3.2k)

That's one problem of any quality score for alignments. Depending on the type of sequences you are aligning you might need to rely on the best software available and then check by eye.

written 9.1 years ago by Paulo Nuin (3.7k)

Qscore, SP and TC are pretty much what everyone is using. But are these "ab initio" methods, given that they all require a reference alignment?
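To show why these scores depend on a reference, here is a minimal sketch of SP (fraction of reference residue pairs reproduced) and TC (fraction of reference columns reproduced exactly) as they are commonly defined. The function names are made up for the example, the choice to count only reference columns with at least two residues in TC is an assumption, and this is not the Qscore implementation.

```python
from itertools import combinations


def columns(alignment):
    """Describe each column as a tuple of ungapped positions (None for a gap)."""
    counters = [0] * len(alignment)
    result = []
    for column in zip(*alignment):
        signature = []
        for seq_idx, char in enumerate(column):
            if char == "-":
                signature.append(None)
            else:
                signature.append(counters[seq_idx])
                counters[seq_idx] += 1
        result.append(tuple(signature))
    return result


def residue_pairs(cols):
    """All pairs of residues that share a column."""
    found = set()
    for col in cols:
        residues = [(i, pos) for i, pos in enumerate(col) if pos is not None]
        found.update(combinations(residues, 2))
    return found


def sp_score(test, reference):
    """Fraction of reference residue pairs that the test alignment reproduces."""
    ref_pairs = residue_pairs(columns(reference))
    return len(ref_pairs & residue_pairs(columns(test))) / len(ref_pairs)


def tc_score(test, reference):
    """Fraction of reference columns (with >= 2 residues) found intact in the test."""
    ref_cols = [c for c in columns(reference)
                if sum(p is not None for p in c) >= 2]
    test_cols = set(columns(test))
    return sum(c in test_cols for c in ref_cols) / len(ref_cols)


reference = ["AC-GT", "ACTGT", "A-TGT"]
test = ["A-CGT", "ACTGT", "A-TGT"]
print(sp_score(test, reference))  # 10 of 11 reference pairs recovered
print(tc_score(test, reference))  # 3 of 5 reference columns recovered
```

Both functions take the reference as an input, which is exactly the dependency the question is trying to avoid.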

written 9.1 years ago by lh3 (31k)

That is my answer below. We can evaluate trees instead of alignments. After all, MSA of protein sequences is mainly used for building trees. In addition, a few papers use secondary structure or protein domain overlap to evaluate MSA, although I do not quite like that approach.

written 9.1 years ago by lh3 (31k)

And what is the "correct" tree? And alignments are not used only for trees.

written 9.1 years ago by Paulo Nuin (3.7k)

First, I did not say MSA is "only" used for trees. I know MSA can also be used for homology searches and so on. Second, you should read my answer below. I made it clear how to evaluate a gene tree without knowing the "correct" tree, although this method is more useful when applied to many gene trees. Why do we have to know the "correct" answer, which itself is questionable? There are ways to evaluate without knowing the truth.

written 9.1 years ago by lh3 (31k)

I read your answer below, and there's no "known phylogeny"; phylogenies are just estimations. And even if you measure the "minimum number of duplication events", there's no guarantee that you have the "correct" or "known phylogeny". Trees and alignments are just estimations: you can measure the alignment quality, or whether the tree makes sense in the light of statistics, but you will never be 100% sure in any case.

written 9.1 years ago by Paulo Nuin (3.7k)

I am a little surprised that you criticize my approach for not being "100%" accurate: you cannot guarantee the correctness of MSA, either. If it were me, I would say my method does not have enough power. If you criticized it in that way, I would say that having an independent measurement is necessary when every MSA program is optimized on the same few benchmark data sets.

written 9.1 years ago by lh3 (31k)
1
Tom Walsh (550), United Kingdom, wrote 9.1 years ago:

If this is for protein MSA you could try the following structure-based methods that don't require a reference alignment (but obviously work only on data sets for which you have structures).

The Oxbench package (http://www.compbio.dundee.ac.uk/Software/Oxbench/oxbench.html) also includes a structure-based metric. Full disclosure: I work in the group that developed Oxbench.

T-Coffee (www.tcoffee.org) includes the iRMSD metric for assessing alignment quality.

APDB (http://www.ncbi.nlm.nih.gov/pubmed/17032685)

written 9.1 years ago by Tom Walsh (550)
1
Rm (7.9k), Danville, PA, wrote 9.1 years ago:

If you have structures for the protein sequences, or if they are homologous to any PDB structures, you can try: "QUASAR—scoring and ranking of sequence–structure alignments"

PROMALS web server for accurate multiple protein sequence alignments.

written 9.1 years ago by Rm (7.9k)

+1 Thanks for the links

written 9.1 years ago by Aleksandr Levchuk (3.2k)