Background
Earlier I asked a question on how to measure the "quality" of a multiple sequence alignment: Ab initio methods for inferring quality of Multiple Sequence Alignments
...there is also a duplicate: Multiple sequence alignment score
...and a different question about MSA similarity score: Similarity score of Multiple Sequence Alignment
So I have some tools to measure how "good" the multiple alignments are.
I can also make the MSAs "better" by using the scoring tools in the following way:
- Measure the score of the initial MSA.
- Remove one sequence that has not been tried yet from the MSA, re-align, and re-measure the score.
- If the score got worse, put the sequence back.
- If the removal of every sequence has been tried, STOP; otherwise repeat from the removal step with the next sequence.
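To make the loop above concrete, here is a minimal sketch of it. The names `align` and `score` are hypothetical placeholders for whichever aligner and scoring tool is actually used, and the `min_size` guard is an assumption added here so the loop cannot shrink the input to almost nothing; neither is part of the procedure as described.

```python
from typing import Callable, Dict

# Assumed representation: sequences and alignments are {name: residues}.
Sequences = Dict[str, str]


def greedy_filter(
    seqs: Sequences,
    align: Callable[[Sequences], Sequences],  # hypothetical: runs the aligner
    score: Callable[[Sequences], float],      # hypothetical: higher is better
    min_size: int = 3,                        # assumption: never go below this
) -> Sequences:
    """Try removing each sequence once; keep a removal unless the score drops."""
    kept = dict(seqs)
    best = score(align(kept))                 # score of the initial MSA
    for name in list(seqs):                   # every sequence is tried exactly once
        if len(kept) <= min_size:
            break
        trial = {k: v for k, v in kept.items() if k != name}
        trial_score = score(align(trial))     # re-align and re-measure
        if trial_score >= best:               # not worse: leave the sequence out
            kept, best = trial, trial_score
        # otherwise the sequence is "put back" simply by not updating `kept`
    return kept
```

This only makes sense if the score is comparable between alignments with different numbers of sequences, which is itself an assumption about the scoring tool.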
This should work because of the garbage-in-garbage-out nature of MSA. Hopefully, if I filter the input so that it is no longer garbage, I should get a "good" output even for distantly related genes.
The Question
How can I verify that the MSAs actually did get better?
I'm interested in both closely and distantly related groups of proteins.
Things that I tried
...tried to think about.
- Verifying against the tiny fraction of protein groups that are known to be related to each other based on evidence from 3D structure superposition.
- Building a phylogenetic tree from the MSA and verifying the tree against the known taxonomy of the species, using the simple rule (assumption) that most genes, unless they were horizontally transferred, should have the same ancestry as the whole organism.
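Both ideas above boil down to comparing against a trusted reference, so here is a rough sketch of what each comparison could look like. For the first, `sp_score` computes the fraction of residue pairs aligned in a reference alignment that are also aligned in the MSA being tested (a sum-of-pairs style score). For the second, `rf_distance` counts the splits that differ between the gene tree and the species tree (a Robinson-Foulds style distance on unrooted splits). The data representations, alignments as `{name: gapped_row}` and trees as nested tuples of leaf names, are assumptions chosen to keep the example self-contained, and leaf names are assumed to match between the two trees.

```python
from itertools import combinations
from typing import Dict, Set, Tuple, Union

Alignment = Dict[str, str]      # assumed: name -> gapped row, equal lengths
Tree = Union[str, tuple]        # assumed: leaf name or tuple of subtrees


def aligned_pairs(msa: Alignment, gap: str = "-") -> Set[Tuple[str, int, str, int]]:
    """All residue pairs sharing a column, as (seq_a, pos_a, seq_b, pos_b)."""
    names = sorted(msa)
    if not names:
        return set()
    counters = {n: 0 for n in names}          # ungapped position per sequence
    pairs = set()
    for col in range(len(msa[names[0]])):
        present = []
        for n in names:
            if msa[n][col] != gap:
                present.append((n, counters[n]))
                counters[n] += 1
        for (a, i), (b, j) in combinations(present, 2):
            pairs.add((a, i, b, j))
    return pairs


def sp_score(test: Alignment, reference: Alignment) -> float:
    """Fraction of reference residue pairs that the test MSA also aligns."""
    shared = set(test) & set(reference)       # only sequences present in both
    ref_pairs = aligned_pairs({n: reference[n] for n in shared})
    test_pairs = aligned_pairs({n: test[n] for n in shared})
    return len(ref_pairs & test_pairs) / len(ref_pairs) if ref_pairs else 0.0


def splits(tree: Tree) -> Set[frozenset]:
    """Non-trivial leaf bipartitions implied by the internal branches."""
    found: Set[frozenset] = set()

    def walk(node: Tree) -> frozenset:
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(walk(child) for child in node))

    all_leaves = walk(tree)

    def collect(node: Tree) -> frozenset:
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(collect(child) for child in node))
        other = all_leaves - clade
        if len(clade) > 1 and len(other) > 1:  # skip trivial splits
            found.add(frozenset([clade, other]))
        return clade

    collect(tree)
    return found


def rf_distance(gene_tree: Tree, species_tree: Tree) -> int:
    """Number of splits found in exactly one of the two trees."""
    return len(splits(gene_tree) ^ splits(species_tree))
```

An `sp_score` closer to 1 after filtering, or a smaller `rf_distance` to the species tree, would be evidence that the MSA improved, subject to how much the reference alignment or the taxonomy can be trusted for the gene family in question.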