Question: Comparing sequence alignments using phylogenetic likelihoods?
4.8 years ago by
db5340
United Kingdom
db5340 wrote:

Hello,

I am currently comparing different multiple sequence alignment (MSA) algorithms that I could use to generate an alignment. I have heard that phylogenetic likelihoods could be used to compare alignments produced by different tools. That is, one could generate several alignments, use phylogenetic software to obtain a maximum likelihood value for each, and choose the alignment yielding the best likelihood for downstream analyses.

I'm not sure if this would be valid, though. My thinking is that different tools would produce alignments of different lengths -- given each algorithm's differing propensity to insert gaps -- but surely alignments of different lengths are not comparable using likelihoods, given that the likelihood of an alignment is the product (sum for log-likelihood) of all of the site likelihoods? Can anyone confirm this or add their general thoughts?

4.8 years ago by
EMBL Heidelberg, Germany
Jean-Karim Heriche wrote:

Given a phylogenetic tree, you can calculate the probability of an alignment being generated by this tree. So if you have a model of how your sequences are related (i.e. a tree), you can compare the alignments.

"The probability of an alignment being generated by this tree" -- you are referring to the likelihood of the alignment with respect to the tree, right? I'm thinking that different alignments are not comparable using phylogenetic likelihoods because different alignment tools will produce alignments of different lengths. This is a problem because the alignment log-likelihood is calculated by summing the log-likelihoods at each site; longer alignments should therefore have more negative log-likelihoods, everything else being equal. I suppose you could try normalising the likelihood value by alignment length, but I'm not sure how valid this would be.
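
To make the length effect concrete, here is a minimal sketch rather than what any particular phylogenetics package does. It assumes only two sequences, the Jukes-Cantor (JC69) model, a fixed distance of 0.1 substitutions per site, and gaps treated as missing data. The function name and the toy sequences are made up for illustration.

```python
import math


def jc69_pair_log_likelihood(seq_a, seq_b, distance):
    """Log-likelihood of a pairwise alignment under JC69 at a fixed distance.

    Assumption: a gap ('-') is treated as missing data, so a column with a
    single residue contributes only that residue's equilibrium frequency (1/4).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    decay = math.exp(-4.0 * distance / 3.0)
    p_same = 0.25 + 0.75 * decay   # P(no net change along the branch)
    p_diff = 0.25 - 0.25 * decay   # P(change to one specific other base)
    total = 0.0
    for a, b in zip(seq_a, seq_b):
        if a == "-" and b == "-":
            continue               # all-gap column: no data, contributes 0
        if a == "-" or b == "-":
            site = 0.25
        elif a == b:
            site = 0.25 * p_same
        else:
            site = 0.25 * p_diff
        total += math.log(site)    # log-likelihood is a sum over columns
    return total


# Two hypothetical alignments of the same two sequences (ACGTACGT and ACGACGT).
aln_short = ("ACGTACGT", "ACG-ACGT")    # 8 columns, 1 gapped column
aln_long = ("ACGTACG-T", "ACG-AC-GT")   # 9 columns, 3 gapped columns

ll_short = jc69_pair_log_likelihood(*aln_short, distance=0.1)
ll_long = jc69_pair_log_likelihood(*aln_long, distance=0.1)

for name, aln, ll in (("short", aln_short, ll_short), ("long", aln_long, ll_long)):
    n_cols = len(aln[0])
    print(f"{name}: columns={n_cols} logL={ll:.3f} logL/column={ll / n_cols:.3f}")
```

In this toy case the longer alignment gets the lower log-likelihood, but the raw values alone do not say how much of that is down to the extra columns and how much to the alignment being worse. Dividing by the number of columns removes the raw length effect, though I don't know of a principled justification for comparing the normalised values.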

It is perfectly valid to compare the alignments given the hypothesis that the true alignment should come from a given tree. Alignments with more information will indeed have a lower log-likelihood under any model, but I think it depends on what you want to do with your alignment in the end. If you want to build a tree out of it, then you don't want to use a tree as prior knowledge. A possible alternative approach could be to evaluate the similarity of the alignments to a given reference sequence. Otherwise, without ground truth or prior knowledge of what the true alignment should look like, you have to use a data-driven approach, that is, use information from the alignments themselves. There are papers on this, e.g. Thompson et al. A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res. 1999 Jul 1;27(13):2682-90, and Thompson et al. Towards a reliable objective function for multiple sequence alignments. J Mol Biol. 2001 Dec 7;314(4):937-51.
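
As a rough illustration of comparing against a reference, here is a sketch of a sum-of-pairs style score. It assumes you have a trusted reference alignment of the same sequences (for example one derived from structures), which goes a step beyond a single reference sequence. The function names and toy alignments are hypothetical. The score is the fraction of residue pairs aligned in the reference that the test alignment also places in the same column, so it does not depend directly on the number of columns.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    Each residue is identified as (sequence index, ungapped residue index),
    so the same residue gets the same label in any alignment of the sequences.
    """
    counters = [0] * len(alignment)
    pairs = set()
    for column in zip(*alignment):
        present = []
        for seq_index, char in enumerate(column):
            if char != "-":
                present.append((seq_index, counters[seq_index]))
                counters[seq_index] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs reproduced by the test alignment."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        raise ValueError("reference alignment has no aligned residue pairs")
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)


# Hypothetical reference and test alignments of the same two sequences.
reference = ["ACGTACGT", "ACG-ACGT"]
test = ["ACGTACG-T", "ACG-AC-GT"]

score = sum_of_pairs_score(test, reference)
print(f"sum-of-pairs score: {score:.3f}")
```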

Thank you for your reply. Indeed, I am going to build a tree, so I can see now that using a tree to guide selection of the alignment would be a problem. Speaking academically, though, my point about lower likelihood values is that it stops alignments of different lengths from being comparable -- if a shorter alignment has a higher likelihood than a longer alignment, is this because the former is better aligned or because it is shorter? So how would I know which alignment to select going forward?

Thanks for the links. I'm currently looking into norMD as an option. There are also plenty of structures available for the proteins that I am looking at, so those could guide assessment of the "best" alignment. Cheers.