Comparing sequence alignments using phylogenetic likelihoods?
8.9 years ago
db534 ▴ 10

Hello,

I am currently considering different multiple sequence alignment (MSA) algorithms that I could use to generate an alignment. I have heard that phylogenetic likelihoods can be used to compare alignments produced by different tools. That is, one could generate several alignments, use phylogenetic software to obtain a maximum likelihood value for each, and choose the alignment yielding the best likelihood for downstream analyses.

I'm not sure this would be valid, though. Different tools will produce alignments of different lengths, given each algorithm's differing propensity to insert gaps. Surely alignments of different lengths are not comparable using likelihoods, given that the likelihood of an alignment is the product (sum, for the log-likelihood) of all of the site likelihoods? Can anyone confirm this or add their general thoughts?

alignment • 1.8k views
8.9 years ago

Given a phylogenetic tree, you can calculate the probability of an alignment being generated by this tree. So if you have a model of how your sequences are related (i.e. a tree), you can compare the alignments.
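To make "the probability of an alignment being generated by this tree" concrete, here is a minimal Python sketch of Felsenstein's pruning algorithm. It rests on several assumptions that are not part of the answer above: the Jukes-Cantor (JC69) substitution model, a fixed toy tree (`example_tree`) with made-up branch lengths, independent sites, and gaps treated as missing data. All names are hypothetical and this is an illustration, not how any particular phylogenetics package implements it.

```python
import math

BASES = "ACGT"


def jc69_prob(same: bool, t: float) -> float:
    """JC69 probability of ending in the same / a specific different base
    after a branch of length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e


def partial(node, column):
    """Conditional likelihoods of the data below `node`, one per base.

    A leaf is a taxon name (str); an internal node is a tuple of
    (child, branch_length) pairs. Any character outside ACGT
    (for example '-') is treated as missing data.
    """
    if isinstance(node, str):
        ch = column[node]
        if ch not in BASES:
            return [1.0] * 4
        return [1.0 if b == ch else 0.0 for b in BASES]
    result = [1.0] * 4
    for child, t in node:
        below = partial(child, column)
        for i in range(4):
            result[i] *= sum(jc69_prob(i == j, t) * below[j] for j in range(4))
    return result


def log_likelihood(tree, alignment):
    """Sum of per-column log-likelihoods of `alignment` given `tree`."""
    names = list(alignment)
    n_cols = len(alignment[names[0]])
    total = 0.0
    for k in range(n_cols):
        column = {name: alignment[name][k] for name in names}
        # JC69 has uniform base frequencies (0.25 each) at the root.
        total += math.log(sum(0.25 * v for v in partial(tree, column)))
    return total


# Hypothetical three-taxon tree: ((a:0.1, b:0.1):0.05, c:0.2)
example_tree = (((("a", 0.1), ("b", 0.1)), 0.05), ("c", 0.2))
example_alignment = {"a": "ACGT", "b": "ACGT", "c": "ACGA"}
print(log_likelihood(example_tree, example_alignment))
```

Because the tree and branch lengths are held fixed here, the number only measures how well a given alignment fits that one hypothesis; real software would also optimise branch lengths and model parameters.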


The probability of an alignment being generated by this tree -- you are referring to the likelihood of the alignment with respect to the tree, right? I'm thinking that different alignments are not comparable using phylogenetic likelihoods because different alignment tools will produce alignments of different lengths. This is a problem because the alignment log-likelihood is calculated by summing the log-likelihoods at each site; longer alignments should therefore have more negative log-likelihoods, everything else being equal. I suppose you could try normalising the likelihood value by alignment length, but I'm not sure how valid this would be.
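The length effect described here can be seen in a small Python sketch. It assumes a pairwise JC69 model at a fixed, arbitrary distance `t`, treats gaps as missing data, and uses two made-up alignments of the same two sequences; the function names are hypothetical. The per-column value is printed only to show what the normalisation would look like, not to suggest it is a valid statistic.

```python
import math


def site_loglik(x: str, y: str, t: float) -> float:
    """Log-likelihood of one column of a pairwise alignment under JC69."""
    if x == "-" and y == "-":
        return 0.0             # nothing observed, likelihood 1
    if x == "-" or y == "-":
        return math.log(0.25)  # one base observed, the other summed out
    e = math.exp(-4.0 * t / 3.0)
    p = 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e
    return math.log(0.25 * p)


def alignment_loglik(row1: str, row2: str, t: float = 0.1) -> float:
    """Total log-likelihood: the sum over columns of site_loglik."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    return sum(site_loglik(x, y, t) for x, y in zip(row1, row2))


def per_column_loglik(row1: str, row2: str, t: float = 0.1) -> float:
    """Naive length normalisation: total log-likelihood per column."""
    return alignment_loglik(row1, row2, t) / len(row1)


# The same two sequences (ACGTAC and ACTAC) aligned in two ways.
short = ("ACGTAC", "AC-TAC")    # 6 columns
long_ = ("ACGT-AC", "AC-TA-C")  # 7 columns, one match split into two gaps

for label, (r1, r2) in (("short", short), ("long", long_)):
    print(label, len(r1), alignment_loglik(r1, r2), per_column_loglik(r1, r2))
```

Every column containing at least one residue contributes a negative term, so under these assumptions adding columns can only push the total down, which is the confounding between length and fit raised above.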


It is perfectly valid to compare the alignments under the hypothesis that the true alignment should come from a given tree. Alignments with more information will indeed have a lower log-likelihood under any model, but I think it depends on what you want to do with your alignment in the end. If you want to build a tree from it, then you don't want to use a tree as prior knowledge. A possible alternative approach is to evaluate the similarity of the alignments to a given reference sequence. Otherwise, without ground truth or prior knowledge of what the true alignment should look like, you have to use a data-driven approach, that is, use information from the alignments themselves. There are papers on this, e.g. Thompson et al. A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res. 1999 Jul 1;27(13):2682-90 and Thompson et al. Towards a reliable objective function for multiple sequence alignments. J Mol Biol. 2001 Dec 7;314(4):937-51.
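One way to score similarity to a reference is a sum-of-pairs score: the fraction of residue pairs aligned in the reference that are also aligned in the candidate. The Python sketch below assumes the reference is a trusted alignment of the same sequences (for example one derived from structures), which goes slightly beyond the "reference sequence" wording above. The function names and the toy alignments are hypothetical.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Set of residue pairs that share a column.

    Each residue is identified as (sequence_name, index), where index
    counts ungapped residues, so the pairs do not depend on column numbers.
    """
    names = sorted(alignment)
    position = {name: 0 for name in names}
    pairs = set()
    n_cols = len(alignment[names[0]])
    for k in range(n_cols):
        present = []
        for name in names:
            if alignment[name][k] != "-":
                present.append((name, position[name]))
                position[name] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_score(candidate, reference):
    """Fraction of reference residue pairs reproduced by the candidate."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        raise ValueError("reference alignment has no aligned residue pairs")
    return len(ref_pairs & aligned_pairs(candidate)) / len(ref_pairs)


reference = {"a": "ACGTAC", "b": "AC-TAC"}
candidate = {"a": "ACGT-AC", "b": "AC-TA-C"}
print(sum_of_pairs_score(candidate, reference))
```

Because residues are indexed by their position in the ungapped sequence, candidates of different lengths are scored on the same scale, from 0 (no reference pair recovered) to 1 (all recovered). In this toy case the candidate recovers four of the five reference pairs.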


Thank you for the reply. Indeed, I am going to build a tree, so I can see now that using a tree to guide selection of the alignment would be a problem. Speaking academically, though, my point about lower likelihood values is that they stop alignments of different lengths being comparable -- if a shorter alignment has a higher likelihood than a longer alignment, is this because the former is better aligned or because it is shorter? So how would I know which alignment to select going forward?

Thanks for the links. I'm currently looking into norMD as an option. There are also plenty of structures available for the proteins that I am looking at, so those could guide assessment of the "best" alignment. Cheers.
