How to interpret the log-likelihood values of phyml results?
2.1 years ago
changxu.fan ▴ 70

Dear community,

I ran phyml on a gene family to build a tree. Looking at the results, I'm a bit worried about the log-likelihood value: it's -754, which means the likelihood is almost zero! Does this mean that the program has little confidence in the estimated parameters or the tree topology? I was wondering if I'm understanding this incorrectly.

Thank you so much!!


 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
                                  ---  PhyML 3.3.20190909  ---                                             
                              http://www.atgc-montpellier.fr/phyml                                          
                             Copyright CNRS - Universite Montpellier                                 
 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo

. Sequence filename:            exon3_wb_aligned_phy
. Data set:                 #1
. Initial tree:             BioNJ
. Model of nucleotides substitution:    GTR
. Number of taxa:           52
. Log-likelihood:           -754.13849
. Unconstrained log-likelihood:     -348.41766
. Composite log-likelihood:         -6438.01266
. Parsimony:                119
. Tree size:                1.35199
. Discrete gamma model:         Yes
  - Number of classes:          4
  - Gamma shape parameter:      1.901
  - Relative rate in class 1:       0.28116 [freq=0.250000]         
  - Relative rate in class 2:       0.64406 [freq=0.250000]         
  - Relative rate in class 3:       1.06730 [freq=0.250000]         
  - Relative rate in class 4:       2.00748 [freq=0.250000]         
. Nucleotides frequencies:
  - f(A)=  0.37232
  - f(C)=  0.24092
  - f(G)=  0.17327
  - f(T)=  0.21350
. GTR relative rate parameters :
  A <-> C    0.82212
  A <-> G    1.82689
  A <-> T    0.53724
  C <-> G    0.17829
  C <-> T    2.00016
  G <-> T    1.00000
. Instantaneous rate matrix : 
  [A---------C---------G---------T------]
  -0.82453   0.25951   0.41474   0.15028  
   0.40104  -1.00102   0.04048   0.55950  
   0.89119   0.05628  -1.22720   0.27973  
   0.26208   0.63136   0.22702  -1.12046  


. Run ID:               none
. Random seed:              1625516914
. Subtree patterns aliasing:        no
. Version:              3.3.20190909
. Time used:                0h0m4s (4 seconds)

 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
 Suggested citations:
 S. Guindon, JF. Dufayard, V. Lefort, M. Anisimova, W. Hordijk, O. Gascuel
 "New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0."
 Systematic Biology. 2010. 59(3):307-321.

 S. Guindon & O. Gascuel
 "A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood"
 Systematic Biology. 2003. 52(5):696-704.
 oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
2.1 years ago
Mensur Dlakic ★ 27k

The principle of maximum likelihood is to choose the tree that makes the data most probable. Because these probabilities are usually tiny, especially for large datasets, we express them as ln(P), which is the log-likelihood (LL). LL is a negative number (the log function is negative in the 0-1 range), and the best it can be is 0 (when the probability is 1).
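To see why the log scale is used, here is a minimal Python sketch. The LL value is the one from the PhyML output in the question; the per-site likelihoods in the second half are made-up toy numbers, not values from this alignment.

```python
import math

ll = -754.13849  # log-likelihood reported by PhyML in the question

# The raw likelihood is too small for a 64-bit float and underflows to zero.
likelihood = math.exp(ll)

# Dividing by ln(10) gives the order of magnitude in base 10 instead.
log10_likelihood = ll / math.log(10)

# Toy per-site likelihoods (hypothetical values): multiplying them
# underflows, while summing their logs stays perfectly representable.
site_likelihoods = [1e-3] * 200
product = math.prod(site_likelihoods)
sum_of_logs = sum(math.log(p) for p in site_likelihoods)

print("exp(LL):", likelihood)
print("log10 likelihood:", round(log10_likelihood, 2))
print("product of site likelihoods:", product)
print("sum of site log-likelihoods:", round(sum_of_logs, 2))
```

So a likelihood that is "almost zero" is expected whenever many sites are multiplied together; it says nothing by itself about confidence in the tree.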

It seems you have a smallish tree, and the LL you obtained is appropriate. As of this writing I am monitoring an ongoing large tree that has LL=-978001.783, so you are golden. LL values are comparable only for the same dataset (alignment), not between different datasets. For a given dataset, a higher LL (closer to 0) is better.


Thanks for the reply! May I follow up with a question? Since the LL value is dataset dependent, for a particular dataset, how do I know that my LL is good enough?

Thanks again!!


As with any other method that only partially explores the space of possible trees, you can never be sure that your LL is the best it can be. After all, there is no target number that is known ahead of time.

One way around it is to run multiple tree reconstructions on bootstrap replicates of the alignment (at least 100, and 1000 is even better), and to calculate bootstrap support for each tree branch. Opinions vary, but most people would probably agree that branches with more than 70-80% bootstrap support are reliable.
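For illustration, the bootstrap idea can be sketched in Python as follows. This is not how PhyML implements it internally; the toy alignment, the taxon names, and the helper names `bootstrap_alignment` and `support` are all hypothetical, and the tree-building step for each replicate (which PhyML would do) is left out.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Build one bootstrap replicate by resampling columns with replacement."""
    n_sites = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}

def support(replicate_splits, split):
    """Percentage of replicate trees that contain the given split (branch)."""
    hits = sum(1 for splits in replicate_splits if split in splits)
    return 100.0 * hits / len(replicate_splits)

# Toy alignment (hypothetical data): taxon name -> aligned sequence.
alignment = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTTC",
    "C": "ACCTAGGTAC",
    "D": "TCCTAGGTAC",
}
rep = bootstrap_alignment(alignment, random.Random(42))

# Pretend a tree was built from each of four replicates; each tree is
# summarised as the set of taxon groupings (splits) it contains.
replicate_splits = [
    {frozenset({"A", "B"})},
    {frozenset({"A", "B"})},
    {frozenset({"A", "C"})},
    {frozenset({"A", "B"})},
]
support_ab = support(replicate_splits, frozenset({"A", "B"}))
print(rep)
print("support for A+B:", support_ab)
```

With real data you would use hundreds of replicates rather than four, and the support value is attached to the corresponding branch of the tree built from the original alignment.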

Yet another way is to do a Bayesian analysis, which runs at least two independent tree reconstructions for a very large number of sampling generations (at least a million, but more is better). If they independently converge to a similar LL value, that would support the idea that the resulting tree is close to a global maximum of LL. There is a quantity called standard deviation of split frequencies (SDSF) that tells you how well the tree reconstructions match each other. SDSF converges to 0 when tree reconstructions are identical, but for practical purposes SDSF < 0.01 is accepted as a sign of convergence. Independently, Bayesian methods will assign a posterior probability (0-1 scale) to each branch, and its meaning is comparable to bootstrap support in ML methods.
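As a rough sketch of what SDSF measures, the Python below averages the standard deviation of each split's frequency across runs. The split labels and frequencies are made up, the function name `average_sdsf` is hypothetical, and two details are assumptions rather than a description of any particular program: the use of the sample standard deviation, and ignoring splits whose frequency is below 0.1 in every run.

```python
import statistics

def average_sdsf(run_freqs, min_freq=0.1):
    """Average standard deviation of split frequencies across runs.

    run_freqs: list of dicts, one per run, mapping split -> frequency (0-1).
    Splits rarer than min_freq in every run are skipped (an assumption).
    """
    splits = set().union(*run_freqs)
    sds = []
    for split in splits:
        freqs = [run.get(split, 0.0) for run in run_freqs]
        if max(freqs) >= min_freq:
            sds.append(statistics.stdev(freqs))
    return sum(sds) / len(sds) if sds else 0.0

# Hypothetical split frequencies from two independent runs.
run1 = {"AB|CD": 0.98, "AC|BD": 0.02}
run2 = {"AB|CD": 0.96, "AC|BD": 0.04}

sdsf_same = average_sdsf([run1, run1])
sdsf_diff = average_sdsf([run1, run2])
print("identical runs:", sdsf_same)
print("slightly different runs:", round(sdsf_diff, 4))
```

Identical runs give 0, and the value grows as the runs disagree about how often each grouping appears, which is why a small threshold such as 0.01 is used as a convergence check.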
