Question: UPGMA method when tree leaves have different known ages
22 months ago by
BlastedBadger70 wrote:


I have gene trees with molecular distances. These trees contain speciation nodes, whose absolute age is known, and gene duplication nodes, whose absolute age is unknown. The trees are far from ultrametric (that is, the root-to-leaf distance differs from leaf to leaf).

I would like to estimate the approximate age of the duplication nodes. I started with a crude method that resembles UPGMA (actually, it's closer to WPGMA, but the idea is similar): I start from a speciation node and walk down the tree; each time I find a duplication node, I average the lengths of its descendant branches. I can then scale the resulting ultrametric distance using the absolute ages of the speciations.
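To make the procedure above concrete, here is a minimal Python sketch of one possible reading of it, not the poster's actual code. The names `Node`, `mean_depth`, `date_duplications` and `base_age` are hypothetical. It assumes every leaf carries a known age, and that all calibrated nodes below the starting speciation share the same age `base_age` (the simple case, with no missing speciations).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    dist: float = 0.0             # molecular branch length to the parent
    age: Optional[float] = None   # known absolute age; None for duplications
    children: List["Node"] = field(default_factory=list)


def mean_depth(node):
    """WPGMA-like depth: average molecular distance from `node` down to
    the nearest calibrated descendants (nodes whose age is known)."""
    total = 0.0
    for child in node.children:
        below = 0.0 if child.age is not None else mean_depth(child)
        total += child.dist + below
    return total / len(node.children)


def date_duplications(top, base_age=0.0):
    """Return {name: estimated age} for the uncalibrated nodes below `top`.

    Assumption: every calibrated descendant of `top` has age `base_age`.
    """
    total = mean_depth(top)        # averaged depth of the calibrated start node
    span = top.age - base_age      # absolute time covered by that depth
    ages = {}
    stack = list(top.children)
    while stack:
        node = stack.pop()
        if node.age is None:       # duplication: scale its averaged depth
            ages[node.name] = base_age + span * mean_depth(node) / total
            stack.extend(node.children)
    return ages


# Toy tree: speciation S (age 20) above duplication D and leaf C.
dup = Node("D", dist=0.1, children=[
    Node("A", dist=0.2, age=0.0),
    Node("B", dist=0.4, age=0.0),
])
top = Node("S", age=20.0, children=[dup, Node("C", dist=0.5, age=0.0)])
ages = date_duplications(top)
print(ages)  # D comes out at roughly 13.3
```

Note that nothing in this sketch forces a duplication to come out younger than its parent: with a fixed topology and very unequal branch lengths, the averaged depth of a child can exceed that of its parent.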

However, some speciations are also missing from these trees, so when I reach a duplication node, one descendant may be a speciation at time t1 while the other is a speciation at a different time t2 ≠ t1.

I will make a little sketch to illustrate this:

  • if all speciation nodes were present, I would try to reconstruct an ultrametric tree that looks like this:

             |------ S1
        |    |------ S1
        |----------- S1
  • but because some speciation nodes are missing, I want to reconstruct a tree that looks like this:

             |------ S1
        |    |------ S1
        |------------------- S2

Is there a simple adaptation I can apply to my method to take this into account?
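As an illustration only, one conceivable adaptation is to scale each descendant path separately against its own calibration age and then average the resulting age estimates, rather than averaging the raw branch lengths. The sketch below is an assumption, not a method taken from this thread or from the literature; the function `duplication_age` and its parameters are hypothetical, and it assumes the duplication sits directly below a speciation of known age with at least one nonzero branch on each path.

```python
def duplication_age(parent_age, up_dist, children):
    """Estimate the age of a duplication node.

    parent_age: known age of the speciation directly above the duplication
    up_dist:    molecular branch length from the duplication up to that parent
    children:   list of (dist, age) pairs, one per calibrated descendant,
                where `age` may differ between descendants (t1 != t2)
    """
    estimates = []
    for dist, age in children:
        # Fraction of the parent-to-descendant path lying below the duplication.
        frac = dist / (up_dist + dist)
        # Place the duplication at that fraction of the time interval
        # between this descendant's age and the parent's age.
        estimates.append(age + (parent_age - age) * frac)
    return sum(estimates) / len(estimates)


# One descendant is a speciation at age 10, the other a leaf at age 0.
estimate = duplication_age(parent_age=30.0, up_dist=0.1,
                           children=[(0.3, 10.0), (0.5, 0.0)])
print(estimate)  # about 25: both paths give about the same estimate here
```

Each path implicitly uses its own local rate, so this stays crude; it only removes the requirement that all descendants share one age.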

I know there are sophisticated methods to infer divergence times from molecular data, but some of them are parametric (with varying rates of evolution and likelihood inference) and seem too computationally intensive for the number of trees I have to process. I am currently reading Sanderson 1997 (a nonparametric method) and Sanderson 2002 (a semi-parametric method) to see whether I could apply these, but for now I'd prefer to start simple and fast. That said, I would be happy to hear suggestions of state-of-the-art methods or reviews comparing multiple methods :)

modified 22 months ago by Michael Dondrup46k • written 22 months ago by BlastedBadger70

How many time-trees do you have to build? I have seen a lot of publications using BEAST.

written 22 months ago by Michael Dondrup46k

I am using gene trees from Ensembl, which I cut at the family level. For example, I have around 15000 trees for Rodentia, each with 5 to 50 leaves (most of them between 5 and 10, I guess, but I didn't check). Yes, BEAST would be the canonical tool for age inference, I suppose; I need to get to know it...

written 22 months ago by BlastedBadger70

Do you have any molecular clock data for calibration? I guess 15000 trees is a lot to run through an MCMC method, though. Beyond that, I unfortunately don't have experience with this.

written 22 months ago by Michael Dondrup46k

For the dataset being considered here, i.e., 5 to 50 sequences (leaves) per tree, I think it can be processed quickly, although a server would of course be an advantage. I have had datasets with 200 to 400 sequences and was able to run them for 200 million generations in 24 hours. Also, for the purposes of the year separation, I guess running fewer generations could speed up the process.

written 22 months ago by sridhar56100