Question

UPGMA method when tree leaves have different known ages

0

6.9 years ago

BlastedBadger ▴ 160

Hi,

I have gene trees with molecular distances. These trees contain speciation nodes, whose absolute age is known, and gene duplication nodes, whose absolute age is unknown. The molecular trees are far from ultrametric (that is, the root-to-leaf distance varies from leaf to leaf).

I would like to infer an approximate age for the duplication nodes. I started with a crude method that resembles UPGMA (actually it is closer to WPGMA, but the idea is similar): I start from a speciation and climb down the tree; each time I find a duplication node, I average the descendant branch lengths. I can then scale this new ultrametric distance using the absolute ages of the speciations.
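A minimal sketch of the averaging-and-scaling idea described above, for the simple case where all calibrated descendants of a duplication share the same age. The `Node` class, the function names, and the linear interpolation between the bounding speciation ages are assumptions made for illustration; they are not taken from an existing tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical tree node used only for this sketch."""
    branch: float                      # molecular branch length to the parent
    age: Optional[float] = None        # known for speciations/leaves, None for duplications
    children: List["Node"] = field(default_factory=list)


def averaged_height(node: Node) -> float:
    """WPGMA-style height: molecular distance from `node` down to the nearest
    calibrated nodes, giving each child lineage equal weight.
    Assumes every leaf is calibrated (has a known age)."""
    if node.age is not None:           # calibrated node: stop descending
        return 0.0
    per_child = [c.branch + averaged_height(c) for c in node.children]
    return sum(per_child) / len(per_child)


def duplication_age(dup: Node, parent_age: float, child_age: float) -> float:
    """Place the duplication between the speciation above it (parent_age)
    and the speciations below it (child_age, assumed identical for all
    descendants), proportionally to the averaged molecular distances."""
    below = averaged_height(dup)
    total = below + dup.branch         # averaged path length between the two calibrations
    if total == 0:
        return child_age               # no signal: fall back to the younger bound
    return child_age + (parent_age - child_age) * below / total


# Duplication 0.2 below a speciation aged 30, with two child speciations aged 10
dup = Node(branch=0.2, children=[Node(branch=0.1, age=10.0),
                                 Node(branch=0.3, age=10.0)])
estimate = duplication_age(dup, parent_age=30.0, child_age=10.0)
```

Here the two child branches (0.1 and 0.3) average to 0.2, which equals the branch above the duplication, so the duplication is placed halfway between the two calibrations. The sketch deliberately does not handle children calibrated at different ages.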

However, some speciations are also missing from these trees, so when I find a duplication node, it is possible that one descendant node is a speciation at time t1 while the other descendant is a different speciation at time t2 ≠ t1.

I will make a little sketch to illustrate this:

If all the speciation nodes were present, I would try to reconstruct an ultrametric tree that looks like this:
```
         |------ S1
    |----|
    |    |------ S1
----|
    |----------- S1
```
But because some speciation nodes are missing, the tree to reconstruct should look like this:
```
         |------ S1
    |----|
    |    |------ S1
----|
    |------------------- S2
```

Is there a simple adaptation I can apply to my method to take this into account?

I know there are sophisticated methods to infer divergence times from molecular data, but some of them are parametric methods (with varying rates of evolution and likelihood inference) that seem too computationally intensive for the number of trees I have to process. I am currently reading Sanderson 1997 (a nonparametric method) and Sanderson 2002 (a semi-parametric method) to see if I could apply these, but right now I'd prefer to start simple and fast. That said, I would be happy if you could suggest state-of-the-art methods or reviews comparing multiple methods :)

phylogeny dating UPGMA molecular clock time-tree • 1.8k views

updated 6.9 years ago by Michael 54k • written 6.9 years ago by BlastedBadger ▴ 160

1

How many time-trees do you have to build? I have seen a lot of publications using BEAST.

6.9 years ago by Michael 54k

0

I am using gene trees from Ensembl, which I cut at the family level. So, for example, I have around 15000 trees for Rodentia. They can have from 5 to 50 leaves (most of them between 5 and 10, I guess, but I didn't check). Yes, BEAST would be the canonical tool for age inference, I suppose; I need to get to know it...

6.9 years ago by BlastedBadger ▴ 160

0

Do you have some molecular clock data for calibration? I guess 15000 trees is a lot to run through an MCMC method, though. Otherwise, I unfortunately don't have experience with this.

6.9 years ago by Michael 54k

1

For the dataset being considered here, i.e., 5 to 50 sequences (leaves), I think it can be processed quickly, though a server would of course be advantageous. I have had datasets with 200 to 400 sequences and was able to run them in 24 hours for 200 million generations. Also, for the purposes of the year separation, I guess running fewer generations could speed up the process.

6.9 years ago by sridhar56 ▴ 110