Question: Robinson-Foulds Distance : Number Of Trees At Distance K From A Fixed Tree
0
manuellafond0 wrote:

Hi,

This is for members who might be knowledgeable about the Robinson-Foulds distance. I have a fixed tree T and a bunch of other trees B.

I use the Robinson-Foulds (RF) distance to compare the trees in B with T, and I get a distribution of distances. I'd like to show that this distribution is significantly different from random. In other words, my null hypothesis is that there is no relationship between the trees of B and the RF-distance. In order to contradict this hypothesis, it would help to know how the trees distances from T are distributed if I include all the possible trees.

Hence the question : is it possible to know the number of trees at distance k from a fixed tree T ?

phylogeny tree • 2.8k views
modified 11 months ago by leonardini0 • written 6.3 years ago by manuellafond0
0
Michael Dondrup47k wrote:

This problem should be accessible to full enumeration as the number of different tree topologies is finite. You have to impose two further constraints though:

• The number of nodes (e.g. |B| := |T| for all pairs (B, T), or a range) in the tree
• The maximum number of child-nodes per node (or unlimited)

Then, you can enumerate all possible trees with these properties (excluding symmetric trees) and compute the RF-distance to T for each. With this you can compute the discrete probability (and therefore a P-value) to get an observation RF(B,T) = k or anything more extreme, provided all trees have the same prior probability of occurring.

This gets more complicated if there should be an arbitrary number of nodes per tree, one might ask how much sense it makes to compare a tree with 1000 nodes with one having 10 nodes. So, I do not know how to solve this but doubt that taking into account different node counts makes sense.

Anyway, the number of trees that can contribute to achieve a distance that is less or equal to k is finite as well. Assume for example, that adding or deleting a node is defined for the metric and has constant cost c := 1 adding to the distance, then you only need to take into account those trees which have N nodes such that |T|-k <= N <= |T| + k (with |T| : number of nodes of tree T).

By the way, while this solves the original problem, I don't think it has any practical application to phylogentics, because:

• ignores branch depth
• "provided all trees have the same prior probability of occurring" which is obviously not the case

Yes thank you. I didn't go into the details but ignoring branch lengths and possible trees should be sufficient for the analysis at hand. I also forgot to mention that all the trees have the same number of nodes. The real question, which I should have made more precise, was "Can we find the number of trees at distance k analytically, purely by mathematical methods and without resorting to generating all non-isomorphic trees ?". But after searching and asking around, I'm getting pretty convinced that this hasn't been solved and is likely to be a tough problem. So I'll probably do as you suggest. Thank you for your answer.

While the number of tree topologies is finite, it becomes intractable for even modest taxon sets. Felsenstein (1978; http://tinyurl.com/zdnsn48) gives the equation for the number of labelled bifurcating rooted trees for n taxa:

(2n-3)! / [2^(n-2) * (n-2)!]

The number of unrooted trees for x taxa can be obtained from the equation above by setting n = x - 1.

For example, for just 10 taxa the number of unrooted trees is 2,027,025 and the number of rooted tree is 34,459,425.

0
leonardini0 wrote:

This paper might be of interest: https://arxiv.org/abs/0810.0868. And we have an much faster, nearly cubic-time, implementation here: https://github.com/WGS-TB/RFDistribution (please contact me if you decide to use it and want to cite it).