Question: Neighbor Joining algorithm: why n-2?
0
2.4 years ago by
michael0
michael0 wrote:

I have been looking into how the neighbor joining algorithm works and I have two questions about the first step, i.e. computing the matrix Q using the following formula:

Q(i,j) = (n - 2)d(i,j) - Σd(i,k) - Σd(j,k)

Where both summations are from k = 1 to n, n is the number of taxons, d() is the distance between two elements and i and j are specific taxons.

From my understanding this formula creates a new matrix containing the distance between leaves i and j multiplied by (n-2). From this value we subtract the total distance from i to all other nodes and the total distance from j to all other nodes. This gives us a sense of how far away any given pair of nodes is from all other clusters.

I think we are using the term (n-2) as a scaling factor to ensure that the value calculated by the two sums does not dominate the result. If this is the case, why is it n-2 and not n-1 or just n?

Finally, I have also encountered the following formula for Q:

Q(i,j) = d(i,j) - 1/(n-2)Σd(i,k) - 1/(n-2)Σd(j,k)

Are these two formulas equivalent? If so why?

Thank you.

modified 2.4 years ago by kloetzl1.1k • written 2.4 years ago by michael0

This is not a scaling factor, it comes from doing the sum over all leaves except i and j.

2
2.4 years ago by
kloetzl1.1k
European Union
kloetzl1.1k wrote:

All of my references have a different notation, but I will try to answer anyway.

If this is the case, why is it n-2 and not n-1 or just n?

You kind of answered the question yourself in the paragraph above. You want (try to) to fuse two nodes `i` and `j` into a new node `p`. By doing that you remove one edge from the tree, hence `n-1`. Knock off another `-1` for each "self distance" i.e. value of 0 in the sum, thus `n-2`.

Are these two formulas equivalent? If so why?

They differ only by a factor of `n-2`. However, what it most important is to join the nodes with the minimum distance; and that doesn't change.