Neighbor Joining algorithm: why n-2?
1
0
Entering edit mode
5.2 years ago
michael • 0

I have been looking into how the neighbor joining algorithm works and I have two questions about the first step, i.e. computing the matrix Q using the following formula:

Q(i,j) = (n - 2)d(i,j) - Σd(i,k) - Σd(j,k)

Where both summations are from k = 1 to n, n is the number of taxons, d() is the distance between two elements and i and j are specific taxons.

From my understanding this formula creates a new matrix containing the distance between leaves i and j multiplied by (n-2). From this value we subtract the total distance from i to all other nodes and the total distance from j to all other nodes. This gives us a sense of how far away any given pair of nodes is from all other clusters.

I think we are using the term (n-2) as a scaling factor to ensure that the value calculated by the two sums does not dominate the result. If this is the case, why is it n-2 and not n-1 or just n?

Finally, I have also encountered the following formula for Q:

Q(i,j) = d(i,j) - 1/(n-2)Σd(i,k) - 1/(n-2)Σd(j,k)

Are these two formulas equivalent? If so why?

Thank you.

neighbor-joining phylogenetics evolution • 2.0k views
0
Entering edit mode

This is not a scaling factor, it comes from doing the sum over all leaves except i and j.

2
Entering edit mode
5.2 years ago
kloetzl ★ 1.1k

All of my references have a different notation, but I will try to answer anyway.

If this is the case, why is it n-2 and not n-1 or just n?

You kind of answered the question yourself in the paragraph above. You want (try to) to fuse two nodes i and j into a new node p. By doing that you remove one edge from the tree, hence n-1. Knock off another -1 for each "self distance" i.e. value of 0 in the sum, thus n-2.

Are these two formulas equivalent? If so why?

They differ only by a factor of n-2. However, what it most important is to join the nodes with the minimum distance; and that doesn't change.