Is it possible to use the Neighbor Joining algorithm on a distance matrix with more than 1M taxa?
5.6 years ago
ajingnk ▴ 130

I have a distance matrix with more than 1M nodes. I wonder whether there exists any Neighbour Joining implementation that could calculate a tree structure in a reasonable amount of time. The complexity of Neighbour Joining is O(N^3), so I am not sure whether any heuristic algorithm could do this.
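For context, here is a minimal pure-Python sketch of textbook neighbour joining (my own illustration, not any particular package's implementation), which makes the O(N^3) cost visible: there are N merge iterations, each scanning all O(N^2) remaining pairs:

```python
from itertools import combinations

def neighbor_joining(labels, D):
    """Naive O(N^3) neighbour joining on a full symmetric distance matrix.

    labels: list of taxon names; D: list-of-lists of pairwise distances.
    Returns the tree topology as nested tuples (branch lengths omitted
    to keep the sketch short).
    """
    labels = list(labels)
    D = [row[:] for row in D]
    while len(labels) > 2:
        n = len(labels)
        r = [sum(row) for row in D]            # row sums, O(N^2)
        # pair (i, j) minimising the Q-criterion, O(N^2) per iteration
        i, j = min(combinations(range(n), 2),
                   key=lambda p: (n - 2) * D[p[0]][p[1]] - r[p[0]] - r[p[1]])
        # distances from the new internal node u to every remaining taxon
        new_row = [(D[i][k] + D[j][k] - D[i][j]) / 2
                   for k in range(n) if k not in (i, j)]
        # rebuild the matrix with i and j merged into u
        new_labels = [labels[k] for k in range(n) if k not in (i, j)]
        new_D = [[D[a][b] for b in range(n) if b not in (i, j)]
                 for a in range(n) if a not in (i, j)]
        for row, d in zip(new_D, new_row):
            row.append(d)
        new_D.append(new_row + [0.0])
        labels = new_labels + [(labels[i], labels[j])]
        D = new_D
    return (labels[0], labels[1])
```

For example, on a 4-taxon additive matrix the sketch correctly pairs the two cherries:

```python
neighbor_joining(['a', 'b', 'c', 'd'],
                 [[0, 5, 9, 7],
                  [5, 0, 10, 8],
                  [9, 10, 0, 6],
                  [7, 8, 6, 0]])
```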

Currently, I have a distance matrix. If there is another comparable tree-building algorithm that starts from a distance matrix, I would like to try that method as well.


1M × 1M = 10^12 entries. If you store the matrix as 4-byte floats, that is 4 TB. An algorithm is likely to keep a few such matrices around. Do you have a machine with ~10 TB of RAM? If your main purpose is clustering, there are more appropriate algorithms than NJ.
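As a sanity check on those numbers (a back-of-the-envelope sketch, nothing implementation-specific):

```python
# Memory for a dense 1M x 1M distance matrix.
n = 1_000_000
entries = n * n                      # 10**12 entries (one trillion)
bytes_float32 = entries * 4          # 4 bytes per 32-bit float
terabytes = bytes_float32 / 10**12   # 4.0 TB for a single full matrix
```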


There are in fact efficient algorithms that do not require loading the entire matrix into memory.


I haven't seen a popular implementation of hierarchical clustering with complexity below O(N^2).


What will you do next after you get the tree? If you want to cluster nodes into groups, you can cluster directly without building a tree. There are many clustering algorithms; hierarchical clustering is just a very basic one.


I need the tree structure for visualization.


Suppose you managed to do the clustering. How do you imagine showing the hierarchical structure of a tree with 1M leaf nodes? It sounds to me like your actual problem is much more about big-data visualization than it is about the computational challenge.


A tree with 1M leaf nodes has about 2M nodes and 2M edges. I was thinking of using Gephi, but I guess I should reduce the size of the data first.
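Those counts follow from the standard fact that an unrooted binary tree with n leaves has n − 2 internal nodes and 2n − 3 edges; a quick check:

```python
# Node/edge counts for an unrooted binary tree with n leaves.
n_leaves = 1_000_000
n_internal = n_leaves - 2            # internal nodes
n_nodes = n_leaves + n_internal      # 1_999_998, i.e. about 2M
n_edges = 2 * n_leaves - 3           # 1_999_997, also about 2M
```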


If the starting point is an all-against-all similarity matrix, it is theoretically impossible to do better than O(N^2), since just reading through the similarity matrix already takes N^2 operations.


Yes, you are right. N^2 is the best an algorithm can do.

5.6 years ago

It is pushing the limits, but it is likely possible. Although the worst-case performance of Neighbour Joining is O(N^3), there are heuristics that achieve roughly O(N^2) on most real data. I would take a look at RapidNJ and related work by the same group. NINJA might be a viable option too. Extrapolating from O(N^2) scaling and a runtime of 100 seconds for 8,000 taxa, you are looking at a runtime on the order of a few weeks.
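The extrapolation above can be reproduced in a few lines (the 100 s / 8,000-taxa figure comes from the answer; the purely quadratic scaling is an assumption):

```python
# Quadratic-scaling extrapolation: 100 s at 8,000 taxa -> 1M taxa.
base_taxa, base_seconds = 8_000, 100
target_taxa = 1_000_000
est_seconds = base_seconds * (target_taxa / base_taxa) ** 2
est_days = est_seconds / 86_400      # roughly 18 days, i.e. a few weeks
```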


Thanks Lars! So, do RapidNJ and NINJA take care of the memory issue as well?


I am not sure. I think eRapidNJ (the paper I linked to) is where that was introduced, but I have not found the code. If RapidNJ takes too much memory, I would suggest contacting the authors of the eRapidNJ article.


The paper says that eRapidNJ still loads "the lower triangular matrix" into RAM. That is 2 TB.


The authors state that *first* they reduce the memory requirement by a factor of two by storing only the lower triangular matrix. On the next page they explain how they further reduce the memory consumption by storing only some columns in memory at a time.


My impression is that the column-only trick is applied only to S, not to the distance matrix D? "The size of S is reduced by only storing as many columns of S as can fit in the available internal memory after D has been loaded."


You may be right. However, one thing really mystifies me: in the abstract the authors state that the N^2 memory consumption becomes a problem for large trees, and that the new algorithm is applicable to very large trees. That does not make much sense if the new algorithm also requires N^2 memory, does it?