Question: Distance Tree For One Million Sequences
gravatar for Rvosa
8.8 years ago by
Leiden, the Netherlands
Rvosa570 wrote:

Dear biostar,

if someone has an enormous table with pairwise distances between 1 million protein sequences, and he wants to build a tree out of that (say, using neighbor joining), is there a way to do that?



ADD COMMENTlink modified 8.7 years ago by Andreas2.4k • written 8.8 years ago by Rvosa570

What is your input? A 1M*1M matrix? That is going to be 1 terabyte at least! If you want to cluster proteins, there are specialized software to do that.

ADD REPLYlink written 8.8 years ago by lh331k

As lh3 notes, it's very memory intensive, because of the quadratic distance matrix. In addition I would very much question if this task makes sense. Often, the intention to cluster so many objects arises from poor filtering, lack of hypothesis or misunderstanding of the problem. So I doubt that you really want to do what you say you want to do.

ADD REPLYlink written 8.8 years ago by Dr. Mabuse47k

I'd agree with Michael's comment. Are you sure that you want to build such a tree? What would you do with it? It would be pretty much impossible to visualize. Are there even tools available to navigate around/analyze such a tree or subsets of it?

ADD REPLYlink written 8.8 years ago by Neilfws48k
gravatar for Dave Lunt
8.8 years ago by
Dave Lunt2.0k
Hull, UK
Dave Lunt2.0k wrote:

There are lots of very sensible comments already to this, let me add one, then ignore it and carry on! Price et al say

"The distance matrix for families with 100,000–200,000 members requires 20–80 GB of memory to store (a 4-byte floating-point value for each of N(N −1)/2 pairs). Although computers with this much memory are available, the typical node in a compute cluster has an order of magnitude less memory."

Lets not be too sensible though, I want to see a tree of 1 million protein sequences, I think it would be both useful and fun! I don't think it is impossible, even today with some lateral thinking. I don't think it has ever been done though, so there is no pre-existing methodology. If the distance matrix already exists there might be a way to handle it without loading everything into memory. The relaxed NJ methods of FastTree and ClearCut programes might be a way forward, or maybe not.

The FastTree website says "FastTree can handle alignments with up to a million of sequences in a reasonable amount of time and memory." So maybe just start again with an alignment could be an option?

How to visualize it would certainly be a challenge. But two points; firstly we don't have to visualize it get get information, interesting tree stats can be extracted without seeing anything. Secondly, maybe it can be visualized. There is a decent amount of research on this topic- things that don't just throw the whole tree at you, but allow real time zooming. Also I have seen some cool images from Walrus graph visualization tool with about 500k tips. I have an (untested) suspicion that ARB could also handle a tree of 1 million tips (as long as you didn't have to make the tree in the programme.

Its not going to be easy, but don't lets conclude too fast that it can't be done.

ADD COMMENTlink written 8.8 years ago by Dave Lunt2.0k

+1 for positive encouragement to push the envelope of what is possible.

ADD REPLYlink written 8.8 years ago by Casey Bergman18k

+1 for mentioning fasttree. On the other hand, I guess rvosa was thinking about protein clustering but reduced this problem to a much harder problem: building a huge tree. The more appropriate formulation is clustering on a sparse graph. As to fasttree, it seems to me that all its advantage lies in that it does not use a distance matrix (I could be wrong). But given a million proteins, it is pretty hopeless to get a multialignment. Fasttree is great, but might not be helpful here.

ADD REPLYlink written 8.8 years ago by lh331k
gravatar for Pablo Pareja
8.8 years ago by
Pablo Pareja1.6k
Granada, Spain
Pablo Pareja1.6k wrote:

Hi Rutger,

I'm not sure if this could be an option for you but you could use a graph-based DB system like Neo4j for storing all this data.

Compared to relational DB systems where information must be flattened up in tables, with a graph-based DB approach you can carry out many processes/algorithms that directly are impossible with the other option.



ADD COMMENTlink written 8.8 years ago by Pablo Pareja1.6k
gravatar for Botond Sipos
8.8 years ago by
Botond Sipos1.7k
United Kingdom
Botond Sipos1.7k wrote:

Check out RAPIDNJ by Mailund et al. and this related article addressing the problem of large NJ trees.

ADD COMMENTlink written 8.8 years ago by Botond Sipos1.7k
gravatar for Phis
8.8 years ago by
Phis1.0k wrote:

In principle, it shouldn't be a problem, only in practice - I agree with Dave's answer: the memory requirements will be so large that in practice, you won't be able to do it on most currently available hardware. At least if you use the "naive" textbook approach of doing it. But I'm sure it should be possible to come up with some dark trickery to do it nonetheless:

Now, I guess this may not be the answer to your exact question, but very large trees have already been reconstructed. Specifically, I heard a talk by Stephen A. Smith recently. He showed a phylogenetic analysis he had recently done, using 100 000 taxa, based on DNA sequence data and likelihood methods, if I remember correctly. I therefore have a vague feeling that you may be interested in the stuff he's been doing.

I don't know whether he used this program for his analysis, but certainly during his talk he mentioned his software PHLAWD.

Hope that helps

ADD COMMENTlink written 8.8 years ago by Phis1.0k
gravatar for Andreas
8.8 years ago by
Andreas2.4k wrote:

So you already have those distances precomputed? Computing the distances alone will be a big bottle-neck. For clustering, my first thought was also FastTree, but I don't think you can input a distance matrix. FastTree computes the pairwise distances from an alignment. And even if you could, there is no way you could fit a 1e6x1e6/2 distance matrix into memory.

Have a look at MC-UPGMA which is a memory-restrained UPGMA implementation. They clustered all of UniProt as an example in their publication. Not sure how flexible it is though (never tried it myself).

I would be interested to see how you are going to visualise that tree...


ADD COMMENTlink modified 8.8 years ago • written 8.8 years ago by Andreas2.4k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1699 users visited in the last hour