Question

Distance Tree For One Million Sequences

3

Entering edit mode

13.2 years ago

Rvosa ▴ 580

Dear biostar,

if someone has an enormous table with pairwise distances between 1 million protein sequences, and he wants to build a tree out of that (say, using neighbor joining), is there a way to do that?

Thanks,

Rutger

distance phylogenetics protein sequence • 3.7k views

ADD COMMENT • link updated 13.1 years ago by Andreas ★ 2.5k • written 13.2 years ago by Rvosa ▴ 580

5

Entering edit mode

What is your input? A 1M*1M matrix? That is going to be 1 terabyte at least! If you want to cluster proteins, there are specialized software to do that.

ADD REPLY • link 13.2 years ago by lh3 33k

2

Entering edit mode

As lh3 notes, it's very memory intensive, because of the quadratic distance matrix. In addition I would very much question if this task makes sense. Often, the intention to cluster so many objects arises from poor filtering, lack of hypothesis or misunderstanding of the problem. So I doubt that you really want to do what you say you want to do.

ADD REPLY • link 13.2 years ago by Michael 54k

2

Entering edit mode

I'd agree with Michael's comment. Are you sure that you want to build such a tree? What would you do with it? It would be pretty much impossible to visualize. Are there even tools available to navigate around/analyze such a tree or subsets of it?

ADD REPLY • link 13.2 years ago by Neilfws 49k

score 8 · Answer 1 · 2011-02-07

There are lots of very sensible comments already to this, let me add one, then ignore it and carry on! Price et al say

"The distance matrix for families with 100,000–200,000 members requires 20–80 GB of memory to store (a 4-byte floating-point value for each of N(N −1)/2 pairs). Although computers with this much memory are available, the typical node in a compute cluster has an order of magnitude less memory."

Lets not be too sensible though, I want to see a tree of 1 million protein sequences, I think it would be both useful and fun! I don't think it is impossible, even today with some lateral thinking. I don't think it has ever been done though, so there is no pre-existing methodology. If the distance matrix already exists there might be a way to handle it without loading everything into memory. The relaxed NJ methods of FastTree and ClearCut programes might be a way forward, or maybe not.

The FastTree website says "FastTree can handle alignments with up to a million of sequences in a reasonable amount of time and memory." So maybe just start again with an alignment could be an option?

How to visualize it would certainly be a challenge. But two points; firstly we don't have to visualize it get get information, interesting tree stats can be extracted without seeing anything. Secondly, maybe it can be visualized. There is a decent amount of research on this topic- things that don't just throw the whole tree at you, but allow real time zooming. Also I have seen some cool images from Walrus graph visualization tool with about 500k tips. I have an (untested) suspicion that ARB could also handle a tree of 1 million tips (as long as you didn't have to make the tree in the programme.

Its not going to be easy, but don't lets conclude too fast that it can't be done.

score 2 · Answer 2 · 2011-02-07

Hi Rutger,

I'm not sure if this could be an option for you but you could use a graph-based DB system like Neo4j for storing all this data.

Compared to relational DB systems where information must be flattened up in tables, with a graph-based DB approach you can carry out many processes/algorithms that directly are impossible with the other option.

Cheers,

Pablo

score 2 · Answer 3 · 2011-02-09

2

Entering edit mode

13.2 years ago

Botond Sipos ★ 1.7k

Check out RAPIDNJ by Mailund et al. and this related article addressing the problem of large NJ trees.

ADD COMMENT • link 13.2 years ago by Botond Sipos ★ 1.7k

score 1 · Answer 4 · 2011-02-07

In principle, it shouldn't be a problem, only in practice - I agree with Dave's answer: the memory requirements will be so large that in practice, you won't be able to do it on most currently available hardware. At least if you use the "naive" textbook approach of doing it. But I'm sure it should be possible to come up with some dark trickery to do it nonetheless:

Now, I guess this may not be the answer to your exact question, but very large trees have already been reconstructed. Specifically, I heard a talk by Stephen A. Smith recently. He showed a phylogenetic analysis he had recently done, using 100 000 taxa, based on DNA sequence data and likelihood methods, if I remember correctly. I therefore have a vague feeling that you may be interested in the stuff he's been doing.

I don't know whether he used this program for his analysis, but certainly during his talk he mentioned his software PHLAWD.

Hope that helps

score 1 · Answer 5 · 2011-02-11

So you already have those distances precomputed? Computing the distances alone will be a big bottle-neck. For clustering, my first thought was also FastTree, but I don't think you can input a distance matrix. FastTree computes the pairwise distances from an alignment. And even if you could, there is no way you could fit a 1e6x1e6/2 distance matrix into memory.

Have a look at MC-UPGMA which is a memory-restrained UPGMA implementation. They clustered all of UniProt as an example in their publication. Not sure how flexible it is though (never tried it myself).

I would be interested to see how you are going to visualise that tree...

Andreas