Question

reduce Newick tree and maintain greatest diversity in subset

2

Entering edit mode

8.5 years ago

voynich ▴ 20

I have several thousand sequences, which have been aligned and used to generate a Newick phylogeny tree. I want to find a smaller subset of sequences, X, such that the remaining sequences are still distributed uniformly across the phylogenic space.

So far I'm thinking about determining the number of branches at some distance from the root, until the number of branches is close to X. Then for each of these branches, I choose one tip sequence at random and delete the rest.

I am working with Python, but any pointers would be helpful. Thanks!

phylogeny tree • 2.3k views

ADD COMMENT • link updated 20 months ago by Ram 43k • written 8.5 years ago by voynich ▴ 20

0

Entering edit mode

Did you find a solution? Would be very interested to hear how you achieved this!

ADD REPLY • link 8.4 years ago by ONeillMB1 • 0

Ram · Answer 1 · 2015-10-28

This seems like it could be solved by a MST (minimum spanning tree)-like algorithm. Each iteration, find the shortest edge (distance between two leaves). Since that edge will join 2 nodes, you need to not only remove the edge, but discard one of the nodes, ideally the one closer their mutual next-closest node.

Iterate until you have the number of nodes you want. Storing edges in a heap (or priority queue) is often useful in this kind of scenario.

score 1 · Answer 2 · 2015-10-29

1

Entering edit mode

8.5 years ago

Biojl ★ 1.7k

You can work it out with ete2: http://etetoolkit.org/

It's python based and will let you parse nodes and leaves and calculate all sort of measures, it has some very interesting built-in functions.

ADD COMMENT • link 8.5 years ago by Biojl ★ 1.7k