Question: Newick trees analysis software
2.0 years ago by
roussine10
roussine10 wrote:

Hello everyone-

Please take a minute to answer if you happen to know. I need to output basic parameters of text Newick trees: branch lengths and node supports (ideally with averages or other basic statistics, though that is not a priority). Is there a nice pipelinable tool around that does this, before I have to go into scripting? There are a great many trees, so pipelining is essential. Either Unix or Windows is OK.

Thanks much in advance, - Leo

newick tree analysis software • 1.0k views
ADD COMMENTlink modified 2.0 years ago by Joe19k • written 2.0 years ago by roussine10

How do you need the output? As labels within a plotted tree? Bear in mind that Newick format is not ideal for this, because there is no guarantee that the values you need are even in there; it all depends on the software that wrote the file. If you could provide a small example it might help, but the format is rather easy to parse.
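
To illustrate how the values sit inside a Newick string, here is a minimal sketch using regular expressions. It assumes the layout FastTree-style trees use, where a branch length follows a colon and a support value sits between a closing parenthesis and the next colon. The toy tree and the names `extract_values`, `BRANCH_LENGTH` and `SUPPORT` are invented for illustration. Quoted labels or bracketed comments would break it, so a real parser is the safer choice for anything beyond a quick check.

```python
import re

# A plain decimal number, optionally in scientific notation.
NUMBER = r"[0-9]+(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?"
# Branch lengths: every number that directly follows a colon.
BRANCH_LENGTH = re.compile(r":(" + NUMBER + r")")
# Supports: a number between a closing parenthesis and a colon.
SUPPORT = re.compile(r"\)(" + NUMBER + r"):")


def extract_values(newick):
    """Return (branch_lengths, supports) found in a Newick string."""
    lengths = [float(x) for x in BRANCH_LENGTH.findall(newick)]
    supports = [float(x) for x in SUPPORT.findall(newick)]
    return lengths, supports


# Toy tree, made up for this example.
toy = "(A:0.1,(B:0.2,C:0.3)0.95:0.4);"
print(extract_values(toy))  # ([0.1, 0.2, 0.3, 0.4], [0.95])
```

If the writing software stores internal node names instead of supports, the second list simply comes back empty.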

ADD REPLYlink written 2.0 years ago by Michael Dondrup48k

Michael – thanks for your response. The need is just a text output for each text tree: all branch lengths (or their average) and all node supports (or their average). The values need not be in one file. Extracting these values from the trees is the step that would require ad hoc scripting if no nice pipelinable tool exists. I do not have an example tree at hand at the moment, but these are nice standard Newick trees output by FastTree.

ADD REPLYlink written 2.0 years ago by roussine10

If you need the branch lengths, they're already encoded inside the newick format - what do you want to do with the values?

Here's some code I wrote a while back to work out distances in trees:

https://github.com/jrjhealey/bioinfo-tools/blob/master/tree_dists.py

You can use it like so, to get the pairwise distances between all tips: python tree_dists.py -m all -s newick -i mytreefile.tree.

You can also use it like so, to get the distance between the 2 most distant tips: python tree_dists.py -m max -s newick -i mytreefile.tree. The max is default, so you can also run this without -m to get the same result. If it's useful, I'll be happy to edit the code to alter the output formats or to provide other calculation options.

ADD REPLYlink written 2.0 years ago by Joe19k

Dear jrj.healey – my need for this case is rather simple: to basically parse newick and extract numerical values of branch lengths and node support. I will then do basic statistics to assess trees and bin them further. This is a constitutive step of our phylogenomic approach to analyse orthology groups. Did you think of making your script extract such values?

ADD REPLYlink written 2.0 years ago by roussine10

I can look into it. It shouldn't be difficult. ETE3's object model stores nodes with their associated values, I believe.

Could you mock up some example input and output you'd expect?

ADD REPLYlink written 2.0 years ago by Joe19k

Ok, example in-outs are like this.

Input would be a standard newick:

((logi|XP_009052348.1:0.30900,(dapu|EFX67985.1:0.40918,dare|NP_001007771.1:0.18921)0.580:0.08422)0.826:0.09733,cate|ELT98251.1:0.29370,(lian|XP_013420576.1:0.18354,((ocbi|XP_014783723.1:0.22136,(neve|XP_001634838.1:1.09355,(scma|XP_018652019.1:0.58808,ecmu|CDS40328.1:0.89059)0.920:0.47871)0.738:0.11872)0.572:0.03167,hero|XP_009022332.1:0.79732)0.005:0.06582)0.790:0.07221);

Output would be plain text columns:

Tree 1 (= treefile name)    
br_lens
0.30900
0.40918
…
AVERAGE: …
MEDIAN: …

node_support
0.580
0.826
…
AVERAGE: …
MEDIAN: …

And so on for each of n (many thousands) trees. If it all goes to one or separate files – whatever is easier to implement. Please let me know if I made sense.

ADD REPLYlink written 2.0 years ago by roussine10
2.0 years ago by
Joe19k
United Kingdom
Joe19k wrote:

OK, I haven't worked it into my other code just yet, but here are some approaches you could use based around ete3:

from ete3 import Tree
import sys
from statistics import median

# Read the first line of the file as a Newick string.
with open(sys.argv[1], 'r') as handle:
    t = Tree(handle.readline())

# traverse() visits every node: the root, the internal nodes and the tips.
nodes = list(t.traverse())
br_lens = [node.dist for node in nodes]
supports = [node.support for node in nodes]

# Get all branch lengths:
print('Tree = {}'.format(sys.argv[1]))
print('br_lens')
for value in br_lens:
    print(value)
print('AVERAGE: {}'.format(sum(br_lens) / len(br_lens)))
print('MEDIAN: {}'.format(median(br_lens)))

# Support is basically a case of doing the same as the above.
print('\n')
print('node_support')
for value in supports:
    print(value)
print('AVERAGE: {}'.format(sum(supports) / len(supports)))
print('MEDIAN: {}'.format(median(supports)))

Given the input as bs.tree:

$ cat bs.tree
((logi|XP_009052348.1:0.30900,(dapu|EFX67985.1:0.40918,dare|NP_001007771.1:0.18921)0.580:0.08422)0.826:0.09733,cate|ELT98251.1:0.29370,(lian|XP_013420576.1:0.18354,((ocbi|XP_014783723.1:0.22136,(neve|XP_001634838.1:1.09355,(scma|XP_018652019.1:0.58808,ecmu|CDS40328.1:0.89059)0.920:0.47871)0.738:0.11872)0.572:0.03167,hero|XP_009022332.1:0.79732)0.005:0.06582)0.790:0.07221);

$ python3 script.py bs.tree

Tree = bs.tree
br_lens
0.0
0.09733
0.2937
0.07221
0.309
0.08422
0.18354
0.06582
0.40918
0.18921
0.03167
0.79732
0.22136
0.11872
1.09355
0.47871
0.58808
0.89059
AVERAGE: 0.3291227777777778
MEDIAN: 0.205285


node_support
1.0
0.826
1.0
0.79
1.0
0.58
1.0
0.005
1.0
1.0
0.572
1.0
1.0
0.738
1.0
0.92
1.0
1.0
AVERAGE: 0.8572777777777777
MEDIAN: 1.0

It's not the most elegant code in the world (it could probably be refactored to a function rather than loads of printing and list comprehensions) but hopefully that's close enough to what you need to suffice.
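
As a rough sketch of that refactoring, the version below moves the logic into functions. It also makes one change that is an assumption about what is wanted, not something requested in the thread: the listing above includes the root (the 0.0 branch length) and reports a support of 1.0 for the root and every tip, which pulls the support average and median towards 1.0. Here the root is skipped and supports are collected from internal nodes only. The names `collect_values`, `format_report` and `main` are made up for illustration; `traverse()`, `is_root()`, `is_leaf()`, `dist` and `support` are the ete3 node members.

```python
from statistics import mean, median


def collect_values(tree):
    """Return (branch_lengths, supports) for an ete3-style tree.

    Branch lengths skip the root, which has no parent branch.
    Supports are taken from internal, non-root nodes only.
    """
    branch_lengths = []
    supports = []
    for node in tree.traverse():
        if node.is_root():
            continue
        branch_lengths.append(node.dist)
        if not node.is_leaf():
            supports.append(node.support)
    return branch_lengths, supports


def format_report(name, branch_lengths, supports):
    """Build a plain-text report in the layout asked for in the thread."""
    lines = ["Tree = {}".format(name)]
    for title, values in (("br_lens", branch_lengths),
                          ("node_support", supports)):
        lines.append(title)
        lines.extend(str(v) for v in values)
        if values:  # mean() and median() raise StatisticsError on empty input
            lines.append("AVERAGE: {}".format(mean(values)))
            lines.append("MEDIAN: {}".format(median(values)))
        lines.append("")
    return "\n".join(lines)


def main(path):
    # Imported here so the helpers above can be reused without ete3 installed.
    from ete3 import Tree
    with open(path) as handle:
        tree = Tree(handle.readline())
    print(format_report(path, *collect_values(tree)))
```

Calling `main(sys.argv[1])` at the bottom of the file (after `import sys`) would give the same command-line usage as the original script. On bs.tree this should leave 17 branch lengths and 7 support values, since the root and the 10 tips no longer contribute.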

If you want to apply it to lots of trees I'd suggest doing something like:

$ for tree in *.tree ; do python3 script.py "$tree" > "${tree%.*}_output.txt" ; done

(or look into parallel processing with GNU parallel or similar).
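
With many thousands of trees, starting a fresh interpreter for every file adds up, so an alternative is to do the looping inside Python. This is a minimal sketch, assuming a hypothetical `summarise(path)` callable that returns the report text for one tree (for example, something built on the ete3 code above). `run_batch` and its parameters are illustrative names, and the output naming mirrors the shell loop.

```python
from pathlib import Path


def run_batch(tree_dir, summarise, pattern="*.tree"):
    """Write <stem>_output.txt next to every tree file matching pattern.

    summarise is any callable taking a Path and returning the report text.
    Returns the list of output paths that were written.
    """
    written = []
    for tree_path in sorted(Path(tree_dir).glob(pattern)):
        # Same naming as "${tree%.*}_output.txt": drop the last extension.
        out_path = tree_path.with_name(tree_path.stem + "_output.txt")
        out_path.write_text(summarise(tree_path))
        written.append(out_path)
    return written
```

Because the summarising step is passed in as a function, the same loop works whichever way the per-tree values are extracted.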

ADD COMMENTlink modified 2.0 years ago • written 2.0 years ago by Joe19k

Dear jrj.healey - many thanks for the assistance and for getting into this. I tried the script, and Python returns this:

$ python3 br_lens+n_supp.py tree.tre
  File "br_lens+n_supp.py", line 6
    print(f'Tree = {sys.argv[1]}')
SyntaxError: invalid syntax

Might be version dependent. Could you comment?
Thanks a lot

ADD REPLYlink written 2.0 years ago by roussine10

That's a strange error for sure. I would have expected it to be a little more informative (usually it has an arrow depicting the issue). What version of Python are you using? I think f-strings only appeared in 3.6 and later (I'm using 3.6.8 at the moment).

ADD REPLYlink modified 2.0 years ago • written 2.0 years ago by Joe19k

It's Python 3.4.3 (default, Nov 12 2018, 22:25:49) on Bio-Linux 8. And yes, there is an arrow:

    print(f'Tree = {sys.argv[1]}')
                                ^

The arrow points to the last single quotation mark.

ADD REPLYlink modified 2.0 years ago • written 2.0 years ago by roussine10

Yes, this is a python version error then. Can you upgrade to 3.6 or higher?

ADD REPLYlink modified 2.0 years ago • written 2.0 years ago by Joe19k

I've updated the code in the answer post. I've dropped the use of f-strings, but you will still need to use Python 3 as it needs the statistics module. It should now run on versions below 3.6, however.

I also noticed I was missing the division from my average calculations, so I've updated that and the new output I get.
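
For reference, a small sketch of the change described here: `str.format` produces the same header line as the f-string did, and a version check shows which interpreters accept f-strings. The function names `tree_header` and `supports_fstrings` are made up for illustration.

```python
import sys


def tree_header(name):
    # str.format works on every Python 3 release; the f-string
    # equivalent, f'Tree = {name}', needs Python 3.6 or newer.
    return 'Tree = {}'.format(name)


def supports_fstrings(version_info=sys.version_info):
    """True if the given interpreter version can parse f-strings."""
    return tuple(version_info[:2]) >= (3, 6)


print(tree_header('bs.tree'))        # Tree = bs.tree
print(supports_fstrings((3, 4, 3)))  # False
print(supports_fstrings((3, 6, 8)))  # True
```

Note that a check like this cannot rescue a script that already contains an f-string, because the SyntaxError is raised when the file is parsed, before any of its code runs.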

ADD REPLYlink written 2.0 years ago by Joe19k

Thank you - while I was fiddling with data, python3 returns:

Traceback (most recent call last):
  File "br_lens+n_supp.py", line 1, in <module>
    from ete3 import Tree
ImportError: No module named 'ete3'

ete3 is installed, however, at /usr/local/bin/ete3

Please let me know what I am missing.

ADD REPLYlink written 2.0 years ago by roussine10

It's probably a PYTHONPATH issue.

Try:

$ python3 -m pip install ete3

Then re-run and see if that works.

(Or better yet, switch to using python via conda and the process is even easier)
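
A quick way to check this kind of mismatch is to ask the interpreter itself where it would import the module from. The sketch below uses only the standard library; `module_location` is an illustrative helper name. Run it with the same `python3` that runs the script.

```python
import importlib.util
import sys


def module_location(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# Which interpreter is running, and can it see ete3?
print("interpreter:", sys.executable)
print("ete3 module:", module_location("ete3"))
```

If the second line prints `None`, this interpreter cannot import ete3, even when an `ete3` command exists elsewhere on the system.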

ADD REPLYlink modified 2.0 years ago • written 2.0 years ago by Joe19k

Thank you indeed for valuable help. -- It all just works.

ADD REPLYlink written 2.0 years ago by roussine10
2.0 years ago by
Bergen, Norway
Michael Dondrup48k wrote:

Can you use Python? Then the ETE library should be ok for you, see http://etetoolkit.org/docs/latest/tutorial/tutorial_trees.html#reading-and-writing-newick-trees . There is also the R package ape that can do that.

Please let me know if that is sufficient or if you need more guidance based on a concrete example of input and output.

ADD COMMENTlink modified 2.0 years ago • written 2.0 years ago by Michael Dondrup48k

Thank you, and thank you to jrj.healey for making a working script. I am not a python guy so I couldn't have done it.

ADD REPLYlink written 2.0 years ago by Michael Dondrup48k

Michael – thanks indeed for your involvement. I will go through your suggestions shortly.

ADD REPLYlink written 2.0 years ago by roussine10