Detect trees (newick) with specific topology
4.7 years ago
ibasan ▴ 40

Dear community, I have fewer than 3000 trees in Newick format, each with four species, like this example:

((Spec4:0.529207,(Spec3:0.0803395,Spec2:0.0124315)),Spec1:0,Spec1:0);


I only want to detect the trees in which two particular species cluster together, like Spec3 and Spec2 in the example. Is it possible to do that with a simple script, or does anybody know of software for this? (I have already tried phybin and ete3 compare.) I would be grateful if someone could help.

tree clustering compare topology • 1.9k views

I am not aware of a tool that subsets trees based on topology. Yes, a script or regex could help.

I am wondering whether you have images of the trees? If you do, it might be interesting to try a deep learning / computer vision-based approach here.


Dear Khader Shameer, at the moment I don't have images of the trees (but I could get them). Thanks for your reply.

4.7 years ago
Joe 19k

I think this might work, but it's a sort of 'brute force' way to do it. I would convert your trees to cladograms by removing the branch lengths with a regex that matches each branch length and its colon (in whatever your favourite regex language is). Then you can simply grep, or string search in some other manner, for (Spec3,Spec2), and you'll find all trees that contain that grouping pretty easily.

e.g. to strip the branch lengths and their colons from the file (probably not the most elegant regex), given:

((Spec4:0.529207,(Spec3:0.0803395,Spec2:0.0124315)),Spec1:0,Spec1:0);


One could do:

sed 's/:[0-9][0-9.eE+-]*//g' test.tree


Yielding:

 ((Spec4,(Spec3,Spec2)),Spec1,Spec1);


Then you can string search your yielded trees:

egrep -r -l "Spec(2|3),Spec(2|3)" .


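This gives you the names of all files in which Spec3 and Spec2 are adjacent in the tree string (in either order). Because the test is purely textual, it also matches Spec2,Spec2 and longer names that start the same way, such as Spec2,Spec30.

If you want to keep the branch lengths in your trees because you're not only interested in topology, you could concoct a regex for use with grep: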

egrep "Spec(2|3):(0?|[0-9]+\.[0-9]+),Spec(2|3):(0?|[0-9]+\.[0-9]+)" treefile.tree


But having to conjure that regex for every possible combination of topologies looks awful to me, so I'd be inclined to try it without the branch lengths.
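To avoid writing that pattern by hand for each pair, the regex can be generated. The following is only a sketch: `pair_regex` and `trees_with_pair` are hypothetical helper names, the species names are assumed to contain no regex metacharacters, and the branch-length part is assumed to be a plain or scientific-notation number. Unlike the patterns above, it also requires the enclosing parentheses, so it only reports the two species when they form their own two-leaf clade.

```shell
# pair_regex A B: print an extended regex matching the clade (A,B) or (B,A),
# allowing an optional ":branch_length" after each name, so it works on
# trees with or without branch lengths.
pair_regex() {
    len='(:[0-9][0-9.eE+-]*)?'
    printf '\\((%s%s,%s%s|%s%s,%s%s)\\)' \
        "$1" "$len" "$2" "$len" "$2" "$len" "$1" "$len"
}

# trees_with_pair A B FILE...: list the files that contain the clade.
trees_with_pair() {
    a=$1
    b=$2
    shift 2
    grep -lE "$(pair_regex "$a" "$b")" "$@"
}

# Example call (commented out so the definitions can be sourced safely):
# trees_with_pair Spec2 Spec3 *.tree
```

Because the parentheses are part of the pattern, a pair that sits directly at a multifurcating root, such as `...,Spec2,Spec3);`, is not reported. Whether that is desirable depends on how you treat the root.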

I don't know how many topologies you're interested in finding in all your trees - this approach may not be feasible if it's a prohibitively large number.

Slightly more complex: if you'd like to see both the match and the file name, this is an option.

Take two example sed-treated trees:

((Spec4,(Spec5,Spec6)),Spec2,Spec3);
((Spec4,(Spec3,Spec2)),Spec1,Spec1);


Passing a 'dummy filename' in the form of /dev/null tricks grep into printing the filename (as it thinks it's working on multiple files) along with the matching line itself:

for file in *.tree ; do egrep "Spec(2|3),Spec(2|3)" "$file" /dev/null ; done


Would yield:

sed2.tree:((Spec4,(Spec5,Spec6)),Spec2,Spec3);
sed.tree:((Spec4,(Spec3,Spec2)),Spec1,Spec1);


With the appropriate string matches highlighted (if your terminal is configured for it).


Dear jrj.healey, since I'm not interested in keeping the branch lengths, your idea is exactly what I need. Thanks a lot!

4.7 years ago

Check if the species of interest are direct children of their common ancestor or simply count the number of children species of their common ancestor. This should be possible with most software with tree traversal capabilities. For example, the R phylobase package has the ancestor() and children() functions.
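For simple cases, the "direct children of a common ancestor" test can be approximated without a tree library by listing every two-leaf clade in the Newick string. This is only a sketch and not the phylobase approach: `list_cherries` is a hypothetical helper name, leaf names are assumed to contain no parentheses, commas, colons or semicolons, and `grep -o` is assumed to be available (it is in GNU and BSD grep, but it is not POSIX).

```shell
# list_cherries FILE...: print each two-leaf clade "(A,B)" found in the
# given Newick files, one per line, after stripping branch lengths.
list_cherries() {
    sed 's/:[0-9][0-9.eE+-]*//g' "$@" |
        grep -oE '\([^(),:;]+,[^(),:;]+\)'
}

# Example call (commented out so the definition can be sourced safely):
# list_cherries test.tree
```

For the example tree in the question this prints `(Spec3,Spec2)`. Only internal nodes with exactly two leaf children are reported, so larger clades and multifurcations still need a real tree parser.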


Dear Jean-Karim Heriche, I will have a look at the R phylobase package. Thanks!

4.7 years ago
Juke34 ★ 6.4k

I know of one tool that does this. It's really powerful, but it's written in Prolog: bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-10-298

Maybe the paper cites other tools.


Thanks for the link, Juke-34. I will have a look at it!