Parsing PhyloXML to split gene trees at duplication nodes
8.0 years ago
spiral01 ▴ 110

Hi, I am new to the PhyloXML format and have a question about parsing files to split gene trees at duplication nodes. I have 9000 files, each containing a gene tree representing the evolution of a gene family, and I want to split each gene tree at each duplication node. My question is simply: how do you isolate the subtree at a duplication node (event type duplication)? Apologies if this is obvious; I cannot find any guidance or examples online.

PhyloXML BioPython gene trees • 1.9k views
8.0 years ago
Eric T. ★ 2.8k

Once you've loaded the PhyloXML data in Biopython, you can treat any internal node of the tree (a Clade object) as a subtree for the purpose of writing it out to another file. Is that what you want to do?

If so, then you can scan the tree for duplication event nodes using the tree methods find_clades or find_elements. Something like this might work:

from Bio import Phylo

tree = Phylo.read("gene_tree.xml", "phyloxml")  # placeholder filename
for i, clade in enumerate(tree.find_clades(events=True)):
    if clade.events.duplications:
        Phylo.write(clade, "subtree_%d.phyloxml" % i, "phyloxml")
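
For readers who want to see what clade.events.duplications corresponds to in the file itself, here is a minimal sketch that finds duplication nodes with Python's standard-library xml.etree.ElementTree instead of Biopython. It is not the approach from the answer above, only an illustration. The sample tree, the gene names, and the helper names duplication_clades and leaf_names are made up for the example, and it assumes the standard PhyloXML namespace and that duplications are recorded as an events element containing a duplications count.

```python
import xml.etree.ElementTree as ET

# Standard PhyloXML namespace; element lookups must be qualified with it.
PHY = "http://www.phyloxml.org"
NS = {"phy": PHY}
CLADE_TAG = "{%s}clade" % PHY

# Hypothetical gene tree: the root is a speciation and its first
# child is a duplication node with two leaf copies below it.
SAMPLE = """\
<phyloxml xmlns="http://www.phyloxml.org">
  <phylogeny rooted="true">
    <clade>
      <events><speciations>1</speciations></events>
      <clade>
        <events><duplications>1</duplications></events>
        <clade><name>geneA1_HUMAN</name></clade>
        <clade><name>geneA2_HUMAN</name></clade>
      </clade>
      <clade><name>geneA_MOUSE</name></clade>
    </clade>
  </phylogeny>
</phyloxml>
"""


def duplication_clades(root):
    """Return every clade element annotated with at least one duplication."""
    hits = []
    for clade in root.iter(CLADE_TAG):
        # Inspect only this clade's own events child, not its descendants'.
        dup = clade.find("phy:events/phy:duplications", NS)
        if dup is not None and int(dup.text) > 0:
            hits.append(clade)
    return hits


def leaf_names(clade):
    """Names of the terminal clades (those with no child clade) under a node."""
    return [
        c.findtext("phy:name", default="", namespaces=NS)
        for c in clade.iter(CLADE_TAG)
        if c.find("phy:clade", NS) is None
    ]


root = ET.fromstring(SAMPLE)
for node in duplication_clades(root):
    print(leaf_names(node))  # ['geneA1_HUMAN', 'geneA2_HUMAN']
```

In Biopython terms, each element returned by duplication_clades plays the role of a Clade whose events.duplications is set, and leaf_names mirrors what get_terminals() would give you.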

Hi Etal, thanks, that is exactly what I was looking for! Is it possible to then remove any subtrees that have fewer than, for example, 3 species from the main gene tree? Given the number of gene trees I have, I am looking to automate the process of throwing out subtrees that do not meet a certain threshold. I assume this can be done with another if statement, but can Biopython pick out species using the tree methods you mentioned, or would I have to define the species names? Thanks.


It depends on how your tree is structured, but if each leaf of the tree represents a species, then you can count the terminal nodes under each clade, e.g. change the line with the if statement to:

    if clade.events.duplications and clade.count_terminals() >= 3:

Otherwise, if you need to parse each terminal's name to extract the species (here assuming the first two words of the tip name identify the species), it could look like this:

    if clade.events.duplications and clade.count_terminals() >= 3:
        unique_names = set(tuple(tip.name.split()[:2]) for tip in clade.get_terminals())
        if len(unique_names) >= 3:
            ...
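
To make the name-based check concrete, here is a small sketch of just the species-extraction step, without Biopython. It assumes, as the snippet above does, that each tip name starts with a two-word species name; the example names and the helpers species_key and count_species are hypothetical, and a real tree may well need a different parsing rule.

```python
def species_key(tip_name):
    """First two whitespace-separated words of a tip name (genus, species)."""
    return tuple(tip_name.split()[:2])


def count_species(tip_names):
    """Number of distinct species among the given tip names."""
    return len({species_key(name) for name in tip_names})


# Hypothetical tip names; with Biopython these would come from
# [tip.name for tip in clade.get_terminals()].
names = [
    "Homo sapiens GENE1",
    "Homo sapiens GENE2",
    "Mus musculus Gene1",
    "Danio rerio gene1a",
]
print(count_species(names))  # 3 distinct species among 4 tips
```

A subtree would then pass the filter when count_species(...) >= 3, which is stricter than count_terminals() >= 3 whenever one species contributes several gene copies.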