Your tree resembles a regular gene tree with duplications. However, it's not clear to me whether 1) the duplicated items always look like your example (all branches from the same species grouped together), or 2) you could also have more complex patterns like ((spAseq1, spBseq1), (spAseq2, spBseq2)).
If 1), you just need to collapse the species-specific nodes into a single branch, choosing a method for summarising the distances therein (e.g. max branch length, average, sum, etc.). You could easily do this programmatically using any phyloinformatics toolkit. I use ETE, but it would also be possible with Biopython (Bio.Phylo), BioPerl, Bio::Phylo, etc.
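To make the collapsing step concrete, here is a minimal dependency-free sketch. It is not ETE code: the `Node` class, the `species_of` naming rule (species prefix before an underscore, e.g. `spA_seq1`) and the example tree are all assumptions made for illustration. The same idea carries over to any toolkit that lets you traverse a tree and read branch lengths.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Node:
    # Hypothetical minimal tree node: `dist` is the length of the branch above it.
    name: str = ""
    dist: float = 0.0
    children: list = field(default_factory=list)

    def is_leaf(self):
        return not self.children


def species_of(leaf_name):
    # Assumption: leaf names look like "spA_seq1"; the species is the prefix.
    return leaf_name.split("_")[0]


def species_below(node):
    """Set of species found among the leaves under `node`."""
    if node.is_leaf():
        return {species_of(node.name)}
    return set().union(*(species_below(c) for c in node.children))


def leaf_depths(node):
    """Distances from `node` down to each of its leaves (node.dist excluded)."""
    if node.is_leaf():
        return [0.0]
    return [c.dist + d for c in node.children for d in leaf_depths(c)]


def collapse(node, summarise=mean):
    """Replace every single-species clade by one leaf named after the species.

    The new branch length is the clade's own branch plus `summarise` applied
    to the root-to-leaf distances inside the clade (mean, max, sum, ...).
    """
    species = species_below(node)
    if len(species) == 1:
        extra = 0.0 if node.is_leaf() else summarise(leaf_depths(node))
        return Node(next(iter(species)), node.dist + extra)
    return Node(node.name, node.dist, [collapse(c, summarise) for c in node.children])


# Example: ((spA_seq1:1, spA_seq2:3):1, (spB_seq1:2, spC_seq1:2):1)
tree = Node(children=[
    Node(dist=1.0, children=[Node("spA_seq1", 1.0), Node("spA_seq2", 3.0)]),
    Node(dist=1.0, children=[Node("spB_seq1", 2.0), Node("spC_seq1", 2.0)]),
])
collapsed = collapse(tree)               # average of the within-species depths
collapsed_max = collapse(tree, max)      # longest within-species path instead
```

Passing `summarise=max` or `summarise=sum` switches the summary method without touching the traversal.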
If 2), you would need to decompose your gene tree into all possible species subtrees. The TreeKO methodology is good for this, and I recently implemented it in ETE so it can also be used programmatically. In brief, you first decompose your tree into multiple subtrees using the tree.get_speciation_trees() method. Then you need to build some kind of consensus from the resulting subtrees. For the consensus, you could either compute a distance matrix by averaging the all-against-all distances observed among the species nodes, or build a consensus tree (check Biopython for this).
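The distance-averaging option can be sketched as follows. This is only an illustration under stated assumptions: the species subtrees are hard-coded as nested tuples rather than taken from ETE, each species is assumed to appear at most once per subtree, and the helper names `pairwise` and `average_distances` are made up for this example. A species pair that is missing from some subtrees is averaged only over the subtrees that contain it.

```python
from collections import defaultdict

# Assumed tree encoding: a leaf is (species_name, dist);
# an internal node is ([child, child, ...], dist).


def pairwise(node, pairs):
    """Fill `pairs` with leaf-to-leaf path lengths found under `node`.

    Returns {leaf: distance from the top of this node's branch to the leaf}.
    """
    content, dist = node
    if isinstance(content, str):
        return {content: dist}
    seen = {}
    for child in content:
        depths = pairwise(child, pairs)
        # Leaves from different children are joined through this node.
        for a, da in seen.items():
            for b, db in depths.items():
                pairs[frozenset((a, b))] = da + db
        seen.update(depths)
    return {name: d + dist for name, d in seen.items()}


def average_distances(subtrees):
    """Average each species-pair distance over the subtrees that contain it."""
    observed = defaultdict(list)
    for tree in subtrees:
        pairs = {}
        pairwise(tree, pairs)
        for pair, d in pairs.items():
            observed[pair].append(d)
    return {tuple(sorted(p)): sum(ds) / len(ds) for p, ds in observed.items()}


# Two hypothetical species subtrees:
#   (spA:1, spB:1)  and  (spA:2, (spB:1, spC:1):1)
subtrees = [
    ([("spA", 1.0), ("spB", 1.0)], 0.0),
    ([("spA", 2.0), ([("spB", 1.0), ("spC", 1.0)], 1.0)], 0.0),
]
avg = average_distances(subtrees)
```

The resulting dictionary can be turned into a matrix and fed to any distance-based tree builder (NJ, UPGMA) if a single species tree is wanted.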
3.3 years ago by
jhc • 2.8k