I have a Newick tree that is built by comparing similarity (euclidean distance) of Position Weight Matrices (PWMs or PSSMs) of DNA regulatory motifs that are ~5-9 bp long sequences.
An interactive version of my tree is up on iTol (here, which you can freely play with - just press "update tree" after setting your parameters:
My specific goal: to collapse the motifs (tips/terminal nodes/leaves) together if their average distances to the nearest parent clade is < X (ETE2 Python package). This is biologically interesting since some of the gene regulatory DNA motifs may be homologous (paralogues or orthologues) with one another. This collapsing can be done via the iTol GUI linked above, e.g. if you choose X = 0.001 then some motifs become collapsed into triangles (motif families).
My question: How do I know which value of X is appropriate for maximising the biological relevance of the collapsed motifs? Perhaps I can plot some statistic against the value of X? I've tried plotting X vs. mean clusterSize but I don't see an obvious "step increase" to inform me which value of X to use: