Just for fun I had ChatGPT try this too. Here's what it came up with.
A few notable differences:
- It uses NumPy.
- It uses a full matrix rather than the lower-triangular half matrix that Biopython uses.
- It only produces a cladogram, so there are no branch lengths (I did point this out to the AI, and it can produce updated code to change that). YMMV.
import numpy as np


def create_newick_tree(dist_matrix, labels):
    """Create a Newick format cladogram (no branch lengths) from a distance matrix and labels."""
    # Work on a float copy so the diagonal can be masked with inf
    dist = np.array(dist_matrix, dtype=float)
    np.fill_diagonal(dist, np.inf)
    # Use plain Python strings; a fixed-width NumPy string array would truncate merged labels
    node_labels = [str(label) for label in labels]
    # Merge the two closest clusters until only one remains
    while len(node_labels) > 1:
        # Find the two closest clusters (the diagonal is inf, so it is never picked)
        min_i, min_j = np.unravel_index(np.argmin(dist), dist.shape)
        min_i, min_j = sorted((int(min_i), int(min_j)))
        # The merged cluster takes over slot min_i
        node_labels[min_i] = "(" + node_labels[min_i] + "," + node_labels[min_j] + ")"
        # Distance to the merged cluster is the plain average of the two old distances
        for k in range(len(node_labels)):
            if k != min_i and k != min_j:
                dist[min_i, k] = (dist[min_i, k] + dist[min_j, k]) / 2.0
                dist[k, min_i] = dist[min_i, k]
        # Drop the row, column and label of the absorbed cluster
        dist = np.delete(dist, min_j, axis=0)
        dist = np.delete(dist, min_j, axis=1)
        del node_labels[min_j]
    # Return the Newick format string
    return node_labels[0] + ";"


# Example usage
dist_matrix = np.array([[0, 5, 9, 9], [5, 0, 10, 10], [9, 10, 0, 8], [9, 10, 8, 0]])
labels = np.array(["A", "B", "C", "D"])
newick_tree = create_newick_tree(dist_matrix, labels)
print(newick_tree)  # ((A,B),(C,D));
It claims the result will be:
DISCLAIMER: I haven't confirmed that this is correct beyond a quick glance. I'm just including it for 'fun'.
I did challenge it to re-implement this task using Biopython, and it gave a very similar solution to mine above.
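On the missing branch lengths mentioned above: here is a minimal sketch of how a UPGMA-style version could emit them. This is my own illustration, not what ChatGPT produced. The function name `upgma_newick` is made up, it uses plain lists instead of NumPy, and it assumes a symmetric matrix with a zero diagonal. Each merge is placed at half the distance between the two clusters, and a branch length is the difference between the parent height and the child height.

```python
def upgma_newick(dist_matrix, labels):
    """Return a Newick string with branch lengths, using UPGMA clustering."""
    # Each cluster is (newick_text, height_above_leaves, number_of_leaves)
    clusters = [(str(name), 0.0, 1) for name in labels]
    dist = [[float(x) for x in row] for row in dist_matrix]

    while len(clusters) > 1:
        n = len(clusters)
        # Closest pair (i < j); ties resolve to the first pair found
        i, j = min(
            ((a, b) for a in range(n) for b in range(a + 1, n)),
            key=lambda pair: dist[pair[0]][pair[1]],
        )
        text_i, height_i, size_i = clusters[i]
        text_j, height_j, size_j = clusters[j]

        # The new internal node sits at half the inter-cluster distance
        height = dist[i][j] / 2.0
        merged = f"({text_i}:{height - height_i:g},{text_j}:{height - height_j:g})"

        # Size-weighted average distance from the merged cluster to every other cluster
        for k in range(n):
            if k != i and k != j:
                d = (size_i * dist[i][k] + size_j * dist[j][k]) / (size_i + size_j)
                dist[i][k] = dist[k][i] = d

        # The merged cluster replaces i; cluster j is removed everywhere
        clusters[i] = (merged, height, size_i + size_j)
        del clusters[j]
        del dist[j]
        for row in dist:
            del row[j]

    return clusters[0][0] + ";"


example = [[0, 5, 9, 9], [5, 0, 10, 10], [9, 10, 0, 8], [9, 10, 8, 0]]
print(upgma_newick(example, ["A", "B", "C", "D"]))
# ((A:2.5,B:2.5):2.25,(C:4,D:4):0.75);
```

Note that UPGMA assumes a constant evolutionary rate, so the tree is ultrametric; if that doesn't hold for your data, NJ is probably the better choice.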
It may be useful to add "pairwise distance matrix" to the title itself, since your question is specifically about that.
Where are you getting the matrix from? Can the tool that gave it to you not create a tree? It's unusual for a tool to only spit out the matrix.
If you can get the distance matrix into the right format, it's possible you could 'hack' it into the Bio.Phylo.TreeConstruction module and 'trick' it into thinking it created the object in the first place, so you can apply all the other methods it has available.
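As a rough sketch of what that could look like: Biopython's `DistanceMatrix` expects the names plus a lower-triangular list of rows with the diagonal included, so a full square matrix has to be trimmed first. The helper names `to_lower_triangle` and `build_tree` are made up for this example, and I'm assuming the input is square and symmetric. The Biopython import is kept inside `build_tree` so the conversion helper works on its own.

```python
def to_lower_triangle(full_matrix):
    """Trim a square symmetric matrix to lower-triangular rows, diagonal included."""
    return [
        [float(full_matrix[i][j]) for j in range(i + 1)]
        for i in range(len(full_matrix))
    ]


def build_tree(names, full_matrix, method="upgma"):
    """Build a Bio.Phylo tree from a full distance matrix (needs Biopython installed)."""
    # Imported here so to_lower_triangle() is usable without Biopython
    from Bio.Phylo.TreeConstruction import DistanceMatrix, DistanceTreeConstructor

    dm = DistanceMatrix(list(names), to_lower_triangle(full_matrix))
    constructor = DistanceTreeConstructor()
    # Anything other than "nj" falls back to UPGMA in this sketch
    return constructor.nj(dm) if method == "nj" else constructor.upgma(dm)


print(to_lower_triangle([[0, 5, 9], [5, 0, 10], [9, 10, 0]]))
# [[0.0], [5.0, 0.0], [9.0, 10.0, 0.0]]
```

The tree returned by `build_tree(names, matrix)` is a normal Bio.Phylo tree, so something like `Phylo.write(tree, "tree.nwk", "newick")` or `Phylo.draw_ascii(tree)` should work on it afterwards.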
You haven't specified what kind of tree you intend to derive from this table, though: NJ, UPGMA, ML, etc.?