**131k**wrote:

I'm trying to build a simple tool that would build a phylogenetic tree from a VCF.

My aim is to verify that the related + sequenced individuals are close in the phylogenetic tree.

My current code works with a small set of related individuals but it fails when some extra individuals are added. Current algorithm:

* a genotype is an enumeration Gtype: HOM_REF, HOM_ALT, HET . 'N/A' is converted to HOM_REF.

* I'm looking iteratively for the 'nodes' having the smallest distance. A node is a set of samples. The algorithm starts with all the possible pairs of samples.

* For a pair of samples

- AA vs AA: = distance=0
- AA vs AB: = distance=5
- AA vs BB: = distance=15

* This is where I'm looking for the right algorithm : when merging two samples to create a 'merged node' how should I calculate the distance ?

my current algorithm for a position in the VCF is:

dist=0; for(Gtype g1: allGenotypes(node1)) for(Gtype g2: allGenotypes(node2)) d += distance(g1,g2) d /= ( countGenotypes(node1)+countGenotypes(node2))

what would be the best way to 'score' the distance between two sets of samples for a given position in the VCF ?

**19k**• written 6.0 years ago by Pierre Lindenbaum ♦

**131k**