Question: VCF + Phylogenetic tree: how to calculate the distance between two sets of samples
0
4.6 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum120k wrote:

I'm trying to build a simple tool that would build a phylogenetic tree from a VCF.

My aim is to verify that the related + sequenced individuals are close in the phylogenetic tree.

My current code works with a small set of related individuals, but it fails when some extra individuals are added. Current algorithm:

* A genotype is an enumeration Gtype: HOM_REF, HOM_ALT, HET. 'N/A' is converted to HOM_REF.

* I iteratively look for the pair of 'nodes' with the smallest distance. A node is a set of samples. The algorithm starts with all the possible pairs of samples.

* For a pair of samples:

• AA vs AA: distance = 0
• AA vs AB: distance = 5
• AA vs BB: distance = 15

* This is where I'm looking for the right algorithm: when merging two samples to create a 'merged node', how should I calculate the distance?

My current algorithm for a position in the VCF is:

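```
// Per-site distance between two nodes (each node is a set of samples).
double d = 0;
for (Gtype g1 : allGenotypes(node1))
    for (Gtype g2 : allGenotypes(node2))
        d += distance(g1, g2);
// Normalise by the total number of genotypes in both nodes.
d /= (countGenotypes(node1) + countGenotypes(node2));
```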
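For reference, here is a self-contained sketch of the scoring described above. The class name `SiteScore` and the list-based node representation are hypothetical, chosen only for illustration. It also assumes that the 0/5/15 table depends only on how many alternate alleles separate the two genotypes, so AB vs BB scores 5 like AA vs AB; the list above does not state that case.

```java
import java.util.Arrays;
import java.util.List;

final class SiteScore {
    /** Genotype classes from the question; missing calls are assumed to be mapped to HOM_REF upstream. */
    enum Gtype { HOM_REF, HET, HOM_ALT }

    // Score indexed by the number of alt alleles separating two genotypes.
    // Using the same table for HET vs HOM_ALT is an assumption.
    private static final int[] SCORE = {0, 5, 15};

    /** Pairwise score between two genotypes at one site. */
    static int distance(Gtype g1, Gtype g2) {
        return SCORE[Math.abs(g1.ordinal() - g2.ordinal())];
    }

    /** Per-site score between two nodes, normalised as in the snippet above. */
    static double nodeDistance(List<Gtype> node1, List<Gtype> node2) {
        double d = 0;
        for (Gtype g1 : node1)
            for (Gtype g2 : node2)
                d += distance(g1, g2);
        // Same denominator as the current algorithm: total genotype count.
        return d / (node1.size() + node2.size());
    }

    public static void main(String[] args) {
        List<Gtype> node1 = Arrays.asList(Gtype.HOM_REF, Gtype.HET);
        List<Gtype> node2 = Arrays.asList(Gtype.HOM_ALT);
        System.out.println(nodeDistance(node1, node2));
    }
}
```

Note that the summed score grows with the product of the two node sizes while the denominator grows with their sum, so larger nodes tend to get larger scores under this normalisation.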

What would be the best way to 'score' the distance between two sets of samples for a given position in the VCF?

2
4.6 years ago by
Jeremy Leipzig18k wrote:

I would also look at how vcftools calculates relatedness:

https://github.com/shantanusharma/vcftools/blob/b762dd590bdd4b4fb964385747ea382e2978d855/trunk/cpp/variant_file_output.cpp#L4061-L4174

2
4.6 years ago by
New Zealand
David W4.7k wrote:

It's not quite clear to me what makes a "merged" node. Do you mean a group whose members are already "joined" as each other's closest relatives?

If so, this seems like a classic hierarchical clustering problem. You could use UPGMA or neighbor-joining to create your tree from a distance matrix.
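To make the UPGMA suggestion concrete, here is a minimal sketch in Java, assuming you already have a symmetric sample-by-sample distance matrix. The class name `Upgma` and the method `cluster` are hypothetical. It only returns the tree topology as a Newick string without branch lengths, and it does not implement neighbor-joining.

```java
final class Upgma {
    /** Clusters samples by average linkage (UPGMA) and returns a Newick topology. */
    static String cluster(String[] names, double[][] dist) {
        int n = names.length;
        String[] label = names.clone();
        int[] size = new int[n];
        boolean[] active = new boolean[n];
        double[][] d = new double[n][];
        for (int i = 0; i < n; i++) {
            size[i] = 1;
            active[i] = true;
            d[i] = dist[i].clone(); // work on a copy of the input matrix
        }
        for (int step = 1; step < n; step++) {
            // Find the closest pair of clusters that are still active.
            int bi = -1, bj = -1;
            for (int i = 0; i < n; i++) {
                if (!active[i]) continue;
                for (int j = i + 1; j < n; j++) {
                    if (!active[j]) continue;
                    if (bi < 0 || d[i][j] < d[bi][bj]) { bi = i; bj = j; }
                }
            }
            // Size-weighted update: the merged distance stays equal to the
            // mean over all sample pairs between the two clusters.
            for (int k = 0; k < n; k++) {
                if (!active[k] || k == bi || k == bj) continue;
                double avg = (size[bi] * d[bi][k] + size[bj] * d[bj][k])
                           / (size[bi] + size[bj]);
                d[bi][k] = avg;
                d[k][bi] = avg;
            }
            // The merged cluster is stored in slot bi; slot bj is retired.
            label[bi] = "(" + label[bi] + "," + label[bj] + ")";
            size[bi] += size[bj];
            active[bj] = false;
        }
        for (int i = 0; i < n; i++)
            if (active[i]) return label[i] + ";";
        return ";"; // no samples
    }

    public static void main(String[] args) {
        String[] names = {"A", "B", "C", "D"};
        double[][] dist = {
            {0, 2, 10, 10},
            {2, 0, 10, 10},
            {10, 10, 0, 4},
            {10, 10, 4, 0}
        };
        System.out.println(cluster(names, dist)); // prints ((A,B),(C,D));
    }
}
```

The key point for the "merged node" question is the size-weighted update: the distance from a merged node to any other node is the average over all sample pairs, so it is effectively divided by the product of the two node sizes.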

1
4.6 years ago by
United States
Zev.Kronenberg11k wrote:

I would use pairwise identity by state for the distance matrix.
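A rough sketch of what that could look like in Java, assuming each sample is encoded as an array of alternate-allele counts per site (0 = HOM_REF, 1 = HET, 2 = HOM_ALT, -1 = missing). The class name `IbsDistance` and this encoding are hypothetical. Unlike the question, which converts 'N/A' to HOM_REF, this sketch skips sites where either genotype is missing; that choice is an assumption.

```java
final class IbsDistance {
    /**
     * IBS distance between two samples: 1 - mean(IBS) / 2, where IBS at a
     * site is the number of shared alleles (0, 1 or 2). Both arrays are
     * assumed to have the same length. Returns NaN if no site is called
     * in both samples.
     */
    static double distance(int[] s1, int[] s2) {
        int shared = 0;
        int sites = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] < 0 || s2[i] < 0) continue; // skip missing genotypes
            shared += 2 - Math.abs(s1[i] - s2[i]);
            sites++;
        }
        return sites == 0 ? Double.NaN : 1.0 - shared / (2.0 * sites);
    }

    /** Symmetric matrix of pairwise IBS distances, one row per sample. */
    static double[][] matrix(int[][] samples) {
        int n = samples.length;
        double[][] m = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                m[i][j] = m[j][i] = distance(samples[i], samples[j]);
        return m;
    }

    public static void main(String[] args) {
        int[][] samples = {
            {0, 1, 2, 0},
            {0, 1, 0, -1},
            {2, 2, 0, 1}
        };
        for (double[] row : matrix(samples))
            System.out.println(java.util.Arrays.toString(row));
    }
}
```

The resulting matrix can be passed directly to any distance-based tree method.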