Question: VCF + phylogenetic tree: how to calculate the distance between two sets of samples
4.6 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum120k wrote:

I'm trying to build a simple tool that would build a phylogenetic tree from a VCF.

My aim is to verify that the related + sequenced individuals are close in the phylogenetic tree.

 

My current code works with a small set of related individuals, but it fails when some extra individuals are added. The current algorithm:

* A genotype is an enumeration Gtype: HOM_REF, HOM_ALT, HET. 'N/A' is converted to HOM_REF.

* I iteratively look for the pair of 'nodes' with the smallest distance. A node is a set of samples. The algorithm starts with all possible pairs of samples.

* For a pair of samples:

  • AA vs AA: distance = 0
  • AA vs AB: distance = 5
  • AA vs BB: distance = 15

* This is where I'm looking for the right algorithm: when merging two samples to create a 'merged node', how should I calculate the distance?

My current algorithm for one position in the VCF is:

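double d = 0;
// sum the pairwise genotype distances between the two nodes at this position
for (Gtype g1 : allGenotypes(node1))
  for (Gtype g2 : allGenotypes(node2))
      d += distance(g1, g2);
// normalize by the total number of genotypes in both nodes
d /= (countGenotypes(node1) + countGenotypes(node2));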

 

What would be the best way to 'score' the distance between two sets of samples for a given position in the VCF?

 

Tags: distance, tree, phylogenetic, vcf
4.6 years ago by
Philadelphia, PA
Jeremy Leipzig18k wrote:

I would also look at how vcftools calculates relatedness:

https://github.com/shantanusharma/vcftools/blob/b762dd590bdd4b4fb964385747ea382e2978d855/trunk/cpp/variant_file_output.cpp#L4061-L4174

4.6 years ago by
New Zealand
David W4.7k wrote:

It's not quite clear to me what makes a "merged" node. Do you mean a group that has already been "joined" as each other's closest relatives?

If so, this looks like a classic hierarchical clustering problem. You could use UPGMA or neighbor-joining to build your tree from a distance matrix.
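
To make the UPGMA suggestion concrete, here is a minimal sketch of average-linkage clustering over a precomputed sample-by-sample distance matrix. It is not the poster's tool: the class name UpgmaSketch and the methods averageLinkage and cluster are hypothetical, branch lengths are left out, and the matrix is assumed to be symmetric. The point it illustrates is that the distance between two merged nodes is the mean over all sample pairs, so the sum is divided by the product of the group sizes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: UPGMA-style (average-linkage) clustering of samples. */
class UpgmaSketch {

    /** Mean pairwise distance between two groups of sample indices. */
    static double averageLinkage(double[][] dist, List<Integer> a, List<Integer> b) {
        double sum = 0.0;
        for (int i : a) {
            for (int j : b) {
                sum += dist[i][j];
            }
        }
        // Divide by the number of pairs (|a| * |b|), not by |a| + |b|.
        return sum / (a.size() * b.size());
    }

    /** Repeatedly merges the two closest clusters; returns a Newick-like topology string. */
    static String cluster(double[][] dist, String[] names) {
        List<List<Integer>> members = new ArrayList<>();
        List<String> labels = new ArrayList<>();
        // Start with one cluster per sample.
        for (int i = 0; i < names.length; i++) {
            List<Integer> single = new ArrayList<>();
            single.add(i);
            members.add(single);
            labels.add(names[i]);
        }
        while (members.size() > 1) {
            int bestI = 0;
            int bestJ = 1;
            double best = Double.MAX_VALUE;
            for (int i = 0; i < members.size(); i++) {
                for (int j = i + 1; j < members.size(); j++) {
                    double d = averageLinkage(dist, members.get(i), members.get(j));
                    if (d < best) {
                        best = d;
                        bestI = i;
                        bestJ = j;
                    }
                }
            }
            // Merge cluster bestJ into bestI; bestJ > bestI, so bestI stays valid after removal.
            members.get(bestI).addAll(members.get(bestJ));
            labels.set(bestI, "(" + labels.get(bestI) + "," + labels.get(bestJ) + ")");
            members.remove(bestJ);
            labels.remove(bestJ);
        }
        return labels.get(0) + ";";
    }
}
```

Recomputing the average from the original sample distances at every step is quadratic per comparison, which is fine for a handful of samples; a real implementation would update the matrix incrementally after each merge.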

4.6 years ago by
United States
Zev.Kronenberg11k wrote:

I would use pairwise identity by state for the distance matrix.
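
As a rough illustration of what an identity-by-state distance could look like for biallelic sites, here is a small sketch. The class IbsSketch, the method ibsDistance and the integer genotype coding are assumptions made for this example and do not come from any particular tool. Two genotypes share 2, 1 or 0 alleles by state; the distance is one minus the shared fraction over the sites where both samples are called, so missing genotypes are skipped rather than treated as HOM_REF.

```java
/** Hypothetical sketch of a pairwise identity-by-state (IBS) distance for biallelic sites. */
class IbsSketch {

    /** Genotypes are coded as alternate-allele counts: 0 = HOM_REF, 1 = HET, 2 = HOM_ALT. */
    static final int MISSING = -1;

    /** Returns 1 - (shared alleles / (2 * sites compared)), or NaN if no site is comparable. */
    static double ibsDistance(int[] sample1, int[] sample2) {
        if (sample1.length != sample2.length) {
            throw new IllegalArgumentException("samples must cover the same sites");
        }
        int shared = 0;
        int compared = 0;
        for (int i = 0; i < sample1.length; i++) {
            // Skip sites where either genotype is missing.
            if (sample1[i] == MISSING || sample2[i] == MISSING) {
                continue;
            }
            // Alleles shared by state: 2 if identical, 1 for HET vs HOM, 0 for HOM_REF vs HOM_ALT.
            shared += 2 - Math.abs(sample1[i] - sample2[i]);
            compared++;
        }
        if (compared == 0) {
            return Double.NaN;
        }
        return 1.0 - shared / (2.0 * compared);
    }
}
```

Filling a matrix with this value for every pair of samples gives the kind of input a clustering method such as UPGMA or neighbor-joining expects.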
