Question: Phylogenetic tree for CNVs


hosein_salehi6 wrote:

Hello everyone, I need to build a phylogenetic tree for CNVs, not SNPs. I would really appreciate it if someone could recommend software, or give some hints about commands and input files.

In the most popular class of methods, a phylogenetic tree starts with a distance matrix. Once you can calculate some sort of distance between two samples based on their CNVs, you are essentially done: feed the distance matrix into Neighbor-Joining and you have your tree.

Otherwise, you need to represent your data as 0/1s (is a given CNV present or not) and use Maximum Parsimony methods.
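To make the distance-based route concrete, here is a minimal sketch of Neighbor-Joining in plain Python. The function name `neighbor_joining`, the sample labels and the toy distance matrix are all made up for illustration; for 2000 real samples an established phylogenetics package would be the safer choice.

```python
def neighbor_joining(names, dist):
    """Build a tree by Neighbor-Joining and return it as a Newick string.

    names: list of sample labels; dist: square symmetric list of lists.
    """
    nodes = list(names)
    # Distances kept in a dict of dicts, keyed by node label (a Newick fragment).
    D = {a: {b: float(dist[i][j]) for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    while len(nodes) > 2:
        n = len(nodes)
        # Total distance from each node to all the others.
        r = {a: sum(D[a][b] for b in nodes if b != a) for a in nodes}
        # Pick the pair that minimises the NJ criterion Q.
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * D[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        # Branch lengths from the new internal node to a and to b.
        len_a = 0.5 * D[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        len_b = D[a][b] - len_a
        new = f"({a}:{len_a:g},{b}:{len_b:g})"
        # Distances from the new internal node to every remaining node.
        D[new] = {}
        for c in nodes:
            if c not in (a, b):
                D[new][c] = D[c][new] = 0.5 * (D[a][c] + D[b][c] - D[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [new]
    a, b = nodes
    return f"({a}:{D[a][b]:g},{b}:0);"


# Toy distance matrix for four hypothetical samples.
names = ["A", "B", "C", "D"]
dist = [
    [0, 3, 5, 6],
    [3, 0, 6, 7],
    [5, 6, 0, 7],
    [6, 7, 7, 0],
]
tree = neighbor_joining(names, dist)
print(tree)
```

The resulting Newick string can be opened in any tree viewer. Ties in Q are broken by taking the first pair found, which is an arbitrary choice of this sketch.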

Thanks for this information. Actually, we have CNVs for 2000 samples. Could you please tell me how to calculate the distance matrix for CNVs in 4 samples, as below?

First, you need to create a table whose row names are the distinct CNVs. Then, for each sample, put 1 if the sample has that CNV and 0 if it does not. After that, you can simply calculate the Hamming distance.
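As a rough sketch of that recipe in Python: the sample names and CNV coordinates below are invented for illustration, and the real input would come from your CNV caller's output.

```python
from itertools import combinations

# Hypothetical CNV calls per sample (coordinates are made up for illustration).
calls = {
    "S1": {"chr1:1000-5000", "chr2:200-900"},
    "S2": {"chr1:1000-5000"},
    "S3": {"chr2:200-900", "chr3:50-400"},
    "S4": set(),
}

# Rows of the table: every distinct CNV seen in any sample.
cnvs = sorted(set().union(*calls.values()))

# table[cnv][sample] is 1 if the sample carries the CNV, else 0.
table = {cnv: {s: int(cnv in calls[s]) for s in calls} for cnv in cnvs}


def hamming(a, b):
    """Number of CNVs on which samples a and b disagree."""
    return sum(table[cnv][a] != table[cnv][b] for cnv in cnvs)


# One distance per unordered pair of samples.
distances = {(a, b): hamming(a, b) for a, b in combinations(calls, 2)}
for pair, d in distances.items():
    print(pair, d)
```

Note that a CNV only ends up as a row if at least one sample carries it, so this construction cannot produce an all-zero row.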

Thanks again. I have obtained the matrix you described, as in the file below:

I would be very grateful if you could tell me whether the format of the above matrix is right, and suggest a method or commands to obtain Hamming distances for this matrix.

Hi, yes, it looks like a correct format, but why does the CNV chr15:8924153-8982938 0 0 0 0 have only 0s? It means that none of the samples has this CNV. Also, deletions and duplications should be distinguished.

The easiest way to calculate a distance matrix is to use R and the dist command: https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/dist. It calculates distances between ROWS, so you will need to transpose your matrix before passing it in (because you care about the phylogeny of samples, not the phylogeny of CNVs). It has no Hamming distance option, but you can start with Manhattan distance; on 0/1 data the two should be equal. Or maybe binary distance? I am unfamiliar with that one.

If you use Python, I am sure there is a function in numpy that does this in one command, but I would write a nested for loop: for each sample X, for each sample Y, calculate the distance between X and Y if they are different samples.
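The nested-loop approach might look like the sketch below, using only the standard library. The matrix values and sample names are invented. The `binary_distance` function follows my reading of the "binary" method in the R dist documentation (proportion of positions where only one sample has the CNV, among positions where at least one does); treat that definition, and the choice to return 0 when both samples have no CNVs at all, as assumptions.

```python
# Hypothetical CNV-by-sample matrix: rows are CNVs, columns are samples.
samples = ["S1", "S2", "S3", "S4"]
rows = [
    [1, 1, 0, 0],  # cnv_a
    [1, 0, 1, 0],  # cnv_b
    [0, 0, 1, 0],  # cnv_c
]

# Transpose so that each sample gets its own vector of 0/1 values.
profiles = dict(zip(samples, zip(*rows)))


def manhattan(x, y):
    """Sum of absolute differences; equals Hamming distance on 0/1 data."""
    return sum(abs(a - b) for a, b in zip(x, y))


def binary_distance(x, y):
    """Mismatches divided by positions where at least one sample has a CNV."""
    either = sum(1 for a, b in zip(x, y) if a or b)
    only_one = sum(1 for a, b in zip(x, y) if bool(a) != bool(b))
    return only_one / either if either else 0.0  # assumption for all-zero pairs


# Full symmetric matrices, filled with two nested loops over the samples.
manhattan_matrix = {x: {y: 0 for y in samples} for x in samples}
binary_matrix = {x: {y: 0.0 for y in samples} for x in samples}
for x in samples:
    for y in samples:
        if x != y:
            manhattan_matrix[x][y] = manhattan(profiles[x], profiles[y])
            binary_matrix[x][y] = binary_distance(profiles[x], profiles[y])

for x in samples:
    print(x, [manhattan_matrix[x][y] for y in samples])
```

The two measures differ in how they treat shared absences: Manhattan counts every mismatch equally, while the binary version ignores CNVs that neither sample carries.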

Yes, you are right: another sample that I did not include has this CNV (chr15:8924153-8982938). Sorry for the follow-up messages. To distinguish between deletions and duplications, do you mean I should add a column in front of all the CNVs and put numbers in it (such as 0 and 1 for Del and Dup, respectively)?

Better not to mix deletions and duplications at all. These are different events that happened at different evolutionary times. Or, if you want, you can put the exact copy number instead of 0s and 1s (0, 1, 2, 3, 4, 5, 6, etc.) and calculate Manhattan distance instead of Hamming distance.
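A small sketch of why the copy-number encoding helps. The copy numbers and sample names are invented, and a normal copy number of 2 (diploid) is an assumption of the example.

```python
from itertools import combinations

NORMAL = 2  # assumed diploid baseline

# Hypothetical copy number per region (same region order for every sample).
copy_number = {
    "S1": [1, 2, 2],  # deletion in region 1
    "S2": [3, 2, 2],  # duplication in region 1
    "S3": [2, 2, 4],  # duplication in region 3
}


def manhattan(x, y):
    """Sum of absolute copy-number differences across regions."""
    return sum(abs(a - b) for a, b in zip(x, y))


# Presence/absence coding: 1 wherever the copy number deviates from normal.
presence = {s: [int(cn != NORMAL) for cn in cns] for s, cns in copy_number.items()}

cn_dist = {(a, b): manhattan(copy_number[a], copy_number[b])
           for a, b in combinations(copy_number, 2)}
binary_dist = {(a, b): manhattan(presence[a], presence[b])
               for a, b in combinations(presence, 2)}

for pair in cn_dist:
    print(pair, "copy-number:", cn_dist[pair], "presence/absence:", binary_dist[pair])
```

With presence/absence coding, S1 (deletion) and S2 (duplication) look identical in region 1, so their distance is 0; with copy numbers they are 2 apart.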
