Hello everyone, I need to build a phylogenetic tree from CNVs rather than SNPs. I would really appreciate it if someone could recommend software, or give some hints about commands and input files.
A phylogenetic tree starts (in the most popular class of methods) with a distance matrix. Once you can calculate some sort of distance between two samples from their CNVs, you are essentially done: you feed the distance matrix into Neighbor-Joining and you get the tree.
Otherwise you need to represent your data as 0/1 values (is a CNV present or not) and use Maximum Parsimony methods.
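To make the "distance matrix into Neighbor-Joining" step concrete, here is a minimal pure-Python sketch of the standard Neighbor-Joining procedure. The function name `neighbor_joining`, the labels A to D and the toy distances are made up for illustration; for real data an established implementation (for example `nj` in the R package ape) is a safer choice than this sketch.

```python
def neighbor_joining(names, dist):
    """Return a Newick string for the Neighbor-Joining tree.

    names: list of sample labels (no spaces or commas, to keep Newick simple)
    dist:  symmetric square matrix (list of lists) of pairwise distances
    """
    nodes = list(names)
    # Distances stored as a dict of dicts keyed by node label.
    d = {a: {b: float(dist[i][j]) for j, b in enumerate(names)}
         for i, a in enumerate(names)}

    while len(nodes) > 2:
        n = len(nodes)
        # Row sums over the nodes that are still active.
        r = {u: sum(d[u][v] for v in nodes) for u in nodes}

        # Pick the pair that minimises the NJ criterion Q.
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                a, b = nodes[i], nodes[j]
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best

        # Branch lengths from the new internal node to a and b.
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        new = f"({a}:{len_a:g},{b}:{len_b:g})"

        # Distances from the new node to every remaining node.
        d[new] = {new: 0.0}
        for c in nodes:
            if c not in (a, b):
                d_new_c = 0.5 * (d[a][c] + d[b][c] - d[a][b])
                d[new][c] = d_new_c
                d[c][new] = d_new_c
        nodes = [c for c in nodes if c not in (a, b)] + [new]

    # Two nodes left: join them with the remaining distance on one branch.
    a, b = nodes
    return f"({a},{b}:{d[a][b]:g});"


# Toy, made-up distances between four samples A, B, C, D.
labels = ["A", "B", "C", "D"]
toy = [
    [0, 3, 5, 6],
    [3, 0, 6, 7],
    [5, 6, 0, 7],
    [6, 7, 7, 0],
]
tree = neighbor_joining(labels, toy)
print(tree)
```

The output is a Newick string, which most tree viewers can read. The tree is unrooted, so where the last branch is attached is arbitrary.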
Thank you for this information. We actually have CNVs for 2000 samples. Could you please tell me how to calculate the distance matrix for CNVs, using the 4 samples below as an example?
Sample 1 Sample 2 Sample 3 Sample 4
chr1:110238914-110324454 chr1:110238914-110324454 chr1:110238914-110387808 chr1:110238914-110324454
chr1:135193671-135391358 chr1:134424148-134494368 chr1:135193671-135391358 chr1:1566976-1619087
chr1:158248715-158335919 chr1:239185878-239256562 chr1:158248715-158335919 chr1:27044617-27097748
chr1:497720-732829 chr1:65562670-65627908 chr1:65562670-65661847 chr1:65562670-65627908
chr1:65562670-65627908 chr10:15373-142831 chr15:10067661-10139569 chr11:1344991-1635177
chr1:823684-1181464 chr10:39446610-40498493 chr15:10606818-10704696 chr11:49540890-49620077
chr10:15273-141831 chr10:41625777-41707007 chr15:8924153-8982938 chr15:10067661-10139569
chr10:39446610-39498493 chr10:64793832-64844203 chr16:46263468-46350038 chr15:10606818-10704696
chr10:41625777-41707007 chr11:1344991-1635177 chr16:46782270-46917860 chr15:8924153-8982938
chr10:64793832-64844203 chr11:49540890-49620077 chr16:46782270-47022029 chr16:46263468-46350038
chr13:53394558-53502226 chr15:8924153-8982938 chr16:49036404-49161759 chr16:46782270-47022029
chr13:66202-597523 chr16:49036404-49161759 chr16:53893227-53985854 chr16:53893227-53985854
chr15:8924153-8982938 chr19:59139745-59196476 chr17:21800797-21873869 chr17:21800797-21873869
chr16:49036404-49161759 chr19:59704180-59777494 chr17:876845-930007 chr17:876845-930007
chr19:59139745-59196476 chr20:25553479-25604482 chr19:59139745-59196476 chr20:25553479-25604482
chr19:59704180-59777494 chr20:26886276-26940171 chr19:59704180-59777494 chr20:26886276-26940171
First, you need to create a table whose row names are the distinct CNVs. Then, for each sample, put 1 if the sample has that CNV and 0 if it does not. Then you can simply calculate the Hamming distance, i.e. the number of CNVs at which two samples differ.
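As a rough sketch of those steps in Python: the snippet below builds the 0/1 table from per-sample CNV lists and computes a Hamming distance. It uses only the first three CNVs of each sample from the question, and it assumes two CNVs are "the same" only when their coordinates match exactly (overlapping regions with different end points count as distinct). The helper names `presence_absence` and `hamming` are invented for this example.

```python
def presence_absence(cnvs_by_sample):
    """Return (sorted list of distinct CNVs, {sample: 0/1 vector over them})."""
    all_cnvs = sorted(set().union(*cnvs_by_sample.values()))
    table = {
        sample: [1 if cnv in cnvs else 0 for cnv in all_cnvs]
        for sample, cnvs in cnvs_by_sample.items()
    }
    return all_cnvs, table


def hamming(u, v):
    """Number of positions at which two equal-length 0/1 vectors differ."""
    return sum(a != b for a, b in zip(u, v))


# First three CNVs per sample, copied from the lists in the question.
cnvs_by_sample = {
    "Sample 1": {"chr1:110238914-110324454", "chr1:135193671-135391358",
                 "chr1:158248715-158335919"},
    "Sample 2": {"chr1:110238914-110324454", "chr1:134424148-134494368",
                 "chr1:239185878-239256562"},
    "Sample 3": {"chr1:110238914-110387808", "chr1:135193671-135391358",
                 "chr1:158248715-158335919"},
    "Sample 4": {"chr1:110238914-110324454", "chr1:1566976-1619087",
                 "chr1:27044617-27097748"},
}

all_cnvs, table = presence_absence(cnvs_by_sample)
for cnv_index, cnv in enumerate(all_cnvs):
    print(cnv, *(table[s][cnv_index] for s in cnvs_by_sample))

print(hamming(table["Sample 1"], table["Sample 3"]))  # differ at 2 CNVs
```

With 2000 samples the same code applies; only the input dictionary grows.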
Thanks again. I have built the matrix you described; it looks like this:
All CNVs Sample 1 Sample 2 Sample 3 Sample 4
chr1:110238914-110324454 0 1 1 0
chr1:135193671-135391358 1 1 0 1
chr1:158248715-158335919 0 1 0 1
chr1:497720-732829 1 1 0 1
chr1:65562670-65627908 1 1 0 1
chr1:823684-1181464 1 1 0 1
chr10:15273-141831 0 0 1 1
chr10:39446610-39498493 0 1 1 0
chr10:41625777-41707007 0 0 1 0
chr10:64793832-64844203 1 1 0 1
chr13:53394558-53502226 0 0 0 1
chr13:66202-597523 1 1 0 1
chr15:8924153-8982938 0 0 0 0
chr16:49036404-49161759 1 1 0 0
chr19:59139745-59196476 0 0 1 0
chr19:59704180-59777494 1 1 1 0
chr1:134424148-134494368 1 0 1 0
I would be very grateful if you could tell me whether the format of the matrix above is right, and point me to a method or commands for computing Hamming distances from it.
Hi, yes, the format looks correct, but why does the CNV chr15:8924153-8982938 (0 0 0 0) have only 0s? That means none of the samples has this CNV. Also, deletions and duplications should be distinguished.
The easiest way to calculate a distance matrix is to use R and the dist command: https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/dist. It calculates distances between ROWS, so you will need to transpose your matrix before passing it in (because you care about the phylogeny of samples, not the phylogeny of CNVs). dist has no Hamming method, but you can start with Manhattan distance; for 0/1 data the two are equal. There is also a "binary" method, which I am unfamiliar with; according to the documentation it ignores positions where both samples are 0, so it will generally not equal the Hamming distance.
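To see why Manhattan and Hamming agree on 0/1 data while a "binary"-style distance does not, here is a small Python sketch on two made-up vectors. The function `binary_distance` follows the description in the dist documentation (the proportion of positions where only one vector is non-zero, among positions where at least one is non-zero); returning 0 when both vectors are all zeros is my own assumption for the sketch.

```python
def hamming(u, v):
    """Count of positions where the two vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def manhattan(u, v):
    """Sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def binary_distance(u, v):
    """Jaccard-style distance: shared absences (0, 0) are ignored."""
    at_least_one = sum(1 for a, b in zip(u, v) if a or b)
    only_one = sum(1 for a, b in zip(u, v) if bool(a) != bool(b))
    # Assumption for this sketch: two all-zero vectors are at distance 0.
    return only_one / at_least_one if at_least_one else 0.0


# Made-up presence/absence vectors for two samples over six CNVs.
u = [1, 1, 0, 0, 0, 0]
v = [1, 0, 1, 0, 0, 0]

print(hamming(u, v))          # 2
print(manhattan(u, v))        # 2, same as Hamming because values are 0/1
print(binary_distance(u, v))  # 2 of the 3 informative positions differ
```

Adding more CNVs that are absent in both samples leaves all three values unchanged here, but if you normalise Hamming by the number of CNVs it shrinks, while the binary-style distance stays the same.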
If you use Python, I am sure there is a function in NumPy or SciPy that does this in one command (for example scipy.spatial.distance.pdist), but I would just write a nested for loop: for each sample X, for each sample Y, calculate the distance between X and Y whenever they are different samples.
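A minimal sketch of that nested loop in plain Python, using the 0/1 values copied from the matrix posted above (rows are CNVs in the posted order, columns are Sample 1 to Sample 4). The variable names are mine; the matrix is transposed first so that each vector describes one sample.

```python
samples = ["Sample 1", "Sample 2", "Sample 3", "Sample 4"]

# One row per CNV, one column per sample, as in the posted matrix.
rows = [
    [0, 1, 1, 0], [1, 1, 0, 1], [0, 1, 0, 1], [1, 1, 0, 1],
    [1, 1, 0, 1], [1, 1, 0, 1], [0, 0, 1, 1], [0, 1, 1, 0],
    [0, 0, 1, 0], [1, 1, 0, 1], [0, 0, 0, 1], [1, 1, 0, 1],
    [0, 0, 0, 0], [1, 1, 0, 0], [0, 0, 1, 0], [1, 1, 1, 0],
    [1, 0, 1, 0],
]

# Transpose: columns[k] is the 0/1 vector of sample k over all CNVs.
columns = list(zip(*rows))

n = len(samples)
dist = [[0] * n for _ in range(n)]
for x in range(n):
    for y in range(n):
        if x != y:
            # Hamming distance: number of CNVs where the two samples differ.
            dist[x][y] = sum(a != b for a, b in zip(columns[x], columns[y]))

for name, row in zip(samples, dist):
    print(name, row)
```

The resulting square matrix is symmetric with zeros on the diagonal, which is the shape Neighbor-Joining tools expect.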
Yes, you are right: another sample that I did not include here has this CNV (chr15:8924153-8982938). Sorry for the follow-up messages. To distinguish between deletions and duplications, do you mean I should add a column next to the CNV names holding a code (such as 0 for Del and 1 for Dup)?
It is better not to mix deletions and duplications at all. They are different events that happened at different evolutionary times. Or, if you want, you can put the exact copy number in place of the 0s and 1s (0, 1, 2, 3, 4, 5, 6, etc.) and calculate Manhattan distance instead of Hamming distance.
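Both options can be sketched in a few lines of Python. The copy numbers below are invented, and treating 2 as the normal (diploid) copy number is an assumption of this example; the sample names S1 to S3 and the helper names are hypothetical too.

```python
def manhattan(u, v):
    """Sum of absolute copy-number differences across CNV regions."""
    return sum(abs(a - b) for a, b in zip(u, v))


def hamming(u, v):
    """Number of regions where two 0/1 vectors differ."""
    return sum(a != b for a, b in zip(u, v))


# Hypothetical copy numbers for three samples over four CNV regions.
copy_number = {
    "S1": [2, 1, 3, 2],
    "S2": [2, 0, 3, 4],
    "S3": [2, 2, 2, 2],
}

# Option 1: Manhattan distance directly on the copy numbers.
print(manhattan(copy_number["S1"], copy_number["S2"]))  # 3
print(manhattan(copy_number["S1"], copy_number["S3"]))  # 2

# Option 2: keep deletions and duplications apart as two 0/1 tables.
NORMAL = 2  # assumed baseline copy number
deletions = {s: [1 if c < NORMAL else 0 for c in cn]
             for s, cn in copy_number.items()}
duplications = {s: [1 if c > NORMAL else 0 for c in cn]
                for s, cn in copy_number.items()}

print(hamming(deletions["S1"], deletions["S2"]))        # 0
print(hamming(duplications["S1"], duplications["S2"]))  # 1
```

With the second option you get one distance matrix (and one tree) for deletions and another for duplications, rather than a single tree that mixes both kinds of events.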