Question

Phylogenetic tree for CNVs

4.6 years ago

hosin • 0

Hello everyone, I need to build a phylogenetic tree from CNVs rather than SNPs. I would really appreciate it if someone could recommend software, or give hints about the commands and input files.

genome • 2.0k views

A phylogenetic tree starts (in the most popular class of methods) with a distance matrix. Once you can calculate some sort of distance between two samples from their CNVs, you are done: feed the distance matrix into Neighbor-Joining and you have your tree.

Otherwise, you need to represent your data as 0/1 values (whether each CNV is present or not) and use Maximum Parsimony methods.
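
As an illustration of the distance-matrix route, here is a compact pure-Python sketch of neighbor joining that returns a Newick string. It is a teaching sketch under assumptions: the function name `neighbor_joining` is made up, the input must be a symmetric matrix for at least three samples, ties are broken by taking the first minimal pair, and negative branch lengths are not corrected. For thousands of samples an established implementation would be a better choice.

```python
def neighbor_joining(labels, dist):
    """Build a neighbor-joining tree and return it as a Newick string.

    labels: list of unique sample names.
    dist:   symmetric matrix (list of lists) of pairwise distances.
    """
    if len(labels) < 3:
        raise ValueError("need at least three samples")
    nodes = list(labels)
    # Dict-of-dicts copy of the matrix, keyed by node name.
    d = {a: {b: float(dist[i][j]) for j, b in enumerate(labels)}
         for i, a in enumerate(labels)}

    while len(nodes) > 3:
        n = len(nodes)
        total = {a: sum(d[a][b] for b in nodes) for a in nodes}
        # Pick the pair with the smallest Q value.
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                a, b = nodes[i], nodes[j]
                q = (n - 2) * d[a][b] - total[a] - total[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        # Branch lengths from the joined pair to the new internal node.
        len_a = 0.5 * d[a][b] + (total[a] - total[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        new = f"({a}:{len_a:g},{b}:{len_b:g})"
        # Distances from the new node to every remaining node.
        d[new] = {new: 0.0}
        for c in nodes:
            if c not in (a, b):
                d[new][c] = d[c][new] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [new]

    # Join the last three nodes at a single central node.
    x, y, z = nodes
    len_x = 0.5 * (d[x][y] + d[x][z] - d[y][z])
    len_y = 0.5 * (d[x][y] + d[y][z] - d[x][z])
    len_z = 0.5 * (d[x][z] + d[y][z] - d[x][y])
    return f"({x}:{len_x:g},{y}:{len_y:g},{z}:{len_z:g});"


# Toy five-sample distance matrix (illustrative values only).
toy_labels = ["a", "b", "c", "d", "e"]
toy_dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
newick = neighbor_joining(toy_labels, toy_dist)
```

The resulting Newick string can be opened in any tree viewer that reads that format.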

ADD REPLY • link 4.6 years ago by German.M.Demidov ★ 3.0k

Thank you for this information. We actually have CNVs for 2000 samples. Could you please tell me how to calculate the distance matrix for the CNVs in four samples, as below?

Sample 1                  Sample 2                  Sample 3                  Sample 4
chr1:110238914-110324454  chr1:110238914-110324454  chr1:110238914-110387808  chr1:110238914-110324454
chr1:135193671-135391358  chr1:134424148-134494368  chr1:135193671-135391358  chr1:1566976-1619087
chr1:158248715-158335919  chr1:239185878-239256562  chr1:158248715-158335919  chr1:27044617-27097748
chr1:497720-732829        chr1:65562670-65627908    chr1:65562670-65661847    chr1:65562670-65627908
chr1:65562670-65627908    chr10:15373-142831        chr15:10067661-10139569   chr11:1344991-1635177
chr1:823684-1181464       chr10:39446610-40498493   chr15:10606818-10704696   chr11:49540890-49620077
chr10:15273-141831        chr10:41625777-41707007   chr15:8924153-8982938     chr15:10067661-10139569
chr10:39446610-39498493   chr10:64793832-64844203   chr16:46263468-46350038   chr15:10606818-10704696
chr10:41625777-41707007   chr11:1344991-1635177     chr16:46782270-46917860   chr15:8924153-8982938
chr10:64793832-64844203   chr11:49540890-49620077   chr16:46782270-47022029   chr16:46263468-46350038
chr13:53394558-53502226   chr15:8924153-8982938     chr16:49036404-49161759   chr16:46782270-47022029
chr13:66202-597523        chr16:49036404-49161759   chr16:53893227-53985854   chr16:53893227-53985854
chr15:8924153-8982938     chr19:59139745-59196476   chr17:21800797-21873869   chr17:21800797-21873869
chr16:49036404-49161759   chr19:59704180-59777494   chr17:876845-930007       chr17:876845-930007
chr19:59139745-59196476   chr20:25553479-25604482   chr19:59139745-59196476   chr20:25553479-25604482
chr19:59704180-59777494   chr20:26886276-26940171   chr19:59704180-59777494   chr20:26886276-26940171

ADD REPLY • link 4.6 years ago by hosin • 0

First, create a table whose row names are the distinct CNVs. Then, for each sample, put 1 if the sample has that CNV and 0 if it does not. After that you can simply calculate the Hamming distance between samples.
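
The two steps above (build a presence/absence table, then take Hamming distances) can be sketched in plain Python. This is a minimal sketch, not a specific tool: the function names `presence_absence` and `hamming` and the toy region lists are made up for illustration, and it assumes that two CNVs count as the same event only when their coordinate strings match exactly. Calls with slightly different breakpoints (for example chr10:15373-142831 and chr10:15273-141831 in the lists above) would be treated as different CNVs unless they are merged first.

```python
def presence_absence(calls):
    """Build a 0/1 table from per-sample CNV calls.

    calls: dict mapping sample name -> list of region strings.
    Returns (regions, table), where table[sample] is a list of 0/1 values
    in the same order as regions.
    """
    # Distinct CNVs across all samples; exact string matching is an assumption.
    regions = sorted({r for regs in calls.values() for r in regs})
    table = {}
    for sample, regs in calls.items():
        present = set(regs)
        table[sample] = [1 if r in present else 0 for r in regions]
    return regions, table


def hamming(u, v):
    """Number of positions at which two equal-length vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v))


# Toy input (hypothetical regions, not the data from this thread).
calls = {
    "S1": ["chr1:100-200", "chr2:50-90"],
    "S2": ["chr1:100-200", "chr3:10-40"],
    "S3": ["chr3:10-40"],
}
regions, table = presence_absence(calls)
d12 = hamming(table["S1"], table["S2"])
```

Keeping the table as one vector per sample means the vectors can be compared directly, without a separate transposition step.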

ADD REPLY • link 4.6 years ago by German.M.Demidov ★ 3.0k

Thanks again. I have built the matrix you described; it looks like this:

All CNVs                  Sample 1  Sample 2  Sample 3  Sample 4
chr1:110238914-110324454  0         1         1         0
chr1:135193671-135391358  1         1         0         1
chr1:158248715-158335919  0         1         0         1
chr1:497720-732829        1         1         0         1
chr1:65562670-65627908    1         1         0         1
chr1:823684-1181464       1         1         0         1
chr10:15273-141831        0         0         1         1
chr10:39446610-39498493   0         1         1         0
chr10:41625777-41707007   0         0         1         0
chr10:64793832-64844203   1         1         0         1
chr13:53394558-53502226   0         0         0         1
chr13:66202-597523        1         1         0         1
chr15:8924153-8982938     0         0         0         0
chr16:49036404-49161759   1         1         0         0
chr19:59139745-59196476   0         0         1         0
chr19:59704180-59777494   1         1         1         0
chr1:134424148-134494368  1         0         1         0

I would be very grateful if you could tell me whether the format of the matrix above is right, and suggest a method or commands to obtain Hamming distances from it.

ADD REPLY • link 4.6 years ago by hosin • 0

Hi, yes, the format looks correct, but why does the CNV chr15:8924153-8982938 (0 0 0 0) have only 0s? That means none of the samples has this CNV. Also, deletions and duplications should be distinguished.

The easiest way to calculate a distance matrix is to use R and the dist function (https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/dist). It calculates distances between ROWS, so you will need to transpose your matrix before passing it in (because you care about the phylogeny of samples, not the phylogeny of CNVs). It has no Hamming distance option, but you can start with the Manhattan distance, which should be equal in this case. Or maybe the binary distance? I am unfamiliar with that one.
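
To make the transposition and the choice of metric concrete, here is a small sketch in plain Python (rather than R, so that no package call is assumed). It uses the first four rows of the matrix posted above, and the helper names are illustrative. For 0/1 data the Manhattan distance equals the Hamming count of differing positions. The "binary" method of R's dist is documented as a different measure: the proportion of positions where only one vector is nonzero among the positions where at least one is nonzero, which `binary_distance` mimics here.

```python
# Rows are CNVs, columns are samples (first four rows of the posted matrix).
rows = [
    [0, 1, 1, 0],  # chr1:110238914-110324454
    [1, 1, 0, 1],  # chr1:135193671-135391358
    [0, 1, 0, 1],  # chr1:158248715-158335919
    [1, 1, 0, 1],  # chr1:497720-732829
]

# Transpose so that each vector describes one sample across all CNVs.
samples = [list(col) for col in zip(*rows)]


def manhattan(u, v):
    """Sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def hamming(u, v):
    """Count of positions that differ."""
    return sum(a != b for a, b in zip(u, v))


def binary_distance(u, v):
    """Mismatches divided by positions where at least one vector is nonzero.

    Returning 0.0 when both vectors are all zero is a convention chosen
    for this sketch, not a statement about what R does.
    """
    either = sum(1 for a, b in zip(u, v) if a or b)
    only_one = sum(1 for a, b in zip(u, v) if bool(a) != bool(b))
    return only_one / either if either else 0.0


s1, s2 = samples[0], samples[1]
same_on_binary_data = manhattan(s1, s2) == hamming(s1, s2)
```

Because the binary distance ignores positions where both samples are 0, it down-weights shared absences, which may or may not be what you want for CNV data.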

If you use Python, I am sure there is a function in numpy that does this in one command, but I would use a nested for loop: for each sample X, for each sample Y, calculate the distance between X and Y if they are different samples.
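
The nested loop described here might look like the following sketch. The function name `distance_matrix` and the sample vectors are hypothetical; the metric is passed in as a function so that Hamming, Manhattan, or anything else can be plugged in. numpy is deliberately not used, so no particular library call is assumed.

```python
def distance_matrix(vectors, metric):
    """Return a symmetric list-of-lists matrix of pairwise distances."""
    n = len(vectors)
    dist = [[0] * n for _ in range(n)]
    for x in range(n):
        # Start at x + 1: skips x == y and pairs that were already computed.
        for y in range(x + 1, n):
            value = metric(vectors[x], vectors[y])
            dist[x][y] = value
            dist[y][x] = value  # distances are symmetric
    return dist


def hamming(u, v):
    """Count of positions that differ between two equal-length vectors."""
    return sum(a != b for a, b in zip(u, v))


# One 0/1 vector per sample (toy values for illustration).
vectors = [
    [0, 1, 0, 1],
    [1, 1, 1, 1],
    [1, 0, 0, 0],
]
matrix = distance_matrix(vectors, hamming)
```

The loop does about n*(n-1)/2 comparisons, which is roughly two million pairs for 2000 samples.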

ADD REPLY • link 4.6 years ago by German.M.Demidov ★ 3.0k

Yes, you are right: another sample that I did not include has this CNV (chr15:8924153-8982938). Sorry for the repeated messages. Do you mean that, to distinguish deletions from duplications, I should add a column next to the CNV column and fill it with numbers (such as 0 for Del and 1 for Dup)?

ADD REPLY • link 4.6 years ago by hosin • 0

It is better not to mix deletions and duplications at all. These are different events that happened at different evolutionary times. Or, if you want, you can put the exact copy number instead of 0s and 1s (0, 1, 2, 3, 4, 5, 6, etc.) and calculate the Manhattan distance instead of the Hamming distance.
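
Both suggestions can be sketched in Python. The sketch below assumes a baseline of 2 copies (a diploid autosome, which is an assumption and not something stated in this thread), and the names `split_del_dup` and `manhattan` as well as the copy-number values are made up for illustration.

```python
BASELINE = 2  # assumed normal copy number (diploid autosome)


def split_del_dup(copy_numbers, baseline=BASELINE):
    """Turn one sample's copy numbers into separate 0/1 vectors.

    Returns (deletions, duplications) so that the two event types can be
    analysed in separate matrices instead of being mixed.
    """
    deletions = [1 if cn < baseline else 0 for cn in copy_numbers]
    duplications = [1 if cn > baseline else 0 for cn in copy_numbers]
    return deletions, duplications


def manhattan(u, v):
    """Sum of absolute differences between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


# Copy number per region for two samples (toy values).
sample_a = [2, 1, 4, 2]
sample_b = [2, 0, 3, 3]

dels_a, dups_a = split_del_dup(sample_a)
cn_distance = manhattan(sample_a, sample_b)
```

With exact copy numbers, a 1-copy versus 0-copy difference contributes to the distance even though both samples carry a deletion, which a plain 0/1 table would not capture.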

ADD REPLY • link 4.6 years ago by German.M.Demidov ★ 3.0k

Hi, thanks for your answers. Is it possible to run PCA and Admixture analyses on CNV data alone, without SNPs? Do you have any experience with that?

ADD REPLY • link 4.3 years ago by hosin • 0