Question: Bootstrapping => consensus tree construction based on distance matrices
0
kaaschr10 wrote:

For my PhD project i am trying to use copy number variations (CNVs) to deduce phylogeny between 9 cancer cell lines... and it seem to work quite well. It is not hard to  get to a 9x9 distance matrix showing the copy number variations between each of them (400-2000 CNVs out of 20000 genes in the genome). But do any of you have experience with how to draw a consensus tree with bootstrapping from this (and not just a tree based on the one distance matrix)? I have looked into the R-packages ape, phangorn and phyclust but all of them require you to start with sequences that you then align. They are not very happy about accepting a distance matrix alone (because bootstrapping require taking sub-samples from your dataset).

It is not hard to make e.g. 1000 new distance matrices based on extraction of e.g. 1/4 the dataset by random
coll[sample(nrow(coll),size=20000/4,replace=FALSE),]

So I believe the question is: Would it be possible to use 1000 distance matrices to create one concensus phylogenetic tree?  R would be preferred but any input is appreciated.

Kind regards
-Christian

ape bootstrap phylogenetics R • 5.3k views
modified 5.6 years ago by David W4.7k • written 5.6 years ago by kaaschr10
0
David W4.7k wrote:

Hi Christian,

If I understand you, this should be doable with the normal `ape` functions. `boot.phylo` resambles a matrix "columnwise". So, if you have your copy number data in a matrix with loci in columns and taxa in rows (like this fake data)

````library(ape)`
`library(MASS)`
`set.seed(123)`

`tr <- rcoal(10)`

```# 'evolve' copy number down the tree using Brownian motion
cn_matrix <- ceiling( t( mvrnorm(100, mu=rep(3,10), Sigma=vcv(tr))))``````

Then all you need to do is create a function to estimate your tree with the matrix as input. Here's the simplest one:

``````estimate_tr <- function(m) nj(dist(m, method="manhattan"))
point_est <- estimate_tr(cn_matrix)
bs <- boot.phylo(point_est, cn_matrix, estimate_tr, trees=TRUE)
con <- consensus(bs\$trees, p=0.5)``````

That should give you a consensus tree `?boot.phylo` has some more documentation for related methods.

Hi David

thank you for answering my question. The method works... but the resulting concensus tree loose most of the information seen in the original tree. If we use your example and run your code and follow it with

bs <- boot.phylo(point_est, cn_matrix, estimate_tr, trees=TRUE, B=1)
bs2 <- boot.phylo(point_est, cn_matrix, estimate_tr, trees=TRUE, B=500)
con <- consensus(bs\$trees, p=0.5)
con2 <- consensus(bs2\$trees, p=0.5)
plot(con, type="u", main="Con,B=1")
windows()
plot(con2, type="u", main="Con,B=500")
windows()
plot(point_est, type="u", main="original")

you get a plot of what a straightforward nj tree (named "original) showing clusters of t3,4 another wih t6-8 and one with the rest. This is tree made from the entire dataset.
If you plot a consensus tree based on one single permutation of the data (B=1) all this is lost and you have a plot where the t3-4 and the t6-8 are still there but we cannot anymore conclude that they are distinct clusters from the rest of the dataset. I see the same result with my data: after running B=1 or B=1000 i end up with a tree that has no distinctive features.
Is it because my dataset is not really showing me anything (i severely doubt this) or is boot.phylo taking out a very very small subsample => the clusters are still there but the distance between them are now much much shorter than before? Can you explain to me why plotting after one round of bootstrapping is so fundamentally different from the original plot?

Hi Christian,

I can't talk about specifics because we are looking a different trees, for me, setting the seed and running the code gives the  point estimate tree as

`(t7,((((t3,t5),t10),t9),t8),((t1,t4),(t6,t2)));`

I'm not sure how you are defining clusters (on unrooted trees), but you should be aware consensus trees have no branch lengths.

In answer to some of your other questions, boot.phylo is re sampling the matrix with replacements. Using the variables above each replicate is equivalent to

`bs_mat <- cn_matrix[,sample(ncol(cn_matrix), replace=TRUE)]`

With B=1 you have the result of a single bootstrap sample, it will be different than the tree estimated from the whole data-set but no less resolved (though, again, the 'consensus' tree will have no branch lengths, if you want these just plot bs\$trees[])

Dear David

Thank your that that little comment. that made all the difference. Now the (B=1) is similar to the one from the entire dataset and B=100 is slight different. My dataset had no changes at all after B=1000 so it must be pretty stable. Thanks a lot for your help. Have a nice day