I am trying to cut a dendrogram using the dynamicTreeCut package, because I prefer dynamic cutting for clustering. I ran the code below:
clusDyn <- cutreeDynamic(hr, distM = as.matrix(as.dist(1-cor(t(scaledata)))), method = "hybrid")
However, it produces 160 clusters, which is too many to analyze individually. Is it possible to cut the tree dynamically but also group the results so that only a specific number of clusters is produced? For example, I would like 20 clusters after the dynamic tree cut instead of 160.
I know that if I cut the dendrogram at a specific height I could control the number of clusters it generates, but I prefer dynamic tree cutting.
This is happening because the input is a simple correlation matrix that is affected by spurious or missing connections (see this paper).
I am very new to RNA-seq analysis and clustering. Can you please elaborate? Do you mean that Pearson correlation is not enough for this clustering and that I should look for other methods? Is WGCNA a better workflow?
Help me understand: is this a clustering analysis of differentially expressed genes or an unsupervised clustering analysis (e.g. WGCNA)?
These are differentially expressed genes, around 15K out of a total of 30K genes. I then follow the clustering protocol given at this link (the genes are scaled and then clustered by Pearson correlation): https://2-bitbio.com/2017/04/clustering-rnaseq-data-making-heatmaps.html
I don't think the cutreeDynamic function will work very well with a distance matrix calculated from Pearson correlation values: as.matrix(as.dist(1-cor(t(scaledata)))). Just to be sure, how did you calculate hr (the link doesn't work for me)?
Thank you for the effort. I calculated hr as you have shown:
hr <- hclust(as.dist(1-cor(t(scaledata), method="pearson")), method="complete")
As it seems that Pearson correlation values do not work well with cutreeDynamic, can you please suggest something that I can look into, to make a better correlation matrix?
Look, I am not familiar with workflows used for detecting clusters of differentially expressed genes. What I can tell you is that cutreeDynamic, with the default settings, doesn't work very well when the distance matrix is calculated just from Pearson correlation values.
If you want to use cutreeDynamic, there are settings that you can change in order to reduce the number of clusters. For example, see minClusterSize, deepSplit, cutHeight, and maxCoreScatter (usage).
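As a rough illustration of changing those settings, here is a minimal sketch. It uses a small random matrix as a stand-in for scaledata (an assumption, since the real data is not shown), and the values deepSplit = 0 and minClusterSize = 30 are arbitrary starting points, not recommendations. Lowering deepSplit and raising minClusterSize generally gives fewer, larger clusters, but neither guarantees an exact count such as 20; cutHeight and maxCoreScatter can be passed in the same way and would need tuning on the real data.

```r
library(dynamicTreeCut)

# Synthetic stand-in for the scaled expression matrix (genes in rows,
# samples in columns). This is an assumption for illustration only.
set.seed(1)
expr <- matrix(rnorm(300 * 12), nrow = 300)
scaledata <- t(scale(t(expr)))

# Same distance and tree construction as in the thread
distM <- as.matrix(as.dist(1 - cor(t(scaledata), method = "pearson")))
hr <- hclust(as.dist(distM), method = "complete")

# Default hybrid cut, as in the original call
clusDefault <- cutreeDynamic(hr, distM = distM, method = "hybrid",
                             verbose = 0)

# Coarser cut: less aggressive splitting and a larger minimum cluster size
# (both values are arbitrary and should be tuned)
clusCoarse <- cutreeDynamic(hr, distM = distM, method = "hybrid",
                            deepSplit = 0, minClusterSize = 30,
                            verbose = 0)

# Number of clusters in each result; label 0 marks unassigned genes
length(unique(clusDefault[clusDefault > 0]))
length(unique(clusCoarse[clusCoarse > 0]))
```

Comparing the two counts on the real data should show how far these settings move the result away from 160 clusters.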