If I understand correctly, this is a question about how one can "cut" a hierarchical clustering to extract highly correlated nodes. There are a few options, but they depend on the metric you use, and they all require some arbitrary decisions.
From the result of Eisen's CLUSTER program, you might notice that each internal node (NODE1X, ...) in the output has a metric associated with it (the value in the last column of the output). Keep in mind that this value depends on the distance metric (e.g. Euclidean distance or Pearson correlation coefficient) and the linkage method (e.g. single-linkage, complete-linkage) you chose when running CLUSTER.
One immediate method is to pick an arbitrary cutoff and keep only the nodes that exceed a minimum quality. Say we want the nodes whose average correlation coefficient is r > 0.7. The exact cutoff depends on how compact you want the clusters to be, so it is quite arbitrary. In statistics texts, people often determine the number of clusters by plotting the cluster number k against the compactness of the resulting partitions (gradually loosening the cutoff increases k), and then choosing a suitable k from that plot.
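As a sketch of both ideas, again with SciPy as an assumed stand-in for CLUSTER: with correlation distance d = 1 - r, a cutoff of r > 0.7 corresponds to cutting the tree at merge height 0.3, and scanning a range of cutoffs shows how k grows as the threshold loosens (the data here are three synthetic groups, not real expression values).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Three base profiles over 6 conditions, chosen to be weakly or
# negatively correlated with one another.
base = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
    [1.0, 6.0, 1.0, 6.0, 1.0, 6.0],
])
# 10 noisy copies of each profile -> 30 "genes".
expr = np.vstack([b + rng.normal(scale=0.05, size=(10, 6)) for b in base])

Z = linkage(pdist(expr, metric="correlation"), method="average")

# Single cut: keep sub-trees whose merge height stays below 1 - 0.7.
labels = fcluster(Z, t=0.3, criterion="distance")
print("clusters at r > 0.7:", labels.max())

# Scan cutoffs from strict to loose and record k; plotting k against
# compactness over such a scan is the "choose k from the curve" idea.
for t in (0.05, 0.3, 1.0, 1.9):
    k = fcluster(Z, t=t, criterion="distance").max()
    print(f"height cutoff {t:.2f} -> k = {k}")
```

Here the r > 0.7 cut recovers the three planted groups; looser cutoffs progressively merge them.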
More recent research focuses instead on automatic (dynamic) selection of the cutoff, with applications to gene expression data. I'll list a few references, but there are more:
"An improved algorithm for clustering gene expression data"
"Selection of informative clusters from hierarchical cluster tree with gene classes"
"Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R"
In summary, there is no simple answer to your question; everyone seems to do this differently. But it is certainly an active area of research.