Question: Selecting Nodes Of High Correlation In A Tree
10.3 years ago, toni (2.2k) wrote:


I have a microarray experiment on which I first ran a hierarchical clustering using R/Bioconductor, so I have a gene tree. This gene tree can be converted to a .gtr file.

Here is a small example of a gtr file (reporting the history of node joining)


The last column is the correlation measured at each node.
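
For readers who want to work with such a file directly, here is a minimal sketch of reading it. It assumes the usual four tab-separated columns (node ID, the two joined children, and the node score); the sample rows and the `read_gtr` helper are made up for illustration and are not the data from the post.

```python
import csv
import io

# Hypothetical GTR content: node ID, left child, right child, node score.
# These rows are invented for illustration only.
SAMPLE_GTR = (
    "NODE1X\tGENE3X\tGENE4X\t0.95\n"
    "NODE2X\tGENE1X\tGENE2X\t0.88\n"
    "NODE3X\tNODE1X\tGENE5X\t0.61\n"
    "NODE4X\tNODE2X\tNODE3X\t0.12\n"
)

def read_gtr(handle):
    """Return {node_id: (left_child, right_child, score)} from a GTR stream."""
    nodes = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 4:
            continue  # skip blank or malformed lines
        node_id, left, right, score = row[:4]
        nodes[node_id] = (left, right, float(score))
    return nodes

tree = read_gtr(io.StringIO(SAMPLE_GTR))
for node_id, (left, right, score) in tree.items():
    print(node_id, left, right, score)
```

With a real file, `open("mydata.gtr", newline="")` (a hypothetical path) could be passed in place of the `io.StringIO` object.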

My questions are:

Do you know how this "correlation" is calculated at each node in general (especially in Eisen's software)? Are there several common methods? Is one used much more than the others for gene expression?

Which node correlation measure would you use to select clusters of highly correlated genes and then submit them to a GO analysis tool? Is it reliable to select genes this way before GO analysis?



clustering microarray gene • 3.9k views
10.2 years ago, Haibao Tang (3.0k), Mountain View, CA, wrote:

If I understand correctly, this is a question regarding how one can "cut" the hierarchical clustering to extract highly correlated nodes. There are a few options but they are dependent on the metrics that one uses, and require some arbitrary decisions.

From the result of Eisen's CLUSTER program, you might notice that each internal node (NODE1X, ..) in the output has a metric associated with it (the value in the last column in the output). Keep in mind that this value depends on the distance metric (e.g. Euclidean distance or Pearson correlation coefficient) and the linkage method (e.g. single-linkage, complete-linkage) you used when running CLUSTER.

One immediate method is to pick an arbitrary cutoff and select the nodes that exceed a minimum quality. Let's say we want to select the nodes that have an average correlation coefficient r > 0.7. The exact cutoff depends on how compact you'd like the clusters to be, so it is quite arbitrary. In statistics texts, people often determine the number of clusters by plotting the number of clusters (k, thereby gradually loosening the cutoff) against the compactness of the partitions, and then choose a suitable k based on that plot.
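
The fixed-cutoff idea can be sketched roughly as follows. This is only an illustration: the tree, the node names, and the `cut_tree` helper are hypothetical, and the sketch assumes that node scores never increase toward the root (true for single, complete, and average linkage, but not guaranteed for every method).

```python
# Hypothetical tree in GTR style: node -> (left child, right child, score).
TREE = {
    "NODE1X": ("GENE3X", "GENE4X", 0.95),
    "NODE2X": ("GENE1X", "GENE2X", 0.88),
    "NODE3X": ("NODE1X", "GENE5X", 0.61),
    "NODE4X": ("NODE2X", "NODE3X", 0.12),
}
ROOT = "NODE4X"

def leaves(tree, node):
    """List the genes (leaves) below a node."""
    if node not in tree:
        return [node]  # anything that is not an internal node is a gene
    left, right, _ = tree[node]
    return leaves(tree, left) + leaves(tree, right)

def cut_tree(tree, node, cutoff):
    """Return the largest subtrees whose node score is >= cutoff."""
    if node not in tree:
        return []  # a lone gene is not reported as a cluster
    left, right, score = tree[node]
    if score >= cutoff:
        return [(node, score, leaves(tree, node))]
    # The node is too loose: look for tighter clusters in each child.
    return cut_tree(tree, left, cutoff) + cut_tree(tree, right, cutoff)

clusters = cut_tree(TREE, ROOT, cutoff=0.7)
for node, score, genes in clusters:
    print(node, score, genes)
```

Walking down from the root and stopping at the first node that passes the cutoff gives the largest clusters that still meet it; genes that never fall under a passing node are simply left out.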

Recent research instead focuses on automatic (dynamic) selection of the cutoff, with applications to gene expression data. I'll list a few references, but there are more.

"An improved algorithm for clustering gene expression data"

"Selection of informative clusters from hierarchical cluster tree with gene classes"

"Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R"

In summary, there is no simple answer to your question; everyone seems to do this differently. But it is certainly an active field.


Thank you very much. You understood my concerns perfectly. This was a problem for me because I often choose Ward linkage (not provided in Eisen's software), so I use R, export the R results to CDT, GTR, and ATR files myself, and then use Java TreeView. So I needed to calculate the correlation at each node myself (hclust only provides the "height" value of each node). But in the end I use an arbitrary cutoff. Thanks for the references on cluster selection.
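
One way to compute a per-node correlation from a merge history is sketched below, as an assumption about what such a calculation could look like rather than the exact procedure used here. It takes the mean pairwise Pearson correlation of all genes under each node. The merge list follows the convention of R's `hclust$merge` (negative = single gene, positive = earlier merge step); the expression values and the `node_correlations` helper are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (non-constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical expression profiles (genes x samples).
EXPR = [
    [1.0, 2.0, 3.0, 4.0],  # gene 1
    [2.0, 4.0, 6.0, 8.0],  # gene 2
    [4.0, 3.0, 2.0, 1.0],  # gene 3
]
# Merge history in the style of hclust$merge (1-based indices):
# negative = single gene, positive = result of an earlier merge step.
MERGE = [(-1, -2), (1, -3)]

def node_correlations(expr, merge):
    """Mean pairwise Pearson correlation of the genes under each node."""
    members = []  # members[i] = 0-based gene indices under merge step i + 1
    scores = []
    for a, b in merge:
        genes = []
        for child in (a, b):
            genes += [-child - 1] if child < 0 else members[child - 1]
        members.append(genes)
        pairs = list(combinations(genes, 2))
        total = sum(pearson(expr[i], expr[j]) for i, j in pairs)
        scores.append(total / len(pairs))
    return scores

scores = node_correlations(EXPR, MERGE)
print(scores)
```

Other per-node summaries (minimum pairwise correlation, intra-cluster variance) would fit the same loop by changing how the pairs are aggregated.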

Reply written 10.2 years ago by toni (2.2k)
10.3 years ago, Daniel Swan (13k), Aberdeen, UK, wrote:

You might be interested in this paper.

"A common clustering method in the analysis of gene expression data has been hierarchical clustering. Usually the analysis involves selection of clusters by cutting the tree at a suitable level and/or analysis of a sorted gene list that is obtained with the tree. Cutting of the hierarchical tree requires the selection of a suitable level and it results in the loss of information on the other level. Sorted gene lists depend on the sorting method of the joined clusters. Author proposes that the clusters should be selected using the gene classifications."


Thank you Daniel. Very useful. Indeed, being able to select enriched nodes at different levels is interesting (instead of cutting the tree, with possible loss of information).
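
To make "enriched nodes" concrete, here is a small sketch of a one-sided hypergeometric test, a common choice for scoring a cluster against a gene class. It is not necessarily the statistic used in the paper, and the counts in the example are invented.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): chance of seeing at least k annotated genes in a cluster.

    N = genes on the array, K = genes carrying the annotation,
    n = genes in the cluster, k = annotated genes in the cluster.
    """
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)

# Hypothetical numbers: 20 genes in total, 5 in the GO class,
# and a 4-gene cluster containing 3 of them.
p = hypergeom_pvalue(k=3, n=4, K=5, N=20)
print(round(p, 4))
```

Running this at every internal node and keeping the nodes with the smallest p-values is one way to pick clusters at different levels of the tree; with many nodes and many classes, a multiple-testing correction would be needed.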

Reply written 10.2 years ago by toni (2.2k)
10.3 years ago, Istvan Albert (84k), University Park, USA, wrote:

Clustering makes use of similarity measures between elements. There are various similarity measures that one may employ: Euclidean, correlation, cosine, etc. The actual numerical values may not be sufficient to identify the method used to produce them.
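
As a small illustration of how these measures differ, the sketch below compares two made-up profiles that have the same shape but different baselines. The vectors and function names are hypothetical.

```python
from math import sqrt

def euclidean(x, y):
    """Straight-line distance between two profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine(x, y):
    """Cosine of the angle between two profiles (uncentered)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

def pearson(x, y):
    """Pearson correlation, written as the cosine of mean-centered profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return cosine([a - mx for a in x], [b - my for b in y])

g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [11.0, 12.0, 13.0, 14.0]  # same shape as g1, shifted up by 10

print("euclidean:", euclidean(g1, g2))
print("cosine:   ", cosine(g1, g2))
print("pearson:  ", pearson(g1, g2))
```

Pearson correlation treats the two profiles as identical in shape, while Euclidean distance reports them as far apart, which is one reason the raw numbers alone do not tell you which measure produced them.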

Joining nodes into clusters is a second stage; here again, several techniques may be used to link similar subgroups into a single one.
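
A rough sketch of three common linkage rules follows, using Euclidean distance between made-up points. The `linkage_distance` helper is hypothetical and only shows how the cluster-to-cluster distance is defined in each case.

```python
from itertools import product
from math import dist  # Euclidean distance between two points (Python 3.8+)

def linkage_distance(cluster_a, cluster_b, method):
    """Distance between two clusters under a given linkage rule."""
    d = [dist(p, q) for p, q in product(cluster_a, cluster_b)]
    if method == "single":
        return min(d)           # closest pair of members
    if method == "complete":
        return max(d)           # farthest pair of members
    if method == "average":
        return sum(d) / len(d)  # mean over all cross-cluster pairs
    raise ValueError(f"unknown linkage method: {method}")

# Two hypothetical clusters of 2-D points.
A = [(0.0, 0.0), (0.0, 1.0)]
B = [(3.0, 0.0), (4.0, 0.0)]

for method in ("single", "complete", "average"):
    print(method, linkage_distance(A, B, method))
```

The same pair of clusters gets a different joining value under each rule, so the number stored at a node depends on the linkage as well as on the distance metric.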

There is no right or wrong method. Many people use Pearson correlation as their metric; it has a fairly straightforward interpretation.

If you have genes of interest you can reduce your dataset to those genes only. This may increase the predictive power of your results because there will be fewer variables in play.


Thank you Istvan. The theoretical background of clustering is OK. Each node value corresponds to the dissimilarity of the joined nodes (depending on the distance/linkage you chose). Actually, this question arose from the fact that when using Cluster/TreeView with Euclidean distance and complete linkage, the so-called correlation values at each node do not correspond to the dissimilarity of the joined nodes but are scaled values in [0,1]. So I was wondering whether a scaling was applied, or maybe a customized recursive function that would measure what you want at each node, like an intra-cluster variance or a simple correlation.
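
One possible scaling that would produce values in [0,1] is sketched below: divide each merge distance by the largest one and subtract the result from 1. This is an assumption for illustration, not something confirmed in this thread or checked against the Cluster source; the heights and the `heights_to_scores` helper are hypothetical.

```python
def heights_to_scores(heights):
    """Map merge distances to [0, 1] scores: 1 = tightest, 0 = the root."""
    top = max(heights)
    if top == 0:
        return [1.0 for _ in heights]  # all merges at distance zero
    return [1.0 - h / top for h in heights]

# Hypothetical merge heights, e.g. Euclidean distances from a clustering run.
heights = [1.0, 2.0, 3.0, 4.0]
scores = heights_to_scores(heights)
print(scores)
```

Under this scheme the root always gets a score of 0 and the values are only comparable within one tree, so a fixed cutoff would not mean the same thing across datasets.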

Reply written 10.2 years ago by toni (2.2k)