Optimal Distance Measure/Method To Perform Hierarchical Clustering
10.7 years ago
APJ ▴ 40

Hi, hierarchical clustering is a nice way of representing differences between samples and looking at the relationships among them. However, the shape of the resulting tree depends on the linkage method used to merge clusters (single/Ward's/complete/average) and on the distance measure used between data points (Euclidean/Pearson). I have 16 RNA-seq samples and performed hierarchical clustering on the dataset using Euclidean distance with Ward's method; the tree generated was different from the one I got with single linkage. The single-linkage tree makes more sense in terms of the biology, but I am not clear on which method is optimal when working with RNA-seq count data.

Any suggestions, please?

clustering • 7.5k views

Dear APJ, this is almost identical to the question you asked before. I strongly agree with Steve and suggest you study the resources he mentions. Sometimes it is hard to accept that there is no absolute best method and that all methods have their merits and applications, but bear in mind that these are unsupervised methods and there is no gold standard. This is unlike supervised classification and machine learning, where one ideally has well-annotated data and can therefore evaluate methods objectively, yielding specificity, sensitivity, ROC curves, and so on; none of that is possible for cluster analysis. The best way to deal with it is to take biological knowledge into account, as you have already attempted.

10.7 years ago

There is no universally optimal distance measure -- clustering can be more of an art than a science.

Instead of suggesting one method over another, a more useful suggestion, I think, is for you to read through the DESeq2 vignette, sections 7 and 8 in particular. It would likely be a good idea to run your count data through one of the "variance stabilizing transformations" (either the vst or the rlogTransform) prior to clustering.
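To see why raw counts cluster poorly, here is a toy Python sketch (made-up numbers, not the DESeq2 code itself; a log2 transform is only a crude stand-in for vst/rlog): without a transformation, the few highest-count genes dominate the Euclidean distance between samples.

```python
import math

# Toy counts for two replicate samples; the two highly expressed genes
# dominate the raw Euclidean distance, while a log2 transform lets all
# genes contribute on a comparable scale.
sample_a = [5, 12, 980, 10050]
sample_b = [8, 9, 1100, 9800]

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

raw_dist = euclidean(sample_a, sample_b)
log_dist = euclidean([math.log2(c + 1) for c in sample_a],
                     [math.log2(c + 1) for c in sample_b])

print(raw_dist)  # driven almost entirely by the two high-count genes
print(log_dist)  # all four genes contribute comparably
```

The same pair of samples looks hundreds of units apart on the raw scale but well under one unit apart on the log scale, which is the kind of distortion the variance-stabilizing transformations are designed to remove.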

The analogous approach in "edgeR country" is to use its predFC function, or cpm with a non-zero prior.count, as explained in section 2.10 of the edgeRUsersGuide.
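The role of the prior count can be sketched in a few lines of Python (this is a simplified stand-in, not edgeR's actual formula, which scales the prior per library): adding a small constant before taking log-CPM keeps low or zero counts from producing extreme values.

```python
import math

def log_cpm(count, lib_size, prior=2):
    # Simplified moderated log2-CPM: the prior count prevents low/zero
    # counts from blowing up on the log scale (edgeR's cpm() is more
    # careful about how the prior is scaled across libraries).
    return math.log2((count + prior) / (lib_size + 2 * prior) * 1e6)

# A gene with 0 vs 3 counts in two equal-sized libraries: without the
# prior the log fold change would be infinite; with it, it is modest.
lfc = log_cpm(3, 1_000_000) - log_cpm(0, 1_000_000)
print(lfc)
```

With equal library sizes this reduces to log2((3 + 2) / (0 + 2)) = log2(2.5), i.e. about 1.32, instead of an infinite fold change, which is what makes the transformed values usable as distances for clustering.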

10.7 years ago
KCC ★ 4.1k

One notion of a cluster is a tightly associated group in which all members are similar to all other members. Another is a group in which there is a continuous path of small transitions from any one member to any other. Both are valid ways of looking at clusters. The first notion can be more helpful when you are fishing for patterns, since there is a high amount of signal in the group, and it tends to be more stable. Single linkage is sensitive to certain kinds of outliers that can serve as 'missing links' between two groups, causing the algorithm to merge those groups.

If single linkage makes more sense for your data, it implies that the local structure of various parts of your data space matters more. Single linkage merges clusters when any of their members are close to each other, while Ward's method merges based on within-cluster variance, meaning how similar all the members of a cluster are before and after the merge. So the two linkages will yield very different results in certain situations. Imagine a dataset with several mostly distinct clusters that smush together slightly at the edges: single linkage will tend to chain them into one cluster through the overlapping points, while Ward's method will tend to keep them separate.
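A small Python sketch (toy 2-D points, made-up coordinates) shows the effect of one bridging outlier: the single-linkage gap between two tight groups collapses, while Ward's merge cost changes much less.

```python
import math

def single_gap(A, B):
    # Single-linkage distance: closest pair across the two groups.
    return min(math.dist(a, b) for a in A for b in B)

def ward_cost(A, B):
    # Ward's merge cost: increase in within-cluster sum of squares,
    # |A||B|/(|A|+|B|) times the squared distance between centroids.
    ca = [sum(x) / len(A) for x in zip(*A)]
    cb = [sum(x) / len(B) for x in zip(*B)]
    return len(A) * len(B) / (len(A) + len(B)) * math.dist(ca, cb) ** 2

left   = [(0, 0), (1, 0), (0, 1)]
right  = [(10, 0), (11, 0), (10, 1)]
bridge = (5, 0)  # a single outlier sitting between the groups

print(single_gap(left, right))             # ~9: the groups look far apart
print(single_gap(left + [bridge], right))  # ~5: one point nearly halves the gap
print(ward_cost(left, right))              # moves far less when the
print(ward_cost(left + [bridge], right))   # bridge point is added
```

One stray point cuts the single-linkage gap almost in half (and a chain of such points would merge the groups entirely), whereas Ward's criterion, which averages over whole clusters, barely notices it.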

I spent several years working on different clustering techniques, and my takeaway message is: keep it simple. Use measures of distance that are not too complicated, or that are commonly used in your field. My reason is simple: the result is easier to interpret and intelligible to a larger group of people. Use relatively simple techniques like hierarchical clustering or k-means. Use more than one method, and favor robust results that show up in more than one approach.
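One simple way to check whether a result "shows up in more than one approach" is to compare the label assignments from two methods with the Rand index, sketched here in plain Python (the example labelings are made up):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    # Fraction of point pairs on which two clusterings agree:
    # same-cluster in both, or different-cluster in both.
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)

# Hypothetical labels from two runs (say, hierarchical vs k-means);
# cluster IDs need not match, only the groupings matter.
hier   = [0, 0, 0, 1, 1, 2, 2, 2]
kmeans = [1, 1, 1, 0, 0, 0, 2, 2]
print(rand_index(hier, kmeans))  # close to 1 when the groupings agree
```

Samples that stay grouped together across methods are the robust part of the result; pairs that flip between runs are where you should be most cautious.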

A clustering job is a win if you end up with fewer groups than you started with. So in theory, if your raw data has 1000 items and you have clustered them into 10 groups, 10 is still less complexity to deal with than 1000. I tend to favor clustering that optimizes for members that are all very similar to each other; downstream statistical analysis of the clusters is then likely to yield simple models for each group.

In general, it is not surprising that we can't throw a mass of data of unknown structure into a procedure with few assumptions and consistently get biologically meaningful results. After all, the clustering procedure is not a biologist. So, the person doing the clustering must look for what biology he or she can find.

Clustering is indeed an art.

