Question: Is hierarchical clustering of significant genes 'supervised' or 'unsupervised' clustering?
4.2 years ago by
United States
steve2.7k wrote:

I have differential gene expression data from an RNA-Seq experiment. After filtering to include only genes from a previously developed Gene Ontology list, and filtering for significant genes based on adjusted p value, I am using the R pheatmap package to make clustering heatmaps.

Does this count as 'supervised' or 'unsupervised' clustering?

The descriptions here and here seem to suggest that hierarchical clustering is 'unsupervised'.

But does the pre-filtering of significant genes of interest make this 'supervised'? Is there more to the designation?

heatmap clustering rna-seq • 5.7k views
ADD COMMENTlink modified 19 months ago by Mozart240 • written 4.2 years ago by steve2.7k

Since you are using outside knowledge (differences between two known groups of samples) to select the genes, this would fall under supervised or semi-supervised clustering. However, in a paper you could describe it as unsupervised clustering of differentially expressed genes and everyone would understand that it was semi-supervised.

ADD REPLYlink written 2.9 years ago by reilly.brian.m60

Sorry for re-upping this post (it is always better than creating a new thread, I guess). So, if I got it right:

  1. when you extract RNA from samples (let's say treated with something or even a specific phenotype) and,
  2. when you have your genes annotated, and

you're asking how these genes cluster together then you are doing an unsupervised hierarchical clustering, correct?

ADD REPLYlink written 19 months ago by Mozart240

I've moved your post to a comment since it is not an answer. Use the "add comment" button to request clarifications.

Clustering is typically an unsupervised approach. Unsupervised means you don't use external information to group your data points/items, i.e. grouping is based only on the data. In supervised learning, you make use of external information to form the groups, typically category labels to train a classifier. There are also intermediate situations called semi-supervised learning in which clustering for example is constrained using some external information.
So if you apply hierarchical clustering to genes represented by their expression levels, you're doing unsupervised learning.

ADD REPLYlink written 19 months ago by Jean-Karim Heriche24k

Thanks so much, Jean-Karim. Very helpful.


ADD REPLYlink written 19 months ago by Mozart240
4.2 years ago by
Steven Lakin1.5k
Fort Collins, CO, USA
Steven Lakin1.5k wrote:

This distinction has more to do with machine learning algorithm categories. While clustering is considered a subcategory of "machine learning," in your case what you're doing is mostly considered linear algebra.

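Pre-filtering does not affect the category: the algorithm sees only the data, which in this case is an N-dimensional geometric space from which some sort of sample-wise distance is calculated. You can influence the way that clustering happens within pheatmap by using a different distance metric (e.g. "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski", set through clustering_distance_rows and clustering_distance_cols) or by changing the linkage method (clustering_method). Note that pheatmap clusters hierarchically by default; k-means is only used if you set the kmeans_k argument, which aggregates the rows into k clusters before the heatmap is drawn.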
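
To make the point concrete, here is a minimal sketch in base R on simulated data. The matrix `expr`, the labels `group`, the effect size and the 0.05 cutoff are all made up for illustration, and a per-gene t-test stands in for a real differential expression analysis. The group labels are used in the filtering step, but they are never passed to the clustering step.

```r
set.seed(1)

# Simulated expression matrix: 200 genes x 12 samples (hypothetical data).
n_genes <- 200
group <- factor(rep(c("control", "treated"), each = 6))
expr <- matrix(rnorm(n_genes * 12), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)),
                               paste0("s", 1:12)))
# Shift the first 20 genes upward in the treated samples.
expr[1:20, group == "treated"] <- expr[1:20, group == "treated"] + 4

# Filtering step: uses the group labels (per-gene t-test, BH adjustment).
pvals <- apply(expr, 1, function(x) t.test(x ~ group)$p.value)
padj <- p.adjust(pvals, method = "BH")
sig <- expr[padj < 0.05, , drop = FALSE]

# Clustering step: sees only the filtered matrix, never `group`.
d <- dist(t(sig), method = "euclidean")  # sample-wise distances
hc <- hclust(d, method = "complete")
clusters <- cutree(hc, k = 2)

# Compare the clusters with the known labels after the fact.
print(table(clusters, group))
```

The clustering call itself is the same whether or not the genes were pre-selected; what changes is how surprised you should be when the samples split by group, since the genes were chosen for exactly that property.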

You can also read more about different hierarchical joining methods by reading up on hclust, which is the function underlying pheatmap:

Ward's minimum variance method aims at finding compact, spherical clusters. The complete linkage method finds similar clusters. The single linkage method (which is closely related to the minimal spanning tree) adopts a ‘friends of friends’ clustering strategy. The other methods can be regarded as aiming for clusters with characteristics somewhere between the single and complete link methods. Note however, that methods "median" and "centroid" are not leading to a monotone distance measure, or equivalently the resulting dendrograms can have so called inversions or reversals which are hard to interpret, but note the trichotomies in Legendre and Legendre (2012).
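
If you want to see the difference between linkage methods for yourself, something like the following should work. It is only a sketch on random points (the data and the object names are invented). It builds one tree per method from the same distances and checks whether the merge heights ever decrease, which is what an inversion looks like in the `height` vector. Centroid and median linkage are given squared distances here, which I believe is how the hclust help page examples use them.

```r
set.seed(42)

# Hypothetical data: 30 points in 2 dimensions.
x <- matrix(rnorm(60), ncol = 2)
d <- dist(x)

# Linkage methods that work directly on the distances.
plain <- c("ward.D2", "complete", "single", "average")
# Linkage methods applied to squared distances in this sketch.
squared <- c("centroid", "median")

trees <- c(lapply(plain, function(m) hclust(d, method = m)),
           lapply(squared, function(m) hclust(d^2, method = m)))
names(trees) <- c(plain, squared)

# A dendrogram is monotone when its merge heights never decrease.
monotone <- sapply(trees, function(tr) !is.unsorted(tr$height))
print(monotone)
```

In pheatmap the same choice is made through the `clustering_method` argument, which is handed on to hclust.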

Supervised in most machine learning contexts means using prior information (prior data) in order to inform a decision about new data, given some category of algorithm.

Unsupervised means using only the data itself to make some decisions about the data, again given some category of algorithm.
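
As a rough illustration of the two definitions, here is a toy contrast in base R. Everything in it (the data, the labels, the nearest-centroid rule) is a made-up example and not something pheatmap does. The supervised part uses labels from prior data to classify a new sample; the unsupervised part groups the same data without looking at the labels.

```r
set.seed(7)

# Hypothetical training data: 20 samples x 5 features with known labels.
labels <- factor(rep(c("A", "B"), each = 10))
train <- matrix(rnorm(100), nrow = 20)
train[labels == "B", ] <- train[labels == "B", ] + 3

# Supervised: the labels define one centroid per class,
# and a new sample is assigned to the nearest centroid.
centroids <- rbind(A = colMeans(train[labels == "A", ]),
                   B = colMeans(train[labels == "B", ]))
new_sample <- rep(3, 5)
dists <- apply(centroids, 1, function(ctr) sqrt(sum((new_sample - ctr)^2)))
predicted <- names(which.min(dists))
print(predicted)

# Unsupervised: groups come from the data alone; `labels` is not used.
groups <- cutree(hclust(dist(train)), k = 2)
print(table(groups))
```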

Don't worry too much about this distinction for practical purposes, unless you're curious about the subject matter itself.

ADD COMMENTlink modified 4.2 years ago • written 4.2 years ago by Steven Lakin1.5k

Thanks for the clarification. I had gone looking for the documentation on pheatmap to figure this out but couldn't find a vignette anywhere describing the algorithms used. Agreed that it doesn't seem like much of a practical distinction; the motivation for finding this out was based more on the literal semantics of 'supervised' vs 'unsupervised' with regard to the experiment.

ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by steve2.7k
4.2 years ago by
EMBL Heidelberg, Germany
Jean-Karim Heriche24k wrote:

- unsupervised learning
- supervised learning

or this stats.stackexchange question.

ADD COMMENTlink modified 4.2 years ago • written 4.2 years ago by Jean-Karim Heriche24k