Question: Clustering data - data transformation (log2) highly improves clustering but why?
JJ430 wrote, 15 months ago:

Hi all,

So I have a dataset (ELISA data) of quite a few analytes measured in patients and healthy controls. Almost all analytes differ highly significantly between patients and controls (non-parametric test). So I produced a heatmap and the clustering was okay-ish. However, after log2 transformation it is very good (I add a very small constant to all values to avoid -Inf, since I have quite a few zeros). If I convert all ELISA data to the same unit before taking the log2, I get an almost perfect clustering, which I would have expected since almost all analytes are highly significantly different between the two groups. But I am a bit worried that what I did is not OK, and I do not understand why the clustering improved so much.

Any advice/input is highly appreciated!


modified 15 months ago • written 15 months ago by JJ430

What is the nature of your data? What distance/similarity measure and what clustering algorithm are you using? Log transformation is often used for skewed data. Have you looked at the distributions? It is also important to pay attention to the assumptions made by the clustering algorithm.

written 15 months ago by Jean-Karim Heriche

Thank you for your input. So my data is ELISA data.

  • First, I convert all values to the same unit (ng/ml)
  • Second, I add a small constant
  • Third, I take the log2
  • Fourth, I scale (scale function in R)
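For reference, the four preprocessing steps above can be sketched in Python/NumPy on synthetic data (the matrix `X` and the constant `eps` are illustrative placeholders, not the actual ELISA values; in R the last step is just `scale()`):

```python
import numpy as np

# Hypothetical ELISA matrix: rows = samples, columns = analytes,
# already converted to a common unit (ng/ml). Values are synthetic.
rng = np.random.default_rng(0)
X = rng.lognormal(mean=2.0, sigma=1.0, size=(20, 5))
X[0, 0] = 0.0                       # zeros do occur in real ELISA data

eps = 0.01                          # small constant so log2(0) does not give -Inf
X_log = np.log2(X + eps)

# Column-wise z-scoring, the equivalent of R's scale() with default arguments
X_scaled = (X_log - X_log.mean(axis=0)) / X_log.std(axis=0, ddof=1)
```

After these steps every column has mean 0 and unit standard deviation, so no single analyte dominates the Euclidean distances.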

Then I am using the heatmap.2 function in R with distance measure 'euclidean' and agglomeration method 'ward.D2'.
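For readers outside R: the same distance/linkage combination is available in SciPy, where `method='ward'` applied to Euclidean distances corresponds, as far as I know, to hclust's 'ward.D2'. A minimal sketch on synthetic two-group data (not the real ELISA matrix):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Synthetic stand-in for the scaled ELISA matrix: two well-separated groups
rng = np.random.default_rng(1)
controls = rng.normal(0.0, 1.0, size=(10, 5))
patients = rng.normal(4.0, 1.0, size=(10, 5))
X = np.vstack([controls, patients])

# Euclidean distances + Ward linkage, i.e. heatmap.2's dist()/'ward.D2' setup
Z = linkage(pdist(X, metric='euclidean'), method='ward')
labels = fcluster(Z, t=2, criterion='maxclust')
```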

Yes, I had a look at the distribution of the data. Each analyte by itself is nowhere near normally distributed (hence the non-parametric test). But all values from all variables together form a nice bell-shaped curve after the steps I listed, and each step improves it. Without the log2 but with scaling, the distribution is still skewed; with the log2 it is not skewed anymore. However, I didn't know that this could affect clustering that much. Could that be the reason why?
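That the log transform removes the skew can be checked numerically, e.g. with `scipy.stats.skew` on synthetic lognormal data (a sketch, not the real measurements):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(2)
x = rng.lognormal(mean=0.0, sigma=1.0, size=5000)  # heavily right-skewed

s_raw = skew(x)            # strongly positive skewness
s_log = skew(np.log2(x))   # near zero: the log of lognormal data is normal
```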

modified 15 months ago • written 15 months ago by JJ430

Ward's method, like k-means, favours roughly spherical clusters. Heavily skewed variables can produce very elongated clusters that are not well captured by this method. Taking the log of a variable reduces the skewness and typically brings the distribution closer to normal. You could try clustering methods less sensitive to skewness, such as single or average linkage. Also, if the log-transformed data is close to normally distributed, you could run your statistical tests on the log-transformed data; parametric tests would give you more power.
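A quick way to compare linkage criteria is to run the same Euclidean distance matrix through several methods, e.g. in SciPy (synthetic data with a deliberately large group separation):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 1.0, (15, 4)),
               rng.normal(4.0, 1.0, (15, 4))])
D = pdist(X, metric='euclidean')

# Same distances, three linkage criteria with different shape sensitivity:
# 'single' tends to chain, 'average' is a compromise, 'ward' favours
# compact spherical clusters of similar size.
results = {}
for method in ('single', 'average', 'ward'):
    Z = linkage(D, method=method)
    results[method] = fcluster(Z, t=2, criterion='maxclust')
```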

written 15 months ago by Jean-Karim Heriche

Thanks for your input. I originally tried both non-parametric and parametric tests, but the difference was negligible.

written 15 months ago by JJ430


Powered by Biostar version 2.3.0