Question: Clustering data - data transformation (log2) highly improves clustering but why?
JJ440 wrote (21 months ago):

Hi all,

I have a dataset (ELISA data) covering quite a few analytes for patients and healthy controls. Almost all analytes differ highly significantly between patients and controls (non-parametric test). I produced a heatmap and the clustering was only so-so. After a log2 transformation, however, it is very good (I add a very small constant to all values to avoid -Inf, as I have quite a few zeros). If I also convert all ELISA data to the same unit before taking the log2, I get an almost perfect clustering, which is what I would have expected given that almost all analytes differ highly significantly between the two groups. But I am a bit worried that what I did is not OK, and I do not understand why the clustering improved so much.

Thanks


What is the nature of your data? Which distance/similarity measure and which clustering algorithm are you using? Log transformation is often used for skewed data. Have you looked at the distributions? It is also important to pay attention to the assumptions made by the clustering algorithm.

Thank you for your input. So my data is ELISA data.

• First, I convert all values to the same unit (ng/ml)
• Second, I add a small constant
• Third, I take the log2
• Fourth, I scale (scale function in R)

Then I am using the heatmap.2 function in R with distance measure 'euclidean' and agglomeration method 'ward.D2'.
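
For readers who want to see the preprocessing steps concretely, here is a minimal sketch in Python using only the standard library. It is not the original R code: the analyte names, the concentrations, and the pseudocount of 0.01 are all made-up assumptions, and the z-scoring only mirrors what R's scale() does by default (centre each column, then divide by the sample standard deviation).

```python
import math
import statistics

# Hypothetical toy data: one list per analyte, one value per sample,
# assumed to be already converted to a common unit (ng/ml).
raw = {
    "analyte_A": [0.0, 0.5, 1.2, 40.0, 95.0, 310.0],
    "analyte_B": [2.0, 3.5, 0.0, 120.0, 60.0, 800.0],
}

# Assumed small constant; a sensible value depends on the assay's detection limit.
PSEUDOCOUNT = 0.01


def log2_zscore(values, pseudocount=PSEUDOCOUNT):
    """Add a pseudocount, take log2, then centre and scale to unit sample SD."""
    logged = [math.log2(v + pseudocount) for v in values]
    mean = statistics.mean(logged)
    sd = statistics.stdev(logged)  # sample SD (n - 1 denominator)
    return [(x - mean) / sd for x in logged]


processed = {name: log2_zscore(vals) for name, vals in raw.items()}

for name, vals in processed.items():
    print(name, [round(v, 2) for v in vals])
```

The pseudocount matters more than it may seem: the smaller it is, the further the zeros are pushed away from the rest of the values on the log scale, so it is worth checking that the clustering is stable across a few reasonable choices.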

Yes, I had a look at the distribution of the data. Each analyte by itself is nowhere near normally distributed (hence the non-parametric test). But the values from all variables together form a nice bell-shaped curve after all the steps I have stated, and each step improves it. Without the log2 but with scaling, the distribution is still skewed; with the log2 it is no longer skewed. However, I didn't know that this could affect clustering that much. Could that be the reason?
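
The effect of the log2 step on skewness can be checked numerically. The sketch below is only an illustration on simulated data: it assumes a right-skewed (lognormal) variable, which may or may not resemble real ELISA measurements, and the `skewness` helper is a hand-written moment-based estimate rather than a library function.

```python
import math
import random
import statistics


def skewness(values):
    """Moment-based sample skewness: mean of cubed standardised values."""
    m = statistics.mean(values)
    sd = statistics.pstdev(values)
    return sum(((v - m) / sd) ** 3 for v in values) / len(values)


random.seed(1)  # fixed seed so the simulation is repeatable

# Simulated right-skewed concentrations (assumption: lognormal shape).
raw = [random.lognormvariate(2.0, 1.0) for _ in range(2000)]
logged = [math.log2(v) for v in raw]

skew_raw = skewness(raw)
skew_log = skewness(logged)

print(f"skewness before log2: {skew_raw:.2f}")
print(f"skewness after log2:  {skew_log:.2f}")
```

Scaling alone does not change the skewness, because centring and dividing by a constant leaves the shape of the distribution untouched. That is consistent with the observation that the data stay skewed with scaling but without the log2.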


Ward's method, like k-means, favours roughly spherical clusters. Data with heavily skewed variables may lead to very elongated clusters that are not well captured by this method. Taking the log of a variable reduces right skew and typically brings the distribution closer to normal. You could try alternative clustering methods that are less sensitive to skewness, such as single or average linkage. If the log-transformed data are close to normally distributed, you could also run your statistical tests on the log-transformed data; parametric tests would then give you more power.
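
A small numeric illustration of the elongated-cluster problem, under made-up assumptions: the three samples and their concentrations below are hypothetical, and the code only compares Euclidean distances rather than running Ward's method itself. It shows that on the raw scale two patients with high values can sit further apart than a patient and a control, while on the log2 scale the ordering flips.

```python
import math

# Hypothetical concentrations (ng/ml) for two analytes per sample.
control = (1.0, 2.0)
patient_a = (100.0, 150.0)
patient_b = (1000.0, 1200.0)


def log2_point(point):
    """Apply log2 to every coordinate (no zeros here, so no pseudocount needed)."""
    return tuple(math.log2(v) for v in point)


# Euclidean distances on the raw scale.
raw_within = math.dist(patient_a, patient_b)   # patient vs patient
raw_between = math.dist(patient_a, control)    # patient vs control

# The same distances after the log2 transformation.
log_within = math.dist(log2_point(patient_a), log2_point(patient_b))
log_between = math.dist(log2_point(patient_a), log2_point(control))

print(f"raw:  within patients = {raw_within:.1f}, patient vs control = {raw_between:.1f}")
print(f"log2: within patients = {log_within:.2f}, patient vs control = {log_between:.2f}")
```

On the raw scale the absolute spread among high values dominates the distance, so the patient group is stretched out. On the log scale distances reflect fold changes, which is closer to what a method favouring compact, roughly spherical clusters can work with.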