Question

Gene Expression Analyses and K-Means Clustering

9.6 years ago

Jay • 0

I understand that K-means clustering is used very often for gene expression analysis, and that dissimilarity is usually measured by Euclidean distance. Are there any particular applications in which Euclidean distance may not be the most appropriate tool for clustering? Is there any other way dissimilarity in gene expression can be measured?

gene • 2.2k views

updated 9.6 years ago by Jean-Karim Heriche 27k • written 9.6 years ago by Jay • 0

score 0 · Answer 1 · 2016-04-17

There are plenty of measures to choose from. See, for example, the R function dist() for some commonly used ones (the R package proxy has more). The main problem with Euclidean distance is that, on noisy data, pairwise distances tend towards a constant as the number of dimensions increases, so the distance stops discriminating between near and far points and becomes useless. This is known as distance concentration. More on this here. All commonly used distance or similarity measures suffer from it to various degrees. What makes analysis possible despite this is the presence of structures/patterns in the data. In my experience, however, the cosine distance (i.e. 1 - cos, where cos is the cosine of the angle between the two expression vectors) is more resistant to the concentration phenomenon than others, i.e. in the presence of noise, it may allow you to find meaningful clusters where Euclidean distance would fail.
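To get a feel for distance concentration, here is a minimal Python sketch using only the standard library. It is an illustration under assumptions, not a benchmark: the "expression profiles" are pure Gaussian noise, and the helper names (`relative_contrast`, `noise_profiles`) and the contrast statistic (max - min) / min are choices made for this example. A contrast near 0 means all pairwise distances look alike.

```python
import math
import random


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_distance(x, y):
    """Cosine distance: 1 minus the cosine of the angle between x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def relative_contrast(points, dist):
    """(max - min) / min over all pairwise distances (hypothetical helper)."""
    d = [
        dist(points[i], points[j])
        for i in range(len(points))
        for j in range(i + 1, len(points))
    ]
    return (max(d) - min(d)) / min(d)


def noise_profiles(n_points, n_dims, rng):
    """Simulated profiles made of pure Gaussian noise (assumption: no structure)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n_dims)] for _ in range(n_points)]


rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
for n_dims in (2, 10, 100, 1000):
    pts = noise_profiles(20, n_dims, rng)
    print(
        n_dims,
        round(relative_contrast(pts, euclidean), 3),
        round(relative_contrast(pts, cosine_distance), 3),
    )
```

On pure noise the contrast should shrink for both measures as the number of dimensions grows, which matches the point that all common measures concentrate to some degree. Whether cosine distance holds up better on real data depends on the structure present, so it is worth swapping in your own expression matrix for `noise_profiles` and comparing.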