Question

inconsistency of Mfuzz clustering

8.4 years ago

Assa Yeroslaviz ★ 1.8k

Hi,

I have been trying to understand the workflow behind the Mfuzz clustering and have found something really weird.

Just as a comparison, I ran the exact same script three times to see how Mfuzz clusters the genes into similar profiles. I was astonished to find that even when running the exact same lines with the same data set, the genes are assigned to different clusters (= expression profiles).

What I did was compare each cluster of one run with all the clusters of a second run. The result matrix of this comparison can be found here. I found that most clusters have one big group of overlapping genes plus some outliers, but other clusters are really divergent and are spread almost equally across many clusters of the other run.

Is this a known (accepted?) issue, or am I doing something wrong?

Has anyone encountered this problem before?

Thanks for your input,

Assa

expression-profile Mfuzz clustering • 2.9k views

updated 20 months ago by Ram 43k • written 8.4 years ago by Assa Yeroslaviz ★ 1.8k

Ram · Answer 1 · 2015-12-26

I am not familiar with Mfuzz, but from what I read on the documentation page it is an implementation of fuzzy c-means. As with many other clustering algorithms, you are only guaranteed a local minimum, i.e. depending on your initialization and the structure of your data, you can end up with different solutions. It is customary to run such algorithms multiple times and keep either the result with the smallest error (the lowest value of the objective function) or the result found most often. I don't know if Mfuzz does this for you, but if not then you should probably do it yourself.
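To make the "run it several times and keep the best" idea concrete, here is a minimal sketch in Python/NumPy. It is a plain textbook fuzzy c-means, not Mfuzz's own code or API; the function names `fuzzy_cmeans` and `best_of_restarts`, the random-membership initialization, and the default parameters are all my assumptions for illustration.

```python
import numpy as np

def fuzzy_cmeans(X, c, m=2.0, n_iter=100, tol=1e-6, rng=None):
    """Textbook fuzzy c-means; returns (centers, memberships, objective)."""
    rng = np.random.default_rng(rng)
    # Random initial memberships; each row sums to 1.
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        # Centers are membership-weighted means of the data.
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Squared Euclidean distances, floored to avoid division by zero.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)
        # Standard membership update for fuzzifier m > 1.
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    # Objective: membership-weighted within-cluster squared distance.
    objective = float(((U ** m) * d2).sum())
    return centers, U, objective

def best_of_restarts(X, c, m=2.0, n_restarts=10, seed=0):
    """Run several differently seeded starts and keep the lowest objective."""
    runs = [fuzzy_cmeans(X, c, m=m, rng=seed + r) for r in range(n_restarts)]
    return min(runs, key=lambda run: run[2])
```

Because every restart gets an explicit seed, the whole procedure is reproducible from one run of the script to the next, which is exactly what is missing when the initialization is left to chance.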

However, from your description of the clusters, it seems to me that fuzzy c-means can't find much structure in your data. Fuzzy c-means is best suited to finding roughly spherical clusters. Try setting the fuzziness parameter m close to 1 (it has to stay strictly above 1, since the membership update is undefined at exactly 1); as m approaches 1 the result approaches ordinary k-means clustering. See if you get any sensible result. If not, then your data doesn't have structure that can be discovered by this kind of algorithm.
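As a small illustration of how the fuzzifier behaves near 1, the sketch below computes the standard fuzzy c-means memberships for fixed centers. The helper name `memberships` and the toy numbers are mine, and this is not how Mfuzz exposes it; it only shows the formula.

```python
import numpy as np

def memberships(X, centers, m):
    """Fuzzy c-means memberships for fixed centers (requires m > 1)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    d2 = np.maximum(d2, 1e-12)
    # Work in log space so a large exponent 1/(m-1) cannot overflow.
    logits = -np.log(d2) / (m - 1.0)
    logits -= logits.max(axis=1, keepdims=True)
    U = np.exp(logits)
    return U / U.sum(axis=1, keepdims=True)

# One point at x = 1, two centers at x = 0 and x = 3.
X = np.array([[1.0]])
centers = np.array([[0.0], [3.0]])

print(memberships(X, centers, m=2.0))   # shared membership
print(memberships(X, centers, m=1.05))  # almost a hard assignment
```

With m = 2 the point is split between both centers, while with m = 1.05 nearly all of its membership goes to the nearer center, which is the k-means-like behaviour.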

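Note also that the distance metric you use matters. For example, don't use Euclidean distance with high-dimensional noisy data, because you're very likely to run into the distance concentration issue, where all pairwise distances become nearly equal and stop discriminating between near and far points.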
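If you want to see distance concentration for yourself, here is a rough sketch on uniform random data (not expression data; the function name `relative_contrast` and the sample sizes are arbitrary choices of mine). It measures how much farther the farthest point is than the nearest one, relative to the nearest.

```python
import numpy as np

def relative_contrast(dim, n_points=500, seed=0):
    """(max - min) / min of Euclidean distances from one query point."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, dim))
    query = rng.random(dim)
    dist = np.linalg.norm(points - query, axis=1)
    return (dist.max() - dist.min()) / dist.min()

# The contrast shrinks as the number of dimensions grows.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 2))
```

When this ratio gets small, "nearest" and "farthest" are almost the same distance away, so a distance-based clustering has little to work with.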