inconsistency of Mfuzz clustering
8.4 years ago
Assa Yeroslaviz ★ 1.8k

Hi,

I have been trying to understand the workflow behind Mfuzz clustering and have found something really weird.

As a comparison, I ran the exact same script three times to see how Mfuzz clusters the genes into similar profiles. I was astonished to find that, even when running the exact same lines with the same data set, the genes are assigned to different clusters (= expression profiles).

What I did was compare each cluster of one run with all the clusters of a second run. The result matrix of this comparison can be found here. I found that most clusters have one big group of overlapping genes plus some outliers, but other clusters are really divergent: their genes are spread more or less equally across many clusters of the other run.

Is this a known (accepted?) issue, or am I doing something wrong?

Has anyone encountered this problem before?

Thanks for your input,

Assa

expression-profile Mfuzz clustering • 2.9k views
8.3 years ago

I am not familiar with Mfuzz, but from what I read on the documentation page it is an implementation of fuzzy c-means. As with many other clustering algorithms, you are only guaranteed a local minimum, i.e. depending on your initialization and the structure of your data, you can end up with different solutions. It is customary to run such algorithms multiple times and keep either the result with the smallest error or the result found most often. I don't know whether Mfuzz does this for you, but if not, you should probably do it yourself.
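To make the restart idea concrete, here is a minimal sketch in plain Python. It is not Mfuzz and not its API: `fuzzy_cmeans`, `best_of_restarts`, and the toy data are hypothetical names and values I made up for illustration. Each run starts from different randomly chosen centers, and the run with the smallest fuzzy c-means objective is kept.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def update_memberships(points, centers, m):
    """Standard fuzzy c-means membership update (requires m > 1)."""
    power = 1.0 / (m - 1.0)  # exponent applied to squared distances
    rows = []
    for p in points:
        inv = [(1.0 / max(sq_dist(p, c), 1e-12)) ** power for c in centers]
        total = sum(inv)
        rows.append([v / total for v in inv])
    return rows

def fuzzy_cmeans(points, c, m=2.0, iters=100, seed=0):
    """Return (centers, memberships, objective) for one random start."""
    rng = random.Random(seed)
    centers = rng.sample(points, c)  # random start: the source of run-to-run variation
    dim = len(points[0])
    for _ in range(iters):
        u = update_memberships(points, centers, m)
        new_centers = []
        for k in range(c):
            w = [row[k] ** m for row in u]
            wsum = sum(w)
            new_centers.append(tuple(
                sum(wi * p[j] for wi, p in zip(w, points)) / wsum
                for j in range(dim)))
        centers = new_centers
    u = update_memberships(points, centers, m)
    objective = sum(u[i][k] ** m * sq_dist(points[i], centers[k])
                    for i in range(len(points)) for k in range(c))
    return centers, u, objective

def best_of_restarts(points, c, n_restarts=10, m=2.0):
    """Run several random starts and keep the one with the lowest objective."""
    runs = [fuzzy_cmeans(points, c, m=m, seed=s) for s in range(n_restarts)]
    return min(runs, key=lambda run: run[2]), runs

# Toy data (made up): three noisy 2-D blobs.
data_rng = random.Random(42)
points = [(cx + data_rng.gauss(0, 0.3), cy + data_rng.gauss(0, 0.3))
          for cx, cy in [(0, 0), (3, 0), (0, 3)] for _ in range(30)]

best, runs = best_of_restarts(points, c=3)
print("objective per restart:", [round(r[2], 3) for r in runs])
print("best objective:", round(best[2], 3))
```

In R, fixing the random seed before each call (for example with `set.seed`) should make a single run reproducible, but it does not tell you whether that run found a good minimum, which is why comparing several starts is still worthwhile.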

However, from your description of the clusters, it seems to me that fuzzy c-means can't find any structure in part of your data. Fuzzy c-means is best suited to finding clusters of roughly spherical shape. Try setting the fuzziness parameter m close to 1 (it has to stay strictly greater than 1); as m approaches 1 you get essentially the equivalent of k-means clustering. See if that gives you any sensible result. If not, your data doesn't have structure that can be discovered by this kind of algorithm.
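A small illustration of what the fuzzifier does, again as a sketch rather than Mfuzz code: in the standard fuzzy c-means membership formula, m must be greater than 1 because the exponent is 2/(m-1), and as m gets close to 1 the memberships become almost hard, k-means-like assignments. The function name `memberships` and the three distances are invented for the example.

```python
def memberships(distances, m):
    """Fuzzy c-means memberships of one point, given its distance to each center.

    Uses the standard update u_k = 1 / sum_j (d_k / d_j) ** (2 / (m - 1)).
    """
    if m <= 1.0:
        raise ValueError("the fuzzifier m must be strictly greater than 1")
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((dk / dj) ** exponent for dj in distances)
            for dk in distances]

# Hypothetical distances from one gene profile to three cluster centers.
distances = [1.0, 1.5, 3.0]

for m in (3.0, 2.0, 1.25, 1.05):
    print("m =", m, "->", [round(u, 3) for u in memberships(distances, m)])
```

With a large m the point is shared between all three clusters; with m near 1 nearly all of its membership goes to the closest center.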

Note also that the metric you use matters. For example, don't use Euclidean distance with high-dimensional noisy data, because you are very likely to run into the distance concentration issue: all pairwise distances become nearly equal, so "nearest" and "farthest" stop being meaningful.
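Distance concentration is easy to see on synthetic data. The sketch below uses uniformly random points (an assumption for illustration, not your expression data) and measures the relative contrast (max - min) / min of Euclidean distances from one query point; `relative_contrast` is a made-up helper name.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(max - min) / min of Euclidean distances from a random query point
    to n_points random points, all drawn uniformly from the unit cube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        p = [rng.random() for _ in range(dim)]
        dists.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(query, p))))
    return (max(dists) - min(dists)) / min(dists)

for dim in (2, 10, 100, 1000):
    print("dim =", dim, "relative contrast =", round(relative_contrast(dim), 3))
```

The contrast shrinks as the dimension grows, meaning the nearest and farthest points end up at almost the same distance. Real expression data is not uniform noise, so the effect may be weaker there, but it is worth checking before trusting Euclidean distances.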
