Question

On what condition(s) should a K-medoids algorithms stop when clustering organisms ?

1

Entering edit mode

8.0 years ago

nlehmann ▴ 140

Hi everyone !

I am working on a K-medoid algorithm, but I have trouble understanding how unsupervised learning works, your help will be more than welcome.

Each index is an organism. So if length of the matrix is 5 x 5, then there are 5 organisms to be compared (in my real dataset I have a lot more ^^), and index 0 means first organism and so on.

The distance matrix represents the distance of Manhattan calculated between two organisms, from the gene function present or absent (COG).

Here is my code (which works so far), I couldn't do the last part - the iteration (this is just a start) - because I can't figure out when my loop should stop : should I put a threshold myself, like the number of iteration maximum ?

Is there an automatic way to do that ?

What do you think of this option (line while not ((old_medoids == curr_medoids).all()): ) ? I think it's too ramdom, just expecting that we get twice the same group of medoids...

And how does the computer "learn" not to go back the points that doesn't optimise the algorithm, and keep the "profitable" points ? Should I create a list of elements "already seen - not to be done" ?

And last optional question : I need to create a function to optimize k (k being the number of clusters chosen, by the user or by the computer). I do not have a clue how I could do that.

If you have any idea in any of those points or comments about my code, thank you in advance :)

import random as rd
import numpy as np


# DATA = distances matrix
# matrix [2][1] => distance between point 2 and point 1
# matrix is symetric => matrix [2][1] is the same as [1][2] but we only use [2][1]
distances = np. array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.64745477,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.36053849,  0.57299117,  0.        ,  0.        ,  0.        ],
       [ 0.3580143 ,  0.59402608,  0.18258309,  0.        ,  0.        ],
       [ 0.35128313,  0.615061  ,  0.23727387,  0.24821203,  0.        ]])


def initialize_medoids(distances, k=2):
    '''
    Take the distances matrix, send back a list of k random elements => called medoids
    => example : [4, 2]
    '''
    first_medoids = rd.sample([i for i in range(len(distances))], k)
    return first_medoids

def associate_point_to_medoids(current_medoids, distances):
    '''
    Each point (not in the medoid list) is associated with his closest medoid
    => result in a list of lists : [[medoid1, his closest medoids], [medoid2, his closest medoids]]
    => example : [[4, 1], [2, 3, 0]]
    '''
    configuration = [[x] for x in current_medoids]
    for i in range(len(distances)):
        if i not in current_medoids:
            d = 1
            associate =  -1
            for j in range(len(current_medoids)):
                if i > current_medoids[j]  :
                    if d > distances[i][j]:
                        d = distances[i][j]
                        associate = j
                if i < current_medoids[j] :
                    if d > distances[j][i]:
                        d = distances[j][i]
                        associate = j
            configuration[associate].append(i)
    return configuration

def calculate_cost(configuration, distances):
    '''
    Sum the distances between points and their respective medoids.
    => take this type of argument : [[4, 1], [2, 3, 0]]
    '''
    summ = 0
    for elt in configuration :
        for i in range(1, len(elt)):
            if elt[0] > elt[i] :
                summ += distances[elt[0]][elt[i]]
            else :
                summ += distances[elt[i]][elt[0]]
    return summ

def switch_medoids(n, current_medoids):       
    '''
     n : define which medoids we're going to change
     => send back new_medoids
    '''
    new_medoids = list(current_medoids)
    index = new_medoids.index(n)
    new_medoids[index] = rd.choice([i for i in range(len(distances)) if i not in current_medoids])
    return new_medoids

first_medoids = initialize_medoids(distances)
clusters = associate_point_to_medoids(first_medoids, distances)
old_cost = calculate_cost(clusters, distances)

# iteration manual...
n = first_medoids[1]
new_medoids = switch_medoids(n, first_medoids)
new_clusters = associate_point_to_medoids(new_medoids, distances)
new_cost = calculate_cost(new_clusters, distances)
if new_cost > old_cost :
    # keep the old medoid  => change only the index
    n = first_medoids[0]
    print 'no, the old configuration was good'
    #reiterate
else :
    # keep the new medoid => we can keep the same index or change it
    n = new_medoids[1]
    print 'yes, keep this one'
    # reiterate

Here are a sample of the kind of results that can be obtained :

first_medoids
Out[358]: [3, 4]

clusters
Out[359]: [[3, 0, 1, 2], [4]]

old_cost
Out[360]: 1.13462347

new_medoids
Out[361]: [3, 1]

new_clusters
Out[362]: [[3, 0, 2, 4], [1]]

new_cost
Out[363]: 0.78880942000000009

PAM Clustering K-medoids Unsupervised learning • 5.4k views

ADD COMMENT • link updated 8.0 years ago by Jean-Karim Heriche 27k • written 8.0 years ago by nlehmann ▴ 140

score 3 · Answer 1 · 2016-04-29

I am not sure this post belongs here as this is not a bioinformatics question per se but I'll try to give you some pointers. k-medoids clustering is usually done using the partitioning around medoids (PAM) algorithm which is guaranteed to converge to a local minimum and this is considered reached when there's no change in the clusters and since the clusters are defined by their medoids, you can check that these haven't changed. In some cases, convergence can be very slow so most implementations stop after a predefined maximum number of iterations. For ideas, you can have a look at my implementation of k-means in my GMM.pm perl module or here and here for some explanations on the algorithm.

I am not sure I quite understand the question about "not going back". The algorithm guarantees that at each iteration the solution is better than the previous one. Note that since the algorithm can converge to a local optimum, the result depends on the starting conditions (i.e. the choice of initial medoids) so it is usually a good idea to run the clustering several times with different starting points.

The choice of the number of clusters in unsupervised learning is a difficult question and has a long history. The question is ill-posed because very often there is no ground truth and the granularity, i.e. the number and size of the clusters, depends on what you want out of the data. For example, imagine clustering some proteins into functional modules, some biologists may consider that the 80S ribosome is the module of interest whereas others may wish to distinguish between the 40S and 60S subunits. From a biological point of view, both are valid clustering outcomes but depending on the algorithm and the choice of k, you may get one or the other.

Finally, you should be aware that there are already implementations of k-medoids in most programming languages:

R: kmeans(...,algorithm="Lloyd")

perl: Algorithm::Cluster

python: on GitHub here