Hi everyone !
I am working on a K-medoid algorithm, but I have trouble understanding how unsupervised learning works, your help will be more than welcome.
Each index is an organism. So if length of the matrix is 5 x 5, then there are 5 organisms to be compared (in my real dataset I have a lot more ^^), and index 0 means first organism and so on.
The distance matrix represents the distance of Manhattan calculated between two organisms, from the gene function present or absent (COG).
Here is my code (which works so far), I couldn't do the last part - the iteration (this is just a start) - because I can't figure out when my loop should stop : should I put a threshold myself, like the number of iteration maximum ?
Is there an automatic way to do that ?
What do you think of this option (line
while not ((old_medoids == curr_medoids).all()): ) ? I think it's too ramdom, just expecting that we get twice the same group of medoids...
And how does the computer "learn" not to go back the points that doesn't optimise the algorithm, and keep the "profitable" points ? Should I create a list of elements "already seen - not to be done" ?
And last optional question : I need to create a function to optimize k (k being the number of clusters chosen, by the user or by the computer). I do not have a clue how I could do that.
If you have any idea in any of those points or comments about my code, thank you in advance :)
import random as rd import numpy as np # DATA = distances matrix # matrix  => distance between point 2 and point 1 # matrix is symetric => matrix  is the same as  but we only use  distances = np. array([[ 0. , 0. , 0. , 0. , 0. ], [ 0.64745477, 0. , 0. , 0. , 0. ], [ 0.36053849, 0.57299117, 0. , 0. , 0. ], [ 0.3580143 , 0.59402608, 0.18258309, 0. , 0. ], [ 0.35128313, 0.615061 , 0.23727387, 0.24821203, 0. ]]) def initialize_medoids(distances, k=2): ''' Take the distances matrix, send back a list of k random elements => called medoids => example : [4, 2] ''' first_medoids = rd.sample([i for i in range(len(distances))], k) return first_medoids def associate_point_to_medoids(current_medoids, distances): ''' Each point (not in the medoid list) is associated with his closest medoid => result in a list of lists : [[medoid1, his closest medoids], [medoid2, his closest medoids]] => example : [[4, 1], [2, 3, 0]] ''' configuration = [[x] for x in current_medoids] for i in range(len(distances)): if i not in current_medoids: d = 1 associate = -1 for j in range(len(current_medoids)): if i > current_medoids[j] : if d > distances[i][j]: d = distances[i][j] associate = j if i < current_medoids[j] : if d > distances[j][i]: d = distances[j][i] associate = j configuration[associate].append(i) return configuration def calculate_cost(configuration, distances): ''' Sum the distances between points and their respective medoids. => take this type of argument : [[4, 1], [2, 3, 0]] ''' summ = 0 for elt in configuration : for i in range(1, len(elt)): if elt > elt[i] : summ += distances[elt][elt[i]] else : summ += distances[elt[i]][elt] return summ def switch_medoids(n, current_medoids): ''' n : define which medoids we're going to change => send back new_medoids ''' new_medoids = list(current_medoids) index = new_medoids.index(n) new_medoids[index] = rd.choice([i for i in range(len(distances)) if i not in current_medoids]) return new_medoids first_medoids = initialize_medoids(distances) clusters = associate_point_to_medoids(first_medoids, distances) old_cost = calculate_cost(clusters, distances) # iteration manual... n = first_medoids new_medoids = switch_medoids(n, first_medoids) new_clusters = associate_point_to_medoids(new_medoids, distances) new_cost = calculate_cost(new_clusters, distances) if new_cost > old_cost : # keep the old medoid => change only the index n = first_medoids print 'no, the old configuration was good' #reiterate else : # keep the new medoid => we can keep the same index or change it n = new_medoids print 'yes, keep this one' # reiterate
Here are a sample of the kind of results that can be obtained :
first_medoids Out: [3, 4] clusters Out: [[3, 0, 1, 2], ] old_cost Out: 1.13462347 new_medoids Out: [3, 1] new_clusters Out: [[3, 0, 2, 4], ] new_cost Out: 0.78880942000000009