3.8 years ago
biohacker_tobe
I am interested in predicting the stability of my proteins using a KNN regression model. However, instead of the sequences themselves, I would like to use the Levenshtein distances between my proteins as the input variables for the model. I would prefer not to one-hot encode the sequences if possible.
My input table looks like this:
new_host  sequence   expression
FALSE     AQVPYGVS   0.039267878
FALSE     ASVPYGVSI  0.039267878
FALSE     STNLYGSGR  0.261456561
FALSE     NLYGSGLVR  0.265188519
FALSE     SLGPSNLYG  0.419680588
FALSE     ATSLGTTNG  0.145710993
For my output, I'm not sure whether I can actually get a single distance for each sequence, or whether such a distance could be used as the input to my KNN regression model.
My function to calculate the Levenshtein distance:
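import numpy as np

def levenshtein(seq1, seq2):
    # matrix[x, y] holds the edit distance between seq1[:x] and seq2[:y].
    size_x = len(seq1) + 1
    size_y = len(seq2) + 1
    matrix = np.zeros((size_x, size_y))
    # Base cases: converting a prefix to or from the empty string.
    for x in range(size_x):
        matrix[x, 0] = x
    for y in range(size_y):
        matrix[0, y] = y
    for x in range(1, size_x):
        for y in range(1, size_y):
            if seq1[x-1] == seq2[y-1]:
                # Characters match: the diagonal step costs nothing.
                matrix[x, y] = min(
                    matrix[x-1, y] + 1,
                    matrix[x-1, y-1],
                    matrix[x, y-1] + 1
                )
            else:
                # Characters differ: deletion, substitution, or insertion.
                matrix[x, y] = min(
                    matrix[x-1, y] + 1,
                    matrix[x-1, y-1] + 1,
                    matrix[x, y-1] + 1
                )
    print(matrix)
    return matrix[size_x - 1, size_y - 1]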
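To make the question about per-sequence distances more concrete, here is a minimal sketch of how pairwise Levenshtein distances could feed a nearest-neighbour regression. It is not a definitive implementation: `knn_predict`, the choice of `k=2`, and the query sequence are assumptions made for illustration, and the compact `levenshtein` below is a pure-Python rewrite so that the block runs on its own. The idea is that a sequence does not get one distance but a row of distances to every training sequence, and kNN regression only needs those distances to pick neighbours.

```python
def levenshtein(a, b):
    """Edit distance using two rolling rows instead of the full matrix."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def knn_predict(query, train_seqs, train_y, k=3):
    """Average the targets of the k training sequences closest to query.

    Hypothetical helper: plain unweighted kNN regression on edit distances.
    """
    dists = [levenshtein(query, s) for s in train_seqs]
    nearest = sorted(range(len(train_seqs)), key=lambda i: dists[i])[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


# Sequences and expression values copied from the input table.
seqs = ["AQVPYGVS", "ASVPYGVSI", "STNLYGSGR",
        "NLYGSGLVR", "SLGPSNLYG", "ATSLGTTNG"]
expr = [0.039267878, 0.039267878, 0.261456561,
        0.265188519, 0.419680588, 0.145710993]

# Pairwise distance matrix: row i holds the distances from seqs[i]
# to every sequence, so each protein is described by a vector of distances.
dist_matrix = [[levenshtein(s, t) for t in seqs] for s in seqs]

# "AQVPYGVSI" is a made-up query sequence, used only for illustration.
prediction = knn_predict("AQVPYGVSI", seqs, expr, k=2)
```

If scikit-learn is used instead, a square matrix like `dist_matrix` is the kind of input a regressor configured for precomputed distances would expect, but that route is not shown here.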
I would like to measure the nonconformity of each instance, based on the Levenshtein distances, with the following function:
import numpy as np
from scipy.spatial.distance import cdist

def compute(self, z, Z):
    """Return k-Nearest Neighbours (kNN) nonconformity measure.

    Parameters
    ----------
    z : array-like, shape (n_features,)
        Test vector, where n_features is the number of features.
    Z : array-like, shape (n_samples, n_features)
        Training vectors, where n_samples is the number of samples,
        n_features is the number of features.

    Returns
    -------
    r : float
        kNN nonconformity measure on z with respect to Z.
    """
    # Compute the distances between z and each row of Z,
    # then sum the k smallest of them.
    dist = cdist(Z, [z])[:, 0]
    r = np.sort(dist)[:self.k].sum()
    return r
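One caveat with the function above is that `cdist` with its default Euclidean metric expects numeric feature vectors, so raw sequence strings cannot be passed in directly. As a rough sketch of the same nonconformity measure computed on Levenshtein distances instead, the block below uses a hypothetical standalone function `knn_nonconformity` (not part of any library) and a compact pure-Python `levenshtein` so that it runs on its own. The value of `k` and the query sequence are assumptions.

```python
def levenshtein(a, b):
    """Edit distance using two rolling rows instead of the full matrix."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def knn_nonconformity(z, Z, k=3):
    """Sum of the k smallest Levenshtein distances from sequence z to Z.

    Hypothetical string-based counterpart of compute(): z is one sequence,
    Z is a list of training sequences. A larger value means z is further
    from its nearest training sequences.
    """
    dists = sorted(levenshtein(z, s) for s in Z)
    return sum(dists[:k])


# Training sequences copied from the input table.
train = ["AQVPYGVS", "ASVPYGVSI", "STNLYGSGR",
         "NLYGSGLVR", "SLGPSNLYG", "ATSLGTTNG"]

# "AQVPYGVSI" is a made-up test sequence, used only for illustration.
r = knn_nonconformity("AQVPYGVSI", train, k=2)
```

When the test sequence is itself part of the training list, its zero distance to itself is counted among the k smallest, so it may be worth removing it from `Z` before calling the function.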