Question

Closed: Using Levenshtein Distances to determine predictions using kNN Regression

3.8 years ago

biohacker_tobe ▴ 80

I want to predict the stability of my proteins with a kNN regression model. Rather than feeding the sequences themselves to the model, I would like to use the Levenshtein distances between them as the representation (embedding) of my proteins, and use those distances as the input variables. I would prefer not to one-hot encode the sequences if possible.

My input table looks like this:

new_host  sequence    expression
FALSE     AQVPYGVS    0.039267878
FALSE     ASVPYGVSI   0.039267878
FALSE     STNLYGSGR   0.261456561
FALSE     NLYGSGLVR   0.265188519
FALSE     SLGPSNLYG   0.419680588
FALSE     ATSLGTTNG   0.145710993

I'm not sure whether I can actually get a single distance for each sequence, or whether such distances could be used as the input to my kNN regression model.

My function to calculate the Levenshtein distance:

import numpy as np

def levenshtein(seq1, seq2):
    """Return the Levenshtein (edit) distance between seq1 and seq2."""
    size_x = len(seq1) + 1
    size_y = len(seq2) + 1
    # matrix[x, y] holds the distance between seq1[:x] and seq2[:y].
    matrix = np.zeros((size_x, size_y), dtype=int)
    for x in range(size_x):
        matrix[x, 0] = x
    for y in range(size_y):
        matrix[0, y] = y

    for x in range(1, size_x):
        for y in range(1, size_y):
            # A substitution costs 0 when the residues match, 1 otherwise.
            cost = 0 if seq1[x - 1] == seq2[y - 1] else 1
            matrix[x, y] = min(
                matrix[x - 1, y] + 1,         # deletion
                matrix[x - 1, y - 1] + cost,  # match or substitution
                matrix[x, y - 1] + 1,         # insertion
            )
    return int(matrix[size_x - 1, size_y - 1])
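
To illustrate how a pairwise distance like the one above could feed a kNN regression directly, without any one-hot encoding, here is a minimal standard-library sketch. It is only one possible setup, not a definitive implementation: `edit_distance` and `knn_predict` are hypothetical helper names, the query sequence is made up for the example, and the prediction is simply the unweighted mean of the `k` nearest training values.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with two rolling rows (pure Python)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def knn_predict(query, train_seqs, train_values, k=3):
    """Predict a value for `query` as the mean of its k nearest neighbours."""
    dists = [edit_distance(query, s) for s in train_seqs]
    # Indices of the k training sequences closest to the query.
    nearest = sorted(range(len(train_seqs)), key=lambda i: dists[i])[:k]
    return sum(train_values[i] for i in nearest) / len(nearest)


# Sequences and expression values copied from the input table.
train_seqs = ["AQVPYGVS", "ASVPYGVSI", "STNLYGSGR",
              "NLYGSGLVR", "SLGPSNLYG", "ATSLGTTNG"]
train_values = [0.039267878, 0.039267878, 0.261456561,
                0.265188519, 0.419680588, 0.145710993]

# Hypothetical query sequence, not part of the original data.
prediction = knn_predict("AQVPYGVSI", train_seqs, train_values, k=2)
print(prediction)
```

In this setup each sequence is never turned into a fixed-length feature vector; the model only ever sees distances between the query and the training sequences.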

I wish to measure the nonconformity of each instance based on the Levenshtein distances, using the following function:

    def compute(self, z, Z):
        """Return k-Nearest Neighbours (kNN) nonconformity measure.

        Parameters
        ----------
        z : array-like, shape (n_features,)
            Test vector, where n_features is the number of features.
        Z : array-like, shape (n_samples, n_features)
            Training vectors, where n_samples is the number of samples,
            n_features is the number of features.
        Returns
        -------
        r : float
            kNN nonconformity measure on z with respect to Z.
        """
        # Sum the k smallest distances between z and the rows of Z.
        dist = cdist(Z, [z])[:,0]
        r = np.sort(dist)[:self.k].sum()

        return r
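
The `compute` method above relies on `cdist`, which operates on numeric feature vectors. As a rough sketch of how the same nonconformity measure (the sum of the `k` smallest distances) might look if the distances were Levenshtein distances between raw sequences, consider the standalone version below. The names `edit_distance` and `knn_nonconformity` are hypothetical, and the test sequence is made up for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with two rolling rows (pure Python)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def knn_nonconformity(z, Z, k=3):
    """Sum of the k smallest Levenshtein distances from sequence z to those in Z."""
    dists = sorted(edit_distance(z, s) for s in Z)
    return sum(dists[:k])


# Training sequences copied from the input table.
Z = ["AQVPYGVS", "ASVPYGVSI", "STNLYGSGR",
     "NLYGSGLVR", "SLGPSNLYG", "ATSLGTTNG"]

# Hypothetical test sequence: one edit away from each of the first two rows.
score = knn_nonconformity("AQVPYGVSI", Z, k=2)
print(score)  # 2
```

A larger score means the test sequence is further, in edit-distance terms, from its closest training sequences.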

machine-learning sequence python alignment • 332 views
