Closed: Using Levenshtein Distances to determine predictions using kNN Regression
3.8 years ago

I am interested in predicting the stability of my proteins with a kNN regression model. Instead of using the sequences themselves as the input variables, I would like to use the Levenshtein distances between the sequences as the embedding of my proteins. I would prefer not to one-hot encode them if possible.

My input table looks like this:

new_host  sequence    expression
FALSE     AQVPYGVS    0.039267878
FALSE     ASVPYGVSI   0.039267878
FALSE     STNLYGSGR   0.261456561
FALSE     NLYGSGLVR   0.265188519
FALSE     SLGPSNLYG   0.419680588
FALSE     ATSLGTTNG   0.145710993

For my output, I'm not sure whether I can actually get a single distance for each sequence, or whether such a distance could be used as the input to my kNN regression model.
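To make the question concrete: a Levenshtein distance is defined between a pair of sequences, so each sequence ends up with a row of distances to the other sequences rather than one number of its own. Below is a minimal sketch, in plain Python, of building that pairwise matrix for the six sequences in the table. The two-row `levenshtein` helper here is a compact stand-in written for this illustration, and the name `dist_matrix` is an assumption, not something from my actual pipeline.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b, keeping only two rows in memory."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


# Sequences copied from the input table above.
sequences = ["AQVPYGVS", "ASVPYGVSI", "STNLYGSGR",
             "NLYGSGLVR", "SLGPSNLYG", "ATSLGTTNG"]

# dist_matrix[i][j] is the distance between sequences[i] and sequences[j].
dist_matrix = [[levenshtein(s, t) for t in sequences] for s in sequences]

for seq, row in zip(sequences, dist_matrix):
    print(f"{seq:<10}", row)
```

The matrix is symmetric with zeros on the diagonal, and a row of it is what a distance-based method like kNN would consume for the corresponding sequence.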

My function to calculate the Levenshtein distance:

import numpy as np

def levenshtein(seq1, seq2):
    size_x = len(seq1) + 1
    size_y = len(seq2) + 1
    # matrix[x, y] holds the edit distance between seq1[:x] and seq2[:y].
    matrix = np.zeros((size_x, size_y), dtype=int)
    # Against an empty prefix: x deletions down the first column,
    # y insertions along the first row.
    matrix[:, 0] = np.arange(size_x)
    matrix[0, :] = np.arange(size_y)

    for x in range(1, size_x):
        for y in range(1, size_y):
            # A substitution is free when the two characters already match.
            cost = 0 if seq1[x-1] == seq2[y-1] else 1
            matrix[x, y] = min(
                matrix[x-1, y] + 1,        # deletion
                matrix[x-1, y-1] + cost,   # match or substitution
                matrix[x, y-1] + 1         # insertion
            )
    return int(matrix[size_x - 1, size_y - 1])
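One possible way to connect this distance to kNN regression, sketched here under my own assumptions, is to skip the feature-vector step: for a new sequence, compute its Levenshtein distance to every training sequence, pick the k closest, and average their expression values. The function `knn_predict`, the query sequence `AQVPYGVSI`, and the choice `k=2` are hypothetical and only serve the illustration; the `levenshtein` helper is repeated in compact form so the block runs on its own.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b, keeping only two rows in memory."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def knn_predict(query, train_seqs, train_y, k=3):
    """Predict a value for query as the mean target of its k nearest sequences."""
    dists = [levenshtein(query, s) for s in train_seqs]
    # Indices of the k training sequences with the smallest edit distance.
    nearest = sorted(range(len(train_seqs)), key=lambda i: dists[i])[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


# Training data copied from the input table above.
train_seqs = ["AQVPYGVS", "ASVPYGVSI", "STNLYGSGR",
              "NLYGSGLVR", "SLGPSNLYG", "ATSLGTTNG"]
train_y = [0.039267878, 0.039267878, 0.261456561,
           0.265188519, 0.419680588, 0.145710993]

# Hypothetical query sequence, one edit away from each of the first two rows.
prediction = knn_predict("AQVPYGVSI", train_seqs, train_y, k=2)
print(prediction)
```

With this setup no one-hot encoding is needed, since the model only ever sees distances. If scikit-learn is available, `KNeighborsRegressor` with `metric="precomputed"` accepts a distance matrix in the same spirit.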

I want to measure the nonconformity of each instance based on the Levenshtein distances, using the following function:

    def compute(self, z, Z):
        """Return k-Nearest Neighbours (kNN) nonconformity measure.

        Parameters
        ----------
        z : array-like, shape (n_features,)
            Test vector, where n_features is the number of features.
        Z : array-like, shape (n_samples, n_features)
            Training vectors, where n_samples is the number of samples,
            n_features is the number of features.

        Returns
        -------
        r : float
            kNN nonconformity measure on z with respect to Z.
        """
        # Compute the distance from z to every row of Z, then sum the k smallest.
        # Needs numpy as np and scipy.spatial.distance.cdist to be imported.
        dist = cdist(Z, [z])[:,0]
        r = np.sort(dist)[:self.k].sum()

        return r
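As far as I understand, `cdist` expects numeric feature vectors, so raw sequences cannot be passed to it directly. A minimal sketch of the same nonconformity measure computed on strings, with the Levenshtein distance swapped in for `cdist`, could look like the following. The standalone function `knn_nonconformity`, its `k` argument, and the test sequence `ASVPYGVS` are assumptions made for the example, not part of the class above.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b, keeping only two rows in memory."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def knn_nonconformity(z, Z, k=1):
    """Sum of the k smallest Levenshtein distances from sequence z to those in Z."""
    dist = sorted(levenshtein(z, s) for s in Z)
    return sum(dist[:k])


# Training sequences copied from the input table above.
Z = ["AQVPYGVS", "ASVPYGVSI", "STNLYGSGR",
     "NLYGSGLVR", "SLGPSNLYG", "ATSLGTTNG"]

# Hypothetical test sequence; a larger r means z looks less like the training set.
r = knn_nonconformity("ASVPYGVS", Z, k=2)
print(r)
```

Here `r` plays the same role as in `compute`: a sequence close to several training sequences gets a small score, and an unusual one gets a large score.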
machine-learning sequence python alignment • 332 views