Question

Computing Hamming similarity (opposite of distance) between string P and each position in string T

0

Entering edit mode

3.1 years ago

lehuyduc3 • 0

First of all, this is not a question about alignment, but more about string matching/searching, but it might be related.

My problem is in the following image, but I don't know what's its name.

Here P = 'agcga'.

The problem is defined as follows:

Input: string P length m, string T length n (P is a short sequence, N is a long genome)

Algorithm: for each position i in [0..n-m], compute res[i] = sum(P==T[i:i+m])

                   In other word, res[i] = the number of equal characters between P and T[i:i+m]

Output: array res[i]

Does anyone know the name of this problem? Or is it never used anywhere?

If it is used, then what's the current method to calculate res[] ?

Thank you.

alignment scoring programming • 1.5k views

ADD COMMENT • link 3.1 years ago by lehuyduc3 • 0

score 0 · Answer 1 · 2021-03-25

0

Entering edit mode

3.1 years ago

Istvan Albert 100k

What you describe appears to be the Hamming distance subtracted from the length of the string.

In a more pedestrian naming, what you have there is called the "number of matches" in an alignment.

A semi-global aligner with zero penalties for end gaps and not allowing internal gaps will find the maximal score in that res[i] vector (and the position of i)

(since the alignments are gapless other fast options may exists, but fast methods do not need to compute the full res[i]

ADD COMMENT • link 3.1 years ago by Istvan Albert 100k

0

Entering edit mode

Oh, so it doesn't have an unique name :\

So in case I want to compute the min(res[]) (instead of the full res[i]), which method can I use?

ADD REPLY • link 3.1 years ago by lehuyduc3 • 0

0

Entering edit mode

The minimum of that score would be the maximal Hamming distance.

There is a concept called maximal Hamming distance under rotation, that sounds more like what you have with some caveats: