Smith Waterman/String Exact Match With A Single Indel
1
2
Entering edit mode
11.0 years ago
Gabriel R. ★ 2.9k

Imagine that you have two strings of approx. equal length. You can perform an almost exact match on both while allowing for mismatches.

GCGTGAACGCATTTGCCGGTGATCGT
******          * *
GCGTGAGACGCATTTGCCGGTGATCG


Now imagine that you could insert a insertion in either one of them and have an even better match.

GCGTGAACG-CATTTGCCGGTGATCGT
********* *****************
GCGTGAGACGCATTTGCCGGTGATCGT


How would you go about computing this extremely efficiently ? It's not a Smith-Waterman with a restricted search space since SW does not "know" what came before (in terms of operations) in the search upon computation of the next square in the matrix. If I take an insert once, I want it to be forbidden to take another in either string.

edit: I do not want seeding heuristics, I want an exact answer. edit2: I want the global alignment, not a local one

alignment algorithm • 3.3k views
1
Entering edit mode

Can you clarify whether you want a global (end-to-end) alignment between the strings, or a local alignment? Do you want to allow mismatches as well as the single indel?

0
Entering edit mode

done, thanks for the comment !

0
Entering edit mode

What about allowing mismatches as well as the indel? If you require an exact match of the regions flanking the indel I believe there is a straightforward solution, but if these regions can have mismatches then it is more complicated.

0
Entering edit mode

no, you can have mismatches in those.

0
Entering edit mode

I'm missing something .. why not just using a function like the C strcspn ( http://www.cplusplus.com/reference/cstring/strcspn/ ) and compare the two strings a the location of the first mismatch ?

0
Entering edit mode

I'm missing something too, how does strcspn solve this ? It will just return the index of an ACGT (since all are in string2) in the first string which is likely to return almost always 0.

0
Entering edit mode

sorry that was the wrong C function. I was think of a C function that would return the index of the first occurence of difference between two strings. e.g: http://www.cplusplus.com/reference/string/string/find_first_not_of/

0
Entering edit mode

still, that is exact matching. How do you handle gaps ? How do you know where the gap occurs ? You check all positions ? It's more an algorithmic question rather than a programmatic one.

0
Entering edit mode
11.0 years ago
matted 7.8k

So the point is you want to do a Needleman-Wunsch global aligment with AT MOST one gap?

I believe you can do this by applying the ideas of the affine gap penalty extension to Needleman-Wunsch (see random lecture notes here or search for better ones).

The rough idea is that you have two parallel dynamic programming matrices that you fill in. One is calculating alignments without/before the single gap, and the other is calculating alignments with a single gap. At any position in the matrices, you can go from the first to the second with a gap move (corresponding to adding the gap), but never from the second back to the first (this prohibits adding another gap). You'll have to tailor the algorithm to only allow mismatch moves from the first matrix to the first, and the second matrix to the second.

0
Entering edit mode

i do not get it, why would I need a matrix to do a straightforward match ?

0
Entering edit mode

If you want to consider gaps and mismatches together, you can't do it greedily. Dynamic programming is one way to get the optimal answer efficiently. Read more here or here (Googled randomly).

0
Entering edit mode

I know very well NW, actually, in hindsight I should not have written Smith Waterman, but rather Needleman-Wunsch since I seek a global alignment. But I seek a NW where a single "gap/deletion/insertion" is allowed in the entire alignment.

0
Entering edit mode

Yes, I know, that's what my answer above gives you.

0
Entering edit mode

I am slow today due to the heat. What does the first matrix represent ? In a normal NW, it represents the best score for certain indels/mismatches. In my case, I only allow one indel.

0
Entering edit mode

Strictly speaking, the single matrix in NW represents the best alignment score for the given prefixes of the query strings. The way you fill up the matrix (the dynamic programming algorithm) encodes what mismatches or gaps you permit and their relative scores. The path[s] of optimal choices going from cell to cell encodes the optimal alignment.

I explained the two matrices in my answer. The first matrix is computed using only mismatches, using only its own cells. Formally, it represents the best alignment score for string prefixes using NO gaps. The second matrix is computed using gaps from entries in the first matrix or mismatches from entries in the second matrix. Formally, it represents the best alignment score for string prefixes using ONE gap. The property that ensures there is at most one gap is that you start in the first matrix, finish in the second matrix, and can only go from the first to the second exactly once (in the alignment path). The single move from the first to the second matrix is where the one gap is applied.

I am sorry I don't have time to give you fully worked out code, but read more about the three-matrix affine gap score NW method and you will understand this. I promise it will work.