Semi-global sequence alignment: What if there are two equally greatest elements in the last row?
1
1
Entering edit mode
8.1 years ago
eng442 ▴ 10

In semi-global AKA glocal alignment (definitions can be found here and here), the most common implementation is to adapt Needleman-Wunsch's algorithm. One of the modifications is to start the traceback at the greatest element of the last row of the alignment matrix AKA scores matrix (or last column, if there are more rows than columns) instead of starting the traceback at the absolute last element of the matrix. Here is a visual comparison:

But what if there are two equally greatest elements in the last row?

I've encountered such situation when re-implementing the algorithm - my first implementation, when looking for the greatest element, started looking from the beginning of the row, my newer implementation starts from the end. In one particular case the row was like below

0 // I = 0
.
.
.
4962 // I = 2481
4962 // I = 2482


So my first implementation would start the traceback at 2481, whereas my last implementation starts at 2482, giving two different alignments with two different scores. Which one is correct?

Intuitively I'd say using the smaller index is correct, since the whole point of doing a semi-global alignment is to have non penalized end gaps. And indeed in such case it gave the greatest score, but I'm not sure if that would be the case with a different sequence. I couldn't find anything on the subject. Thanks.

alignment sequencing • 7.5k views
0
Entering edit mode

I think generally, this isn't given much thought, but I'd say that even with semi-global, one rule when the scores are equal is that you want to consume as much of the longer sequence as possible. Again, that is only in the case of ties, and does not take precedent over the score.

0
Entering edit mode
8.1 years ago
Ram 37k

Excellent question!

I'd say that if you had equal scores, it would ideally translate to an equal number of matches, mismatches and indels in all paths. If you're using affine gap penalties, the chances for equal score are even lesser than if you're not using affine gap penalties.

Assuming the number of Ms, Xs and I/Ds are different for the paths, IMO the one with lesser gaps in the smaller sequence is the one to go for. Like you say, the purpose of semi global alignment is to seek where the smaller sequence aligns best in the larger sequence.

The best book I know on the subject, "Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids" by Durbin et al does not give much importance to this possibility, as far as I can recall. You might wanna check it out though, just to be sure.