Question

Does Needlman-Wunsch Algorithm Works With Float Values In Similarity Matrix?

3

Entering edit mode

12.1 years ago

Shan ▴ 50

Hi

I want to do global sequence aligment. Referring to the wikipedia article of Needleman-Wunsch for global sequence alignment. I could see the following similarity matrix

            A       G        C        T
   A       10      -1       -3       -4
   G       -1       7       -5       -3
   C       -3      -5        9        0
   T       -4      -3        0        8

Instead of having these values I want to deal with the probability. So all the values will not be integers but real values.

In all of the matrices I had seen e.g BLOSUM50, all have the integer values. My question is that if the similarity matrix will be filled with real values. Would it work the same way.

Another thing is that, is there any special procedure to construct the similarity matrix or it could be any distance measure.

Thanks a lot.

sequence alignment similarity • 3.3k views

ADD COMMENT • link updated 12.1 years ago by Konrad ▴ 80 • written 12.1 years ago by Shan ▴ 50

0

Entering edit mode

Wouldn't you rather want to use a hidden markov model when working with probabilities?

ADD REPLY • link 12.1 years ago by Niek De Klein ★ 2.6k

score 10 · Answer 1 · 2012-03-13

The algorithm does not exclude floating point numbers, but many implementations of the algorithm do (e.g. ggsearch36 in the FASTA package), or the older align.c program in the FASTA2 package use integer matrices, because for a long time, integer arithmetic was much faster.

There are several papers on producing similarity matrices at arbitrary evolutionary distances. The Jones, Taylor, and Thornton paper (Comp. Appl. Biosci 1992 8:275-282, Comp. Appl. Biosci. is now Bioinformatics) is widely cited and their matrices widely used. More recently, Mueller, Spang and Vingron have a series of papers, including Mol. Biol. Evol., 2002 19:8-13.

However, it is not clear to me that different scoring matrices will have much effect on global (Needleman Wunsch) sequence alignments. Since those alignments are global, they must align from beginning to end, and the only difference different scoring matrices will have will depend on the gap penalties. For local alignments, the matrices matter a lot more (see Altschul's classic paper: J. Mol. Biol. (1991) 219:555).

score 2 · Answer 2 · 2012-03-10

2

Entering edit mode

12.1 years ago

Fabian Bull ★ 1.3k

First of all: Yes it is possible. There is absolutly no reason why one should not use float values.

But keep in mind: The scores of the popular substitution-matrizes are computed using probabilities so I do not expect better results.

As stated in the comments: An natural of combining probabilities and alignments are Pair-HMMs. With these you can set your desired probabilities and ask for the optimal alignment (viterbi-algorithm).

ADD COMMENT • link 12.1 years ago by Fabian Bull ★ 1.3k

0

Entering edit mode

@peri4n thanks a lot for the answer... I only have the match probabilities not transition ones... I will dig more into Pair-HMMs, I am a newbee to HMMs... It would be very nice of you if you could please recommend standard libraries and code plus usage examples... Thanks a lot...

ADD REPLY • link 12.1 years ago by Shan ▴ 50

score 0 · Answer 3 · 2012-03-30

Using floating point values with NW is fine; however, using probabilities is not.

If you want to work with probabilities of alignments you need to multiply their values instead of adding. The Needleman-Wunsch algorithm uses addition since it deals with the logarithm of probabilities. This is sound since log(x * y) = log x + log y.

For a good introduction into the statistical motivation of sequence alignment algorithms, see e.g. Biological Sequence Analysis by Durbin & al.

The whole reason why matrices such as BLOSUM use integral values is that they allow vastly faster computation than floating point numbers. Furthermore, these matrices are actually derived from log’ed (and scaled) probabilities.