What you need to do to demonstrate a negative expectation value depends on what your sequences look like.

The reason why you need the negative expectation value is so that you don't, on average, start falsely growing aligned regions by substitution. So the negative scores for mismatches have to be negative *enough* to counteract the 1-in-4 (or whatever) chance of randomly hitting a match.

Let's start with the simplest case, first - all the items in the sequence are completely independent of each other, and they're all equally probable. And let's say the alphabet size is 4, a,c,g, and t. Let's chose a scoring matrix Sa→b = 5 (if a = b) and -4 (if a ≠ b). We can see if the expectation value of this matrix on two random bases - and by extension, by two random strings, if the bases are independent - is negative by simply plugging through the possibilities. W/o loss of generality, say the first character is a. Then there's a 1/4 chance of getting a match score, +5, and a 3/4 chance of getting a mismatch score, -4; so the expectation value is (1/4*(+5) + 3/4*(-4)) = -7/4. If you didn't want to make that assumption about the first base being an a, you could say there was a 1/4 chance of that, a 1/4 chance of a c, etc.., so the expectation value would be 1/4 (-7/4) + 1/4 (-7/4) +.. = 4/4(-7/4) = -7/4. Equivalent is just demonstrating that the sum of each column is negative. That's certainly true here (3*-4 + 1*+5 = -7) and would still be true (if just barely) if the penalty for mismatch was -7/4 (3*(-7/4)+1*(+5) = -1/4).

If the probabilities are unequal, you can see what would happen by going through that last example, but by specifying the individual probabilities of each base instead of 1/4. So for instance, let's say the probability of c or g was 1/6, and that of a or t was 1/3. With the same scoring matrix (+5/-4), then the expected score if one character were an a would be (5*(1/3) + -4*(1+1+2)/6) = -1; and the expected score overall between two random letters would be easily calculated as (1/3,1/6,1/6,1/3) S (1/3,1/6,1/6,1/3)^T, or p^T S p, where p is the vector of individual probabilities. Matrices for which this product is negative for any (non-zero) p are called negative-definite matrices. Note that in this case the requirement for a negative expectation value is more stringent than above; +5/(-7/4) wouldn't work any more (but +5/-2 still would, barely: expectation value -1/18) because of the enhanced probability of a match on an a or t means the penalty has to be harsher for a mismatch.

So if the items are independent, two matrix-vector-multiplications are the worst you have to do to test for a negative expectation value. But if there are correlations between the items, then things get more complicated. Say we have a substitution matrix S=(+5 if match, -2 if mismatch), and we have 4 equal-probability bases *except* that after an a, there's never c or g. So you have two states, with character probabilities of (1/4,1/4,1/4,1/4) or (1/2,0,0,1/2). "It can be shown that" if you generated long runs of the bases yourself using those rules, you'd find that on average, the individual characters pop up with frequencies (1/3,1/6,1/6,1/3).

If you naively plug those frequencies into the equation above, we've seen that it suggests you could get away with a mismatch penalty of -2; but the correlations mess up that simple calculation. It turns out you'd need at least -2.5; certain matches are significantly more likely with the correlation than if the bases were independent. For instance, getting an "aa" or "at" in the middle of a long run has a probability (1/3)*(1/2) = 1/6, not 1/9 as you'd get assuming they were independent, and so you have an enhanced probability of getting a score of +10, pushing up the expectation value.

In the presence of correlations, if they're simple enough you may be able to reason through the probabilities and calculate the expectation value as above. If the correlations are complex, and long-range, you may have to demonstrate the negative expectation value empirically by generating lots of test cases and looking at the statistics; but note that to get good statistical significance you'll have to generate a surprisingly large number of test cases (especially with long-range correlations), and that it's easy to accidentally generate biased samples of those test cases.

Many books tell you how the scoring matrix is estimated. Durbin et al (1998) is one of the many.

32kSmith-Waterman is local aligner I don't think there is a pre-requisite for scoring to have expected valued as negative. Perhaps the substitution matrix needs to be that, is that what you mean? Unless you have specific need you should use standard schemes e.g. BLOSUM-62, which would allow to compare scoring with other people's results. Of course, that may not be what you want! You also have 1 to 4 (mers, I assume) long sequences, perhaps kmer matching and counting is what you are looking for not Smith-Waterman.

10Hi Alex,

I have updated my original post. I usually use the term scoring matrix and substitution matrix synonymously. Yes, a pre-req for SW is that the substitution matrix must have an expected value that is negative. Also, I have a specific application for which a substitution matrix does not exist. Also, the 1-4 mers was an example; my sequences can be much longer.

Thanks for the help.

0