Question: Gap penalty in Smith-Waterman
I am looking for advice on how to calculate the gap opening and gap extension penalties (the affine gap model) used in dynamic programming approaches to sequence alignment. I understand that substitution matrices are simply log-odds (LOD) scores, but I always see somewhat hand-wavy justifications of gap penalties.

As an aside, is there a reason why there seems to be far less literature on substitution matrices for DNA than for proteins?



Interesting question on a subject I never actually thought about. 

I would say the lack of results on substitution matrices for DNA comes down to there being so few options: each base has just three alternatives, and the score for each will depend on the context in which it is used. In addition, noncoding DNA shows a lot less conservation and has far less well-defined function than protein-coding regions, so it is hard to come up with a general rule.
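To make the "three alternatives" point concrete, here is a minimal sketch of how a log-odds score for a pair of bases could be computed. The function name `log_odds_score` and all frequencies are hypothetical, chosen only so the arithmetic is easy to follow; they are not taken from any published matrix. The example assumes a uniform background (0.25 per base) and splits the three alternatives into one transition and two transversions.

```python
import math

def log_odds_score(p_ab, q_a, q_b):
    """Log-odds score in bits: log2 of the frequency with which bases a and b
    are seen aligned, divided by the frequency expected by chance (q_a * q_b)."""
    return math.log2(p_ab / (q_a * q_b))

# Hypothetical, made-up frequencies (assumption, not real data).
q = 0.25                 # uniform background frequency for each base
p_match = 0.125          # e.g. A aligned with A
p_transition = 0.0625    # e.g. A aligned with G
p_transversion = 0.03125 # e.g. A aligned with C or with T

scores = {
    "match": log_odds_score(p_match, q, q),
    "transition": log_odds_score(p_transition, q, q),
    "transversion": log_odds_score(p_transversion, q, q),
}
print(scores)
```

With these invented numbers a match scores positive, a transition scores zero, and a transversion scores negative. Real values would have to be estimated from trusted alignments of the kind of sequence being compared, which is exactly where the context dependence comes in.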

As for gaps: the information in a mismatch is easy to capture and formalize, whereas a gap's role will depend on what is being replaced, how long the gaps are, etc.
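For reference, the affine model mentioned in the question handles gap length as follows. This is only a sketch under one common convention (the first gapped position costs the opening penalty, each further position costs the extension penalty); some tools instead charge the opening penalty plus the extension penalty for every position. The function name `affine_gap_penalty` and the default values 10 and 1 are hypothetical placeholders, not recommended settings.

```python
def affine_gap_penalty(length, gap_open=10.0, gap_extend=1.0):
    """Penalty for a single gap spanning `length` positions.

    Convention assumed here: the first position costs `gap_open`,
    every additional position costs `gap_extend`.
    """
    if length < 0:
        raise ValueError("gap length must be non-negative")
    if length == 0:
        return 0.0
    return gap_open + (length - 1) * gap_extend

# One gap of five positions versus five separate single-position gaps.
one_long_gap = affine_gap_penalty(5)
five_short_gaps = 5 * affine_gap_penalty(1)
print(one_long_gap, five_short_gaps)
```

Under these placeholder values a single long gap is penalized much less than the same number of gapped positions scattered across separate gaps. That captures the length dependence, but not what is being replaced.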

So what you are saying is that it is necessary to calculate the gap penalty for a given base match and take the context of the replaced base into account? Do you have a sense of how people generally come up with substitution matrices for Smith-Waterman DNA local alignment experiments? It seems to be mostly a qualitative choice. Is this a fair characterisation?

Scoring is a measure of similarity: it is used to compare sequences and serves as a metric. For that to work properly it has to actually be able to quantify the differences. And when it comes to DNA alone there is simply not enough information. It is a bit like trying to infer someone's height from their shoe size. It works for the extreme cases, a baby vs Shaq, but it does not contain enough information to properly characterize a person of average height.

