Question

Gap penalty in Smith-Waterman

1

9.5 years ago

gorilla.in.a.tie ▴ 20

Hi all,

I am looking for advice on how to calculate the gap-opening and affine gap-extension penalties used in dynamic programming approaches to sequence alignment. I understand that substitution matrices are simply log-odds (LOD) scores, but I always see somewhat hand-wavy justifications for gap penalties.
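For readers less familiar with the log-odds idea mentioned above, here is a minimal sketch. The function name `log_odds_score`, the `scale` parameter and the frequencies are all hypothetical and chosen only so the arithmetic is easy to follow; real matrices are estimated from curated alignments.

```python
import math

def log_odds_score(q_ab, p_a, p_b, scale=1.0):
    """Log-odds score for aligning residues a and b.

    q_ab : how often the pair (a, b) is seen in trusted alignments
    p_a, p_b : background frequencies of a and b
    scale : arbitrary scaling factor (an assumption of this sketch)
    """
    return scale * math.log2(q_ab / (p_a * p_b))

# Hypothetical numbers: a pair seen twice as often as chance scores +1 bit,
# a pair seen half as often as chance scores -1 bit.
print(log_odds_score(0.125, 0.25, 0.25))    # 1.0
print(log_odds_score(0.03125, 0.25, 0.25))  # -1.0
```

Gap penalties have no equally simple frequency-ratio definition, which is part of why their justifications tend to look less rigorous.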

As an aside, why does there seem to be far less literature on substitution matrices for DNA than for proteins?

Cheers

sequence-alignment • 3.7k views

updated 2.2 years ago by Ram 43k • written 9.5 years ago by gorilla.in.a.tie ▴ 20

Ram · Answer 1 · 2014-11-04

1

9.5 years ago

Istvan Albert 100k

Interesting question on a subject I never actually thought about.

I would say the cause of the lack of results on substitution matrices for DNA is that there are so few options: each base has just three alternatives, and the right score for each will depend on the context in which it is used. In addition, noncoding DNA is much less conserved and has much less well-defined function than protein-coding regions, so it is hard to come up with a general rule.
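To make the "just three alternatives" point concrete, here is a small sketch of a DNA scoring scheme. It assumes the usual split of the three possible substitutions into one transition (purine to purine, or pyrimidine to pyrimidine) and two transversions. The names `dna_score` and `dna_matrix` and the score values are hypothetical placeholders, not recommended settings.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def dna_score(x, y, match=1, transition=-1, transversion=-2):
    """Score one aligned base pair; all values are illustrative only."""
    if x == y:
        return match
    pair = {x, y}
    # A transition stays within the same chemical class of base.
    same_class = pair <= PURINES or pair <= PYRIMIDINES
    return transition if same_class else transversion

def dna_matrix(**scores):
    """Build the full 4x4 matrix as a dict keyed by (base, base)."""
    return {(x, y): dna_score(x, y, **scores) for x in "ACGT" for y in "ACGT"}

matrix = dna_matrix()
print(matrix[("A", "G")])  # transition
print(matrix[("A", "C")])  # transversion
```

With only three free parameters at most, there is little to estimate compared with a 20 x 20 protein matrix, and the values that work best will vary with the kind of sequence being compared.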

As for gaps: the information in a mismatch is easy to capture and formalize, whereas a gap's role will depend on what is being replaced, how long the gap is, and so on.
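The dependence on gap length is what an affine penalty tries to model. Below is a minimal score-only sketch of Smith-Waterman with affine gaps (the three-matrix recurrence usually attributed to Gotoh). It assumes the convention that the first gap position costs `gap_open` and each further position costs `gap_extend`; some tools instead charge an opening cost plus an extension cost for every position. All parameter values are illustrative, not recommendations.

```python
def smith_waterman_affine(a, b, match=2, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Best local alignment score of a and b with affine gap penalties.

    A gap of length k costs gap_open + (k - 1) * gap_extend in this sketch.
    """
    neg_inf = float("-inf")
    n, m = len(a), len(b)
    # H: best score of an alignment ending at (i, j)
    # E: best score ending at (i, j) with a gap in a (b[j-1] unpaired)
    # F: best score ending at (i, j) with a gap in b (a[i-1] unpaired)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    F = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Either open a new gap or extend an existing one.
            E[i][j] = max(H[i][j - 1] + gap_open, E[i][j - 1] + gap_extend)
            F[i][j] = max(H[i - 1][j] + gap_open, F[i - 1][j] + gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets a local alignment restart anywhere.
            H[i][j] = max(0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best

print(smith_waterman_affine("ACGT", "ACGT"))            # 8
print(smith_waterman_affine("AAAATTTT", "AAAACCTTTT"))  # 12
```

In the second call, the best alignment pays for one two-base gap (-3 - 1 = -4) so that all eight bases can match (16 - 4 = 12). That beats the ungapped alternative with two mismatches, which scores 10.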

updated 2.2 years ago by Ram 43k • written 9.5 years ago by Istvan Albert 100k

0

So what you are saying is that it is necessary to calculate the gap penalty for a given base match, using the context of the replaced base? Do you have a sense of how people generally come up with the substitution matrices for Smith-Waterman DNA local alignment experiments? It seems to be mostly a qualitative choice. Is this a fair characterisation?

9.5 years ago by gorilla.in.a.tie ▴ 20

0

Scoring is a measure of similarity: it is used to compare sequences and serves as a metric. For that to work properly, it has to be able to quantify the differences. When it comes to DNA alone, there is just not enough information. It is a bit like trying to infer someone's height from their shoe size: it works for the extreme cases (a baby vs Shaq), but it does not contain enough information to properly characterize an average-height person.

9.5 years ago by Istvan Albert 100k