Question: Substitution matrices to score variation between protein sequences?
nchuang (United States) wrote, 4.2 years ago:

I am trying to understand substitution matrices. It seems like they are a scoring scheme for alignments, particularly if you are looking for homology?

I am trying to see if it would be applicable if I am looking at mutations between proteins from different people. Since my sequences are very similar with only 1 or 2 mutations between them, the substitution matrix would probably not be applicable here? I am assuming if there is a nonsynonymous mutation between two sequences it would give me a score (say BLOSUM62) based on how likely that substitution would occur in nature? Are there other ways to interpret these scoring matrices?

Steven Lakin (Fort Collins, CO, USA) wrote, 4.2 years ago:

Before we go into your question: given that your proteins are so similar, it may be best and most concise to simply report the exact SNP sites and leave it at that. However, here are the differences between PAM and BLOSUM:

BLOSUM (BLOcks SUbstitution Matrix) matrices were derived from alignments of highly conserved protein domains at different levels of evolutionary divergence, by counting how frequently one amino acid was substituted for another. They are described in the paper by Henikoff and Henikoff (1992). They are based on local alignments of conserved protein regions.
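To make the "how frequently one amino acid was substituted" part concrete, here is a minimal sketch of the log-odds calculation behind a BLOSUM-style score: the observed frequency of a residue pair is compared with the frequency expected by chance, and the log ratio is expressed in half-bit units (the scaling used for BLOSUM62). The function name and all of the frequencies are made up for illustration; they are not taken from the BLOCKS database.

```python
import math


def half_bit_score(observed_pair_freq, freq_a, freq_b, same_residue):
    """Log-odds score in half-bit units, rounded to the nearest integer.

    observed_pair_freq: how often the pair is seen in aligned columns
    freq_a, freq_b:     background frequencies of the two residues
    same_residue:       True for an identity pair such as A/A
    """
    # Frequency of the pair expected by chance. An unordered pair of two
    # different residues can arise two ways (a,b or b,a), hence the factor 2.
    expected = freq_a * freq_b if same_residue else 2 * freq_a * freq_b
    return round(2 * math.log2(observed_pair_freq / expected))


# Hypothetical numbers, chosen only to give round results.
print(half_bit_score(0.02, 0.1, 0.1, same_residue=True))    # 2: seen more often than chance
print(half_bit_score(0.005, 0.1, 0.1, same_residue=False))  # -4: seen less often than chance
```

A positive score means the pair is observed more often than chance would predict, a negative score means it is observed less often.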

PAM (Point Accepted Mutation) matrices were first described by Margaret Dayhoff (who was a fantastic scientist, even in the face of the challenges of her role given the time period). "Each entry in a PAM matrix indicates the likelihood of the amino acid of that row being replaced with the amino acid of that column through a series of one or more point accepted mutations during a specified evolutionary interval, rather than these two amino acids being aligned due to chance." They are based on global alignment.
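The "series of one or more point accepted mutations" idea can be sketched as repeated matrix multiplication: the mutation probabilities for a PAM-n interval are obtained by raising the PAM1 probability matrix to the n-th power. The example below is only a toy, assuming a two-letter alphabet with a 1% chance of change per step; the real matrices are 20x20 and are converted to log-odds scores afterwards. The names `mat_mul`, `mat_pow` and `TOY_PAM1` are my own.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, steps):
    """Apply the one-step mutation matrix `steps` times."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(steps):
        result = mat_mul(result, m)
    return result


# Toy stand-in for PAM1: each residue has a 1% chance of being replaced.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam30 = mat_pow(TOY_PAM1, 30)
toy_pam250 = mat_pow(TOY_PAM1, 250)

# The chance that a residue is still unchanged drops as the interval grows.
print(toy_pam30[0][0], toy_pam250[0][0])
```

This is why a higher PAM number corresponds to a longer evolutionary interval and more divergent sequences.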

In short, this is what matters about the differences between the two:

  1. PAM matrices are typically used on more closely related proteins (such as your case), whereas BLOSUM matrices are typically used on more evolutionarily divergent proteins.
  2. The greater the PAM number the more DISTANT the sequences being compared should be; the greater the BLOSUM number, the more SIMILAR the sequences being compared should be.

So for your application, if you were to use these, you should use either a LOW PAM number or a HIGH BLOSUM number. Whether this is appropriate depends on what you want to get out of it (e.g. the difference across the whole protein or just within local protein domains). You're right that these matrices are typically used for alignment scoring, but they can also be used to generate an evolutionary cost distance. However, there may be better methods for your purpose if you look into methods for building distance trees based on some metric.
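For sequences that differ at only one or two positions, using a matrix as a per-site cost can be sketched as below: walk two equal-length, already aligned sequences and look up the score for each position that differs. The five scores are BLOSUM62 entries copied in by hand to keep the example self-contained; in practice you would load the full matrix from a library. The function names and the two peptide sequences are hypothetical.

```python
# A handful of BLOSUM62 entries, keyed by the alphabetically sorted pair.
# This is a partial table for illustration, not the full matrix.
BLOSUM62_SUBSET = {
    ("D", "E"): 2,
    ("I", "L"): 2,
    ("K", "R"): 2,
    ("G", "W"): -2,
    ("D", "L"): -4,
}


def substitution_score(a, b):
    """Look up a pair regardless of order; raises KeyError if not in the subset."""
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]


def score_differences(ref, alt):
    """Return (position, ref residue, alt residue, score) for each mismatch.

    Assumes the sequences are already aligned without gaps; positions are 1-based.
    """
    if len(ref) != len(alt):
        raise ValueError("sequences must have the same length")
    return [(pos, r, a, substitution_score(r, a))
            for pos, (r, a) in enumerate(zip(ref, alt), start=1)
            if r != a]


# Hypothetical peptides differing at two positions.
print(score_differences("MKDGLI", "MRDWLI"))
# [(2, 'K', 'R', 2), (4, 'G', 'W', -2)]
```

Here K to R scores as a conservative change and G to W as an unfavourable one. The score reflects how often the substitution is seen in conserved alignments, so it is at best a rough proxy and not a direct measure of functional impact.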


Fantastic answer !!

(reply by Khader Shameer, 4.2 years ago)

Wow, this really clears it up. I read the intro to Biological Sequence Analysis by Durbin and understood it, but didn't know how it was applied.

I am trying to figure out whether these SNPs affect function, and was hoping a substitution matrix score might offer some surrogate value.

(reply by nchuang, 4.2 years ago)