Question: Nucleotide substitution matrix origin
4 months ago by
laura_savana20 wrote:

Hi, I'm new to Biostars. I think this forum is very interesting and useful, and I'm sure I can find the answer to my question here.

My question concerns the most commonly used nucleotide substitution matrix: +5/-4. I found that these values are probabilities rounded to the nearest integer and that Todd Lowe created this substitution matrix, but I need to read the article (if one exists) that explains how the matrix was built. Does anyone know whether such an article exists, or whether there is documentation on how the +5/-4 substitution matrix was built? If not, does anyone know the steps by which the matrix was built?

Thank you so much!

alignment • 229 views
modified 4 months ago by pevsner420 • written 4 months ago by laura_savana20

I am not sure, but I don't think there is a paper documenting this matrix, and if there is online documentation, it is really difficult to find. This BioStars thread (Nucleotide Substitution Matrix With Iupac Nucleotide Ambiguity Codes) mentions that it was derived from an alignment of distantly related sequences obtained with the FastA software. EDNAFULL is also mentioned on this EMBOSS mailing list thread, but with no details on how it was built.

written 4 months ago by h.mon23k

Thank you so much h.mon.

In my research I found the EMBOSS mailing list thread and also files on GitHub that show the same output (commented lines of code giving the first and last name of the substitution matrix's creator, along with the matrix itself). Maybe I can find a lead in the BioStars thread that you pointed me to.

I have been trying to find this information for 7 months, without success.
Is it really impossible to find out how the nucleotide substitution matrix was built?
I don't understand... I can only find explanations of how amino acid substitution matrices are built.

Thank you again.

written 4 months ago by laura_savana20
4 months ago by
Kennedy Krieger Institute (Baltimore, MD)
pevsner420 wrote:

Hello, here's some information about the origin of +5/-4.

Please see the section "BLASTN reward/penalty values" in the BLAST® Command Line Applications User Manual. That page offers Table D1, which lists the +5 for matches, -4 for mismatches scoring scheme as well as 11 other match/mismatch options used for BLASTN and MegaBLAST. (Table C2 also mentions some match/mismatch options for BLASTN, e.g. +1/-2 or +1/-3 scores.) The page states: "BLASTN uses a simple approach to score alignments, with identically matching bases assigned a reward and mismatching bases assigned a penalty. It is important to choose reward/penalty values appropriate to the sequences being aligned with the (absolute) reward/penalty ratio increasing for more divergent sequences. A ratio of 0.33 (1/-3) is appropriate for sequences that are about 99% conserved; a ratio of 0.5 (1/-2) is best for sequences that are 95% conserved; a ratio of about one (1/-1) is best for sequences that are 75% conserved [2]."
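To make the reward/penalty idea concrete, here is a minimal Python sketch of that scoring approach for two equal-length sequences with no gaps. The function name `ungapped_score` and the example sequences are my own illustration, not anything from BLAST itself, and real BLASTN also handles gaps and local extension, which this ignores.

```python
def ungapped_score(query: str, subject: str, reward: int = 5, penalty: int = -4) -> int:
    """Add `reward` for each identical position and `penalty` for each mismatch.

    Hypothetical helper for illustration only: no gaps, no ambiguity codes.
    """
    if len(query) != len(subject):
        raise ValueError("sequences must have the same length for an ungapped comparison")
    return sum(
        reward if q == s else penalty
        for q, s in zip(query.upper(), subject.upper())
    )


# Made-up example pair: 9 identical positions and 1 mismatch.
pair = ("ACGTACGTAC", "ACGTTCGTAC")

for reward, penalty in [(5, -4), (1, -2), (1, -3), (1, -1)]:
    score = ungapped_score(*pair, reward, penalty)
    ratio = reward / abs(penalty)
    print(f"+{reward}/{penalty}: score = {score}, reward/penalty ratio = {ratio:.2f}")
```

The same alignment gets a different score under each scheme; what changes between schemes is how heavily a single mismatch counts against the matches around it.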

The citation [2] is to States, D.J., Gish, W. & Altschul, S.F. (1991) "Improved sensitivity of nucleic acid database searches using application-specific scoring matrices." Methods 3:66-70. I don't see that article in PubMed, but it is available online. The States et al. paper mentions the BLASTN scoring scheme (+5 for matches, -4 for mismatches), citing the original BLAST paper:

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990 Oct 5;215(3):403-10. PubMed PMID: 2231712. That paper is available online, and it mentions the +5/-4 scores only briefly.

NCBI also briefly offers some information about DNA substitution matrices, where they write: "Sometimes, however, one may wish to compare non-coding DNA sequences, at which point the same log-odds approach as before applies. An evolutionary model in which all nucleotides are equally common and all substitution mutations are equally likely yields different scores only for matches and mismatches [32]. A more complex model, in which transitions are more likely than transversions, yields different "mismatch" scores for transitions and transversions [32]. The best scores to use will depend upon whether one is seeking relatively diverged or closely related sequences [32]." That reference is again to States et al. (1991).
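As a rough illustration of the log-odds approach for the simplest model, here is a small Python sketch. It rests on my own assumptions, not on anything stated in the quoted text: each base has background frequency 0.25, a target fraction `identity` of aligned positions is split evenly over the 4 matching pairs, and the rest is split evenly over the 12 mismatching pairs. The function names are hypothetical, and the scores are left unscaled in natural-log units rather than rounded to integers.

```python
import math
from typing import Tuple


def uniform_log_odds(identity: float) -> Tuple[float, float]:
    """Return (match_score, mismatch_score) as natural-log odds ratios.

    Assumes a uniform model: background 0.25 per base, identity split over
    4 match pairs, (1 - identity) split over 12 mismatch pairs.
    """
    if not 0.25 < identity < 1.0:
        raise ValueError("identity must be strictly between 0.25 and 1.0")
    background = 0.25 * 0.25              # chance of any given ordered base pair
    match_target = identity / 4           # target frequency of one matching pair
    mismatch_target = (1 - identity) / 12  # target frequency of one mismatching pair
    return (
        math.log(match_target / background),
        math.log(mismatch_target / background),
    )


def reward_penalty_ratio(identity: float) -> float:
    """Absolute match/mismatch score ratio implied by the model above."""
    match, mismatch = uniform_log_odds(identity)
    return match / abs(mismatch)


# 0.65 is included to compare against the 1.25 ratio of +5/-4.
for identity in (0.99, 0.95, 0.75, 0.65):
    match, mismatch = uniform_log_odds(identity)
    print(f"{identity:.0%} identity: match = {match:.3f}, "
          f"mismatch = {mismatch:.3f}, ratio = {reward_penalty_ratio(identity):.2f}")
```

Under these assumptions the ratio rises as the target identity falls, which is the trend the BLAST manual describes. Integer schemes such as +1/-3 or +5/-4 can then be read as scaled, rounded versions of such log-odds scores.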

written 4 months ago by pevsner420

Thank you so much!!!

written 4 months ago by laura_savana20