4 months ago by
Kennedy Krieger Institute (Baltimore, MD)
Hello, here's some information about the origin of +5/-4.
Please see the section "BLASTN reward/penalty values" in the BLAST® Command Line Applications User Manual at https://www.ncbi.nlm.nih.gov/books/NBK279684/. That page offers Table D1 with the +5 for matches, -4 for mismatches scoring scheme as well as 11 other match/mismatch options used for BLASTN and MegaBLAST. (Table C2 also mentions some match/mismatch options for BLASTN, e.g. +1/-2 or +1/-3 scores.) The page states: "BLASTN uses a simple approach to score alignments, with identically matching bases assigned a reward and mismatching bases assigned a penalty. It is important to choose reward/penalty values appropriate to the sequences being aligned with the (absolute) reward/penalty ratio increasing for more divergent sequences. A ratio of 0.33 (1/-3) is appropriate for sequences that are about 99% conserved; a ratio of 0.5 (1/-2) is best for sequences that are 95% conserved; a ratio of about one (1/-1) is best for sequences that are 75% conserved."
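To make the quoted description concrete, here is a minimal Python sketch of reward/penalty scoring for an ungapped pair of equal-length sequences, together with the absolute reward/penalty ratio. The function names and the toy sequences are made up for illustration; this is not BLAST's code, and gap costs are ignored entirely.

```python
def ungapped_score(a, b, reward, penalty):
    """Add `reward` for each matching column and `penalty` (a negative number) for each mismatch."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(reward if x == y else penalty for x, y in zip(a, b))


def reward_penalty_ratio(reward, penalty):
    """Absolute reward/penalty ratio, e.g. 1/-3 gives about 0.33."""
    return abs(reward / penalty)


# Toy sequences (hypothetical): 9 matching columns and 1 mismatch.
query = "ACGTACGTAC"
subject = "ACGTTCGTAC"

for reward, penalty in [(5, -4), (1, -1), (1, -2), (1, -3)]:
    score = ungapped_score(query, subject, reward, penalty)
    ratio = reward_penalty_ratio(reward, penalty)
    print(f"+{reward}/{penalty}: score={score}, ratio={ratio:.2f}")
```

With the same alignment, schemes with a smaller ratio punish the single mismatch more heavily relative to the matches, which fits the manual's advice to use small ratios for closely related sequences.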
The citation in that passage is to States, D.J., Gish, W. & Altschul, S.F. (1991) "Improved sensitivity of nucleic acid database searches using application-specific scoring matrices." Methods 3:66-70. I don't see that article in PubMed, but it is online here: https://www.sciencedirect.com/science/article/pii/S1046202305801653
That States et al. paper mentions the BLASTN scoring scheme (+5 for matches, -4 for mismatches), citing the original BLAST paper:
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990 Oct 5;215(3):403-10. PubMed PMID: 2231712. That's available via https://www.ncbi.nlm.nih.gov/pubmed/?term=2231712, and it just mentions the +5/-4 scores briefly.
NCBI also briefly offers some information about DNA substitution matrices here:
https://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html#head10 where they write: "Sometimes, however, one may wish to compare non-coding DNA sequences, at which point the same log-odds approach as before applies. An evolutionary model in which all nucleotides are equally common and all substitution mutations are equally likely yields different scores only for matches and mismatches. A more complex model, in which transitions are more likely than transversions, yields different "mismatch" scores for transitions and transversions. The best scores to use will depend upon whether one is seeking relatively diverged or closely related sequences." That reference is again to States et al. (1991).
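As a rough check on how the log-odds approach connects to the ratios quoted from the BLAST manual, here is a small Python sketch of the simple model (all four nucleotides equally common, all substitutions equally likely). It is my own reading of that model, not code from NCBI or from States et al.; the function names are hypothetical, and I am treating "percent conserved" as the target fraction of identical aligned positions. The scale factor that turns log-odds into integer scores cancels in the ratio, so it is left out.

```python
import math


def log_odds_scores(identity):
    """Match and mismatch log-odds scores (natural log units) for a target
    fraction of identical positions, assuming uniform base frequencies and
    equally likely substitutions."""
    if not 0.25 < identity < 1.0:
        raise ValueError("identity must be between 0.25 and 1.0 (exclusive)")
    # Each of the 4 identical pairs has target frequency identity/4,
    # against a background frequency of 1/16.
    match = math.log(4.0 * identity)
    # Each of the 12 mismatched pairs has target frequency (1 - identity)/12.
    mismatch = math.log(4.0 * (1.0 - identity) / 3.0)
    return match, mismatch


def implied_ratio(identity):
    """Absolute match/mismatch score ratio implied by the target identity."""
    match, mismatch = log_odds_scores(identity)
    return abs(match / mismatch)


for identity in (0.99, 0.95, 0.75):
    print(f"{identity:.0%} conserved -> ratio {implied_ratio(identity):.2f}")
```

Under these assumptions the implied ratios land close to the manual's 0.33, 0.5, and 1 for 99%, 95%, and 75% conservation, so a larger ratio such as +5/-4 (1.25) would be tuned for more divergent sequences.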