Hello, here's some information about the origin of +5/-4.
Please see the section "BLASTN reward/penalty values" in the BLAST® Command Line Applications User Manual at https://www.ncbi.nlm.nih.gov/books/NBK279684/. That page includes Table D1, which lists the +5/-4 (match/mismatch) scoring scheme along with 11 other match/mismatch options used for BLASTN and MegaBLAST. (Table C2 also mentions some match/mismatch options for BLASTN, e.g. +1/-2 or +1/-3 scores.) The page states: "BLASTN uses a simple approach to score alignments, with identically matching bases assigned a reward and mismatching bases assigned a penalty. It is important to choose reward/penalty values appropriate to the sequences being aligned with the (absolute) reward/penalty ratio increasing for more divergent sequences. A ratio of 0.33 (1/-3) is appropriate for sequences that are about 99% conserved; a ratio of 0.5 (1/-2) is best for sequences that are 95% conserved; a ratio of about one (1/-1) is best for sequences that are 75% conserved [2]."
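To make the quoted ratios a bit more concrete, here is a minimal sketch of the usual log-odds calculation. It assumes the simplest model (every base has background frequency 1/4 and all mismatches are equally likely); the function names are my own and are not part of BLAST.

```python
import math


def match_mismatch_scores(identity):
    """Log-odds scores (in nats) for a uniform nucleotide model.

    Assumption: each base has background frequency 1/4, an aligned pair is
    identical with probability `identity`, and the remaining probability is
    spread evenly over the 12 ordered mismatch pairs.
    """
    if not 0.25 < identity < 1.0:
        raise ValueError("identity must be between 0.25 and 1 (exclusive)")
    background = 0.25 * 0.25               # chance of any ordered pair under independence
    match_target = identity / 4            # target frequency of one specific match pair
    mismatch_target = (1 - identity) / 12  # target frequency of one specific mismatch pair
    reward = math.log(match_target / background)
    penalty = math.log(mismatch_target / background)
    return reward, penalty


def reward_penalty_ratio(identity):
    """Absolute reward/penalty ratio for a given fraction of conserved positions."""
    reward, penalty = match_mismatch_scores(identity)
    return reward / abs(penalty)


for identity in (0.99, 0.95, 0.75):
    print(f"{identity:.0%} conserved: ratio = {reward_penalty_ratio(identity):.2f}")
```

Under these assumptions the ratios come out at roughly 0.32, 0.49 and 1.00, which is close to the 0.33, 0.5 and 1 given in the manual. Integer schemes such as +1/-3 or +5/-4 are then just rounded, rescaled versions of such log-odds values.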
The citation [2] is to States, D.J., Gish, W. & Altschul, S.F. (1991) "Improved sensitivity of nucleic acid database searches using application-specific scoring matrices." Methods 3:66-70. I don't see that article in PubMed, but it is online here: https://www.sciencedirect.com/science/article/pii/S1046202305801653
That States et al. paper mentions the BLASTN scoring scheme (+5 for matches, -4 for mismatches), citing the original BLAST paper:
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990 Oct 5;215(3):403-10. PubMed PMID: 2231712. That's available via https://www.ncbi.nlm.nih.gov/pubmed/?term=2231712, and it just mentions the +5/-4 scores briefly.
NCBI also briefly offers some information about DNA substitution matrices here:
https://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html#head10 where they write: "Sometimes, however, one may wish to compare non-coding DNA sequences, at which point the same log-odds approach as before applies. An evolutionary model in which all nucleotides are equally common and all substitution mutations are equally likely yields different scores only for matches and mismatches [32]. A more complex model, in which transitions are more likely than transversions, yields different "mismatch" scores for transitions and transversions [32]. The best scores to use will depend upon whether one is seeking relatively diverged or closely related sequences [32]." That reference is again to States et al. (1991).
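The two models described in that passage can be sketched in a few lines. This is only an illustration of the log-odds idea, not the procedure used to build any published matrix; `transition_fraction` is a hypothetical parameter I introduce here for the share of mismatches that are transitions (A<->G, C<->T), and background frequencies are assumed to be 1/4 for every base.

```python
import math

BASES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def log_odds_matrix(identity, transition_fraction=1 / 3):
    """Return a dict {(a, b): score in nats} for a simple substitution model.

    identity: probability that an aligned pair of bases is identical.
    transition_fraction: share of mismatches that are transitions
        (hypothetical parameter; 1/3 reproduces the uniform model).
    """
    if not 0.0 < identity < 1.0:
        raise ValueError("identity must be strictly between 0 and 1")
    if not 0.0 < transition_fraction < 1.0:
        raise ValueError("transition_fraction must be strictly between 0 and 1")
    p_transition = (1 - identity) * transition_fraction
    p_transversion = (1 - identity) * (1 - transition_fraction)
    background = 0.25 * 0.25
    matrix = {}
    for a in BASES:
        for b in BASES:
            if a == b:
                target = identity / 4
            elif (a, b) in TRANSITIONS:
                target = p_transition / 4    # each base has one transition partner
            else:
                target = p_transversion / 8  # each base has two transversion partners
            matrix[a, b] = math.log(target / background)
    return matrix


# Example: 90% identity, two thirds of the mismatches are transitions.
scores = log_odds_matrix(0.90, transition_fraction=2 / 3)
for a in BASES:
    print(a, " ".join(f"{scores[a, b]:6.2f}" for b in BASES))
```

With `transition_fraction=1/3` all mismatches get the same score, so the matrix reduces to a single match/mismatch pair. With a larger value, transitions are penalized less than transversions, which is the "more complex model" the tutorial refers to.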
I am not sure, but I don't think there is a paper documenting this matrix, and if there is online documentation, it is really difficult to find. A BioStars thread (Nucleotide Substitution Matrix With Iupac Nucleotide Ambiguity Codes) mentions that it was derived from an alignment of distantly related sequences obtained with the FastA software. EDNAFULL is also mentioned in this EMBOSS mailing list thread, but with no details on how it was built:
http://lists.open-bio.org/pipermail/emboss/2005-December/006889.html
Thank you so much, h.mon.
During my research I found the EMBOSS mailing list thread, and also files on GitHub showing the same output (commented lines giving the first and last name of the matrix's creator, followed by the matrix itself). Maybe I can find some guidance in the BioStars thread you pointed me to.
I have been trying to find this information for 7 months, without success.
Is it really impossible to find out how the nucleotide substitution matrix was built?
I don't understand... I can only find explanations of how amino acid substitution matrices are built.
Thank you again.