I am looking at SRR015016 from the SRA.
I am trying to understand the encoding of the base quality used in this file.
The instrument model was Illumina Genome Analyzer II. However, the quality scheme is somewhat peculiar.
I ran usearch -fastq_chars to see the distribution of quality characters:
Char  ASCII  Q(33)  Q(64)       Tails       Total     Freq   AccFrq
----  -----  -----  -----  ----------  ----------  -------  -------
 '!'     33      0    -31           0        2906    0.01%    0.01%
 '"'     34      1    -30           0           0    0.00%    0.01%
 '#'     35      2    -29           0           0    0.00%    0.01%
 '$'     36      3    -28           0           0    0.00%    0.01%
 '%'     37      4    -27           0           0    0.00%    0.01%
 '&'     38      5    -26           0           0    0.00%    0.01%
 '''     39      6    -25           0           0    0.00%    0.01%
 '('     40      7    -24           0           0    0.00%    0.01%
 ')'     41      8    -23           0           0    0.00%    0.01%
 '*'     42      9    -22           0           0    0.00%    0.01%
 '+'     43     10    -21           0           0    0.00%    0.01%
 ','     44     11    -20           0           0    0.00%    0.01%
 '-'     45     12    -19           0           0    0.00%    0.01%
 '.'     46     13    -18           0           0    0.00%    0.01%
 '/'     47     14    -17           0           0    0.00%    0.01%
 '0'     48     15    -16           0           0    0.00%    0.01%
 '1'     49     16    -15           0           0    0.00%    0.01%
 '2'     50     17    -14           0           0    0.00%    0.01%
 '3'     51     18    -13           0           0    0.00%    0.01%
 '4'     52     19    -12           0           0    0.00%    0.01%
 '5'     53     20    -11           0           0    0.00%    0.01%
 '6'     54     21    -10           0           0    0.00%    0.01%
 '7'     55     22     -9           0           0    0.00%    0.01%
 '8'     56     23     -8           0           8    0.00%    0.01%
 '9'     57     24     -7           0           0    0.00%    0.01%
 ':'     58     25     -6           0         745    0.00%    0.01%
 ';'     59     26     -5           0           0    0.00%    0.01%
 '<'     60     27     -4           0           0    0.00%    0.01%
 '='     61     28     -3           0         391    0.00%    0.01%
 '>'     62     29     -2           0           0    0.00%    0.01%
 '?'     63     30     -1           1          15    0.00%    0.01%
 '@'     64     31      0           3        2928    0.01%    0.02%
 'A'     65     32      1           0        2980    0.01%    0.04%
 'B'     66     33      2           0           0    0.00%    0.04%
 'C'     67     34      3         144       37529    0.13%    0.17%
 'D'     68     35      4        3596      351835    1.24%    1.41%
 'E'     69     36      5        1460      274975    0.97%    2.38%
 'F'     70     37      6           6      121914    0.43%    2.82%
 'G'     71     38      7          23      312858    1.11%    3.92%
 'H'     72     39      8          39      244877    0.87%    4.79%
 'I'     73     40      9          30      264438    0.93%    5.72%
 'J'     74     41     10          27      220404    0.78%    6.50%
 'K'     75     42     11          46      306755    1.08%    7.59%
 'L'     76     43     12          34      258150    0.91%    8.50%
 'M'     77     44     13          92      329095    1.16%    9.66%
 'N'     78     45     14          83      326684    1.16%   10.82%
 'O'     79     46     15          91      365324    1.29%   12.11%
 'P'     80     47     16          87      423488    1.50%   13.61%
 'Q'     81     48     17          76      442600    1.56%   15.17%
 'R'     82     49     18         160      403789    1.43%   16.60%
 'S'     83     50     19         220      541710    1.92%   18.51%
 'T'     84     51     20         137      594089    2.10%   20.61%
 'U'     85     52     21          44      615082    2.17%   22.79%
 'V'     86     53     22         208      568834    2.01%   24.80%
 'W'     87     54     23        3535      298227    1.05%   25.85%
 'X'     88     55     24         694      136779    0.48%   26.34%
 'Y'     89     56     25        9816      784561    2.77%   29.11%
 'Z'     90     57     26       66100    16468153   58.22%   87.34%
 '['     91     58     27        1137     3517684   12.44%   99.77%
 '\'     92     59     28           0           0    0.00%   99.77%
 ']'     93     60     29           0           0    0.00%   99.77%
 '^'     94     61     30           0           0    0.00%   99.77%
 '_'     95     62     31           0       64281    0.23%  100.00%
Most of the quality characters have ASCII values of 89-90, and the distribution begins at around ASCII 61. This corresponds roughly to the Solexa/early Illumina encoding:
Description    ASCII range  ASCII offset  Quality score
fastq-solexa   59–126       64            −5 to 62
However, there are two differences. The first is the '!' character, which is the lowest score in Phred+33; I don't see why it would appear in a Solexa-encoded file.
The second is a few occurrences of '8', which would correspond to a Solexa quality of -8.
A Solexa score can be negative. However, the occurrence of scores of -8 and -31 (the score of '!') makes me wonder: is this really a Solexa score, and if not, what is it?
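For reference, the Q(33) and Q(64) columns in the table are just the ASCII code minus the offset. A minimal Python sketch of that arithmetic for the characters in question (the helper name `decode` is my own, not part of usearch):

```python
def decode(char, offset):
    """Return the score a quality character encodes under a given ASCII offset."""
    return ord(char) - offset

# Characters of interest from the table: the outliers and the bulk of the data.
for ch in "!8@Z[_":
    # Columns: character, ASCII code, score with offset 33, score with offset 64
    print(repr(ch), ord(ch), decode(ch, 33), decode(ch, 64))
```

With offset 64, '!' and '8' come out at -31 and -8, both below the -5 floor of the Solexa range quoted above.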
You can find the valid ranges of FASTQ scores in this Wikipedia article. Solexa-encoded scores are between -5 and 40.
The file I looked at has a range that does not fit any of the Illumina encodings in the article.
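That mismatch can be checked mechanically. Here is a rough sketch, assuming the three ASCII ranges usually quoted for the FASTQ variants (fastq-sanger 33–126, fastq-solexa 59–126, fastq-illumina 64–126); the `ENCODINGS` table and the function name are made up for illustration:

```python
# Assumed ASCII ranges: name -> (lowest ASCII, highest ASCII, offset)
ENCODINGS = {
    "fastq-sanger": (33, 126, 33),
    "fastq-solexa": (59, 126, 64),
    "fastq-illumina": (64, 126, 64),
}

def compatible_encodings(min_ascii, max_ascii):
    """Return the encodings whose ASCII range contains the observed range."""
    return [
        name
        for name, (lo, hi, _offset) in ENCODINGS.items()
        if lo <= min_ascii and max_ascii <= hi
    ]

print(compatible_encodings(33, 95))  # full observed range, '!' to '_'
print(compatible_encodings(64, 95))  # ignoring the rare characters below '@'
```

By range alone, 33-95 fits only the Phred+33 range, even though the bulk of the characters sit at ASCII 64 and above, where all three ranges overlap.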
Can you run testformat.sh from the BBMap suite on this file and post the result?
Edit:
testformat.sh seems to think that this is Illumina-encoded data: Phred+64, but it could be Illumina 1.3 or 1.5.
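To cross-check a verdict like this independently, the quality characters can be tallied directly from the file. The following is only a sketch and says nothing about how testformat.sh or usearch work internally; it assumes plain four-line FASTQ records (no wrapped sequence lines), and the function name and the toy reads are invented for the example:

```python
from collections import Counter
import io

def quality_char_counts(handle):
    """Count quality characters in a FASTQ stream with four lines per record."""
    counts = Counter()
    for i, line in enumerate(handle):
        if i % 4 == 3:  # the fourth line of each record holds the qualities
            counts.update(line.rstrip("\r\n"))
    return counts

# Tiny made-up input; in practice pass an open file handle instead.
example = io.StringIO("@r1\nACGT\n+\nZZ[!\n@r2\nACGT\n+\nYZZ_\n")
counts = quality_char_counts(example)

# Lowest and highest ASCII codes seen among the quality characters
print(min(map(ord, counts)), max(map(ord, counts)))
```

The resulting minimum and maximum ASCII codes can then be compared against the ranges of the known encodings.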