Is B-i a Valid FASTQ Score Range?
3
0
11.9 years ago

I received some data from a third-party provider where the FASTQ files have quality scores encoded in the range from B (ASCII 66) to i (ASCII 105). This range is not described in the Wikipedia entry on the FASTQ format, so is it valid?

fastq quality • 3.5k views
0

This is FASTQ data from what I believe is Illumina sequencing, processed with the Illumina 1.5+ pipeline (that remains to be confirmed).

2
11.9 years ago
Andreas ★ 2.5k

EDIT 2: This is not the correct answer (see EDIT below) and it should therefore not have been upvoted. Please see Istvan's answer below.

Actually, this range is valid and is mentioned in the Wikipedia article you cite. This looks like Illumina 1.3-1.7 with an ASCII offset of 64. So B translates to 2 (a special value marking nucleotides that should be ignored) and i to 41 (EDIT 1: sorry, I initially said 39; 41 is actually not expected). Here's the relevant part from the section "Encoding":

Starting with Illumina 1.3 and before Illumina 1.8, the format encoded a Phred quality score from 0 to 62 using ASCII 64 to 126 (although in raw read data Phred scores from 0 to 40 only are expected).
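The offset arithmetic can be checked with a few lines of Python. This is only a minimal sketch: the function name `decode_quality` is made up for illustration, and the default offset of 64 is the assumption under discussion, not something confirmed for this data.

```python
def decode_quality(qual, offset=64):
    """Convert a FASTQ quality string to a list of integer scores.

    `offset` is the ASCII offset of the assumed encoding
    (64 for Illumina 1.3-1.7, 33 for Sanger-style files).
    """
    return [ord(ch) - offset for ch in qual]


# The two extreme characters from the question, under both offsets.
print(decode_quality("Bi"))             # [2, 41]
print(decode_quality("Bi", offset=33))  # [33, 72]
```

With an offset of 64 the observed range maps to scores 2 to 41, which is one above the expected maximum of 40. With an offset of 33 the same characters would mean scores 33 to 72.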

Andreas

1

The "i" is ASCII 105. With an offset of 64 that would be 41, which, going by the paragraph you cite, is not expected.

2
11.9 years ago

Capping at a maximal quality value of 40 is a convention that most instruments have adopted by now. Technically, Phred quality scores can go from 0 to 93 (the range of the Sanger encoding), so the use of the quality 'i' does not necessarily indicate a problem.

That being said, it is a bit suspicious when you see quality scores that are just outside the usual range. Plus, this looks like one of the older quality encodings. If so, you may have other problems: the probability formula was defined slightly differently for some of these encodings, so the values are not directly comparable anyhow. (This is how I recall it.)
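To make the difference concrete, here is a small sketch of the two score definitions as I understand them from the paper cited below: Phred scores are based on the error probability p, while the early Solexa scores are based on the odds p/(1-p). The function names are illustrative only. As far as I know only the early Solexa-style files used the odds-based definition, so whether it applies to this data is an open assumption.

```python
import math


def phred_to_error_prob(q):
    """Phred: Q = -10 * log10(p), so p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def solexa_to_error_prob(q):
    """Solexa: Q = -10 * log10(p / (1 - p)), so p = 1 / (1 + 10^(Q/10))."""
    return 1 / (1 + 10 ** (q / 10))


def solexa_to_phred(q_solexa):
    """Re-express a Solexa score on the Phred scale."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


# The two scales nearly agree for high scores and diverge for low ones.
for q in (40, 10, 0, -5):
    print(q, round(solexa_to_phred(q), 2))
```

The loop shows why the mix-up matters mostly at the low end: a Solexa score of 0 corresponds to an error probability of 0.5, whereas a Phred score of 0 corresponds to an error probability of 1.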

Peter's paper has more details on the nitty-gritty: The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants, Nucl. Acids Res. (2010)

0

I read the paper and am utterly depressed. FASTQ is a stupid format with the wrong choice of delimiters. A tab-separated table with one line per entry would be better. As for the mess of encodings, I blame Solexa.

0
11.9 years ago
lelle ▴ 830

As there is no actual standard for FASTQ, it is not possible to say what a "valid" FASTQ file is. It all depends on what the tools you want to use will expect and accept.
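One way a tool could deal with this is to check which of the commonly described variants the observed characters are compatible with. The sketch below is only that: the `VARIANTS` table and the function name are hypothetical, and the ASCII ranges are the theoretical ones as I recall them from the Wikipedia article, not an official specification.

```python
# Theoretical ASCII ranges per variant (assumed, see the note above).
VARIANTS = {
    "Sanger / Illumina 1.8+ (Phred+33)": (33, 126),
    "Solexa (Solexa+64)": (59, 126),
    "Illumina 1.3+ (Phred+64)": (64, 126),
    "Illumina 1.5+ (Phred+64, B = special Q2)": (66, 126),
}


def compatible_variants(quality_strings):
    """Return the variants whose ASCII range contains every observed character.

    Raises ValueError if no quality characters are given.
    """
    codes = [ord(ch) for qual in quality_strings for ch in qual]
    lo, hi = min(codes), max(codes)
    return [
        name
        for name, (vmin, vmax) in VARIANTS.items()
        if vmin <= lo and hi <= vmax
    ]


# The range from the question (B to i) rules nothing out by itself.
print(compatible_variants(["Bi"]))
# A character below ASCII 59 leaves only the offset-33 encoding.
print(compatible_variants(["#5"]))
```

For the B to i range every variant in the table remains possible, so a tool cannot decide from the range alone and needs either additional heuristics or an explicit setting from the user.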

0

I am writing the tools here. So what should they expect and accept?
