Question: Is B-i A Valid FASTQ Score Range?
0
6.2 years ago by
Martin A Hansen3.0k
Denmark
Martin A Hansen3.0k wrote:

I received some data from a third-party provider where the FASTQ files have scores encoded in the range from B (ASCII 66) to i (ASCII 105). This range is not described in the Wikipedia entry on the FASTQ format, so is this range valid?

fastq quality • 2.0k views
modified 6.2 years ago by Istvan Albert ♦♦ 80k • written 6.2 years ago by Martin A Hansen3.0k

This is FASTQ data from what I believe is Illumina sequencing and processing with the Illumina 1.5+ pipeline (that remains to be confirmed).

written 6.2 years ago by Martin A Hansen3.0k
2
6.2 years ago by
Andreas2.4k
Singapore
Andreas2.4k wrote:

EDIT 2: This is not the correct answer (see EDIT below) and it should therefore not have been upvoted. Please see Istvan's answer below.

Actually, this range is valid and is mentioned in the Wikipedia article you cite. This looks like Illumina 1.3-1.7 with an ASCII offset of 64. So B translates to 2 (a special value marking nucleotides that should be ignored) and i to 41 (EDIT 1: sorry, I initially said 39; and 41 is actually not expected). Here's the relevant part from the section "Encoding":

Starting with Illumina 1.3 and before Illumina 1.8, the format encoded a Phred quality score from 0 to 62 using ASCII 64 to 126 (although in raw read data Phred scores from 0 to 40 only are expected).
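To make the arithmetic above concrete, here is a minimal Python sketch that subtracts the ASCII offset from each quality character. The function name `decode_quality` and the constant name are made up for illustration; the offset of 64 is taken from the passage quoted above, and whether it really applies to this data is still an assumption.

```python
# Offset for Illumina 1.3 to 1.7, per the quoted Wikipedia passage.
# That this data actually uses it is an assumption, not a confirmed fact.
ILLUMINA_13_OFFSET = 64


def decode_quality(qual, offset=ILLUMINA_13_OFFSET):
    """Convert a FASTQ quality string into a list of integer scores."""
    scores = [ord(ch) - offset for ch in qual]
    # A negative score means the assumed offset cannot be right.
    if scores and min(scores) < 0:
        raise ValueError("quality character below the assumed offset")
    return scores


print(decode_quality("Bhi"))  # [2, 40, 41]
```

With this offset 'B' decodes to 2 and 'i' to 41, one above the 0 to 40 range the quoted passage says to expect in raw reads.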

Andreas

modified 6.2 years ago • written 6.2 years ago by Andreas2.4k
1

The "i" is ASCII 105. With an offset of 64 that would be 41, which, going by the paragraph you cite, is not expected.

written 6.2 years ago by lelle790
2
6.2 years ago by
Istvan Albert ♦♦ 80k
University Park, USA
Istvan Albert ♦♦ 80k wrote:

Capping at a maximal quality value of 40 is a convention that most instruments have adopted by now. Technically, Phred quality scores in the Sanger encoding (ASCII offset 33) go from 0 to 93. So the use of the quality 'i' does not necessarily indicate a problem.

That being said, it is a bit suspicious when you see quality scores that are just outside the usual range. Plus, this looks like one of the older quality encodings. If so, you may have other problems: the probability formula was defined slightly differently for some of these encodings, so the values are not directly comparable anyhow. (This is how I recall it.)
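As a rough illustration of the "slightly different" formula, the sketch below assumes the remark refers to the Solexa log-odds score, Q = -10 log10(p / (1 - p)), versus the Phred score, Q = -10 log10(p). The function names are hypothetical. Note that if the data really is Illumina 1.3 or later, the scores should already be Phred scores and only the offset differs.

```python
import math


def phred_to_error_prob(q):
    """Phred: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def solexa_to_error_prob(q):
    """Solexa: Q = -10 * log10(p / (1 - p)), i.e. a log-odds score."""
    odds = 10 ** (-q / 10)  # this is p / (1 - p)
    return odds / (1 + odds)


def solexa_to_phred(q):
    """Phred score with the same error probability as Solexa score q."""
    return 10 * math.log10(10 ** (q / 10) + 1)


# The two scales diverge at low quality and nearly coincide at high quality.
for q in (0, 10, 40):
    print(q, solexa_to_error_prob(q), solexa_to_phred(q))
```

Under these definitions a Solexa score of 0 means an error probability of 0.5, whereas a Phred score of 0 means an error probability of 1, which is why low scores from the two scales cannot be compared directly.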

Peter's paper has more details on the nitty-gritty: The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants, Nucl. Acids Res. (2010)

modified 6.2 years ago • written 6.2 years ago by Istvan Albert ♦♦ 80k

I read the paper and am utterly depressed. FASTQ is a stupid format: wrong choice of delimiters. A tab-separated table with one line per entry would be better. With respect to the mess of encodings, I blame Solexa.

written 6.2 years ago by Martin A Hansen3.0k
0
6.2 years ago by
lelle790
Berlin
lelle790 wrote:

As there is no actual standard for FASTQ, there is no way to say what a "valid" FASTQ file is. It all depends on what the tools you want to use will expect and accept.

written 6.2 years ago by lelle790

I am writing the tools here. So what should they expect and accept?
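One hedged possibility for such a tool is to guess the ASCII offset from the observed character range rather than rely on a declared format. The sketch below is only a heuristic under stated assumptions: the function name `guess_offset` is hypothetical, and the cutoff of 59 assumes that the lowest character in any offset-64 variant is ';' (Solexa score -5), so anything below it can only be offset 33.

```python
def guess_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) from the observed quality characters.

    Heuristic only: a file with uniformly high offset-33 scores contains no
    character below ';' and would be misreported as offset 64.
    """
    codes = [ord(ch) for qual in quality_strings for ch in qual]
    if not codes:
        raise ValueError("no quality characters supplied")
    lowest, highest = min(codes), max(codes)
    if lowest < 33 or highest > 126:
        raise ValueError("character outside printable ASCII 33..126")
    # Characters below ';' (ASCII 59) cannot occur with an offset of 64.
    if lowest < 59:
        return 33
    return 64


print(guess_offset(["BBBhhi"]))  # 64
print(guess_offset(["!I5"]))     # 33
```

A tool built this way should probably also let the user override the guess, since the character range alone cannot separate every encoding.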

written 6.2 years ago by Martin A Hansen3.0k