Question

Illumina X Ten samples have Phred scores outside the range [0,41]

4

8.2 years ago

Wai Yi Leung ▴ 60

In our lab we have been working with Illumina X Ten samples for quite some time.

Recently we took a more in-depth look at the sequences delivered from the Illumina X Ten runs.

We looked at the Phred scores from several samples and found scores that exceed the boundaries set in the Illumina 1.8+ spec (https://en.wikipedia.org/wiki/FASTQ_format).

The spec specifies the range 0..41.

In our samples, however, we find records like the following:

@ST-E00294:24:H5375CCXX:5:1101:7384:1836 1:N:0
TCTATACCTATCAATTGTCCCGTANNNAGANCNTTCTCGNCTNCNNNTCTTCNNANNNNCCCNNTGTTATTCNCATCGACTTCCCCNNTTNTTNNNANNTGTAACCTNNTCNANNCCACCNNTGATTCCTTTTATTGGTCATCTTTAGTC
+
AAAF,KKAFKKFFFAKKA7F7F,,###AF,#F#7FF77,#AF#,###F<FKK##F####,7,##,A,,,,,K#KF,,,,,,,,<<,##,,#,A###<##KKA,,,77##,,#A##,7,,,##7FF7<,7<FKKFKKKK,,,<,,F<,,7,

You can see that the character K appears in the quality string; with the ASCII-33 offset it encodes Phred Q=42.
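For reference, a minimal Python sketch of that decoding, assuming the Sanger / Illumina 1.8+ offset of 33. The helper name `phred_scores` is made up for illustration, and only the first part of the quality string above is used:

```python
def phred_scores(quality, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores."""
    return [ord(ch) - offset for ch in quality]

# Prefix of the quality string from the read shown above.
quality = "AAAF,KKAFKKFFFAKKA7F7F,,###AF,#F#7FF77"
scores = phred_scores(quality)
distinct = sorted(set(scores))

print(distinct)      # [2, 11, 22, 32, 37, 42]
print(max(scores))   # 42, from the character 'K' (ASCII 75)
```

Only six distinct values appear in this prefix, with 42 as the highest.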

Does anyone know which spec the Illumina X Ten follows, or am I seeing a bug in the BaseSpace software for these machines?

sequencing next-gen quality phred-encoding • 5.4k views

ADD COMMENT • link updated 8.2 years ago by Dan D 7.4k • written 8.2 years ago by Wai Yi Leung ▴ 60

0

As @Dan points out below, scores >40 are legal: http://www.illumina.com/content/dam/illumina-marketing/documents/products/technotes/technote_understanding_quality_scores.pdf

ADD REPLY • link 8.2 years ago by GenoMax 141k

0

This document unfortunately says nothing about the range of acceptable values.

ADD REPLY • link 8.2 years ago by sndrtj ▴ 180

4

I don't think Illumina has any sort of company-wide standard on quality scores. They have as many sets of quality score meanings as they do versions of base-calling software. I've seen many Illumina files in which bases with a quality score of 0 (but still called with ACGT rather than N) were correct 100% of the time - higher than any other quality score. Sometimes 2 is "special", sometimes it isn't. Sometimes the values are binned, and the bins will change between software versions. The only constant is that none of them are ever calibrated, so the only reliable way to determine their meaning is through observation and measurement.

It's useful to be able to deal with values outside of what I consider the normal FASTQ range of 0-41 because there are some programs that violate that range. Read-merging and other error-correction tools are the worst offenders, which may give quality scores up to ASCII 99, or 122 (z), or 126, or whatever the programmer thought was best.

There is usually no reason to cap quality scores at any particular value (up to 126, which ends the printable range) except to solve a problem that Illumina singlehandedly created - their own inability to standardize on a quality scheme. They are the only organization to use ASCII-64 or ASCII-66 encodings (sometimes containing negative numbers, and thus dropping below 64 or 66). As a result, it will be forever difficult to auto-detect the quality-encoding format of Illumina data. The main reason for programs to act strangely upon reading quality scores over 41 is to prevent old Illumina ASCII-64/66 files from being processed as ASCII-33.

The lack of quality resolution incurred by capping things at Q41 is not overly important at present because no platform is capable of consistently delivering raw reads at >Q41. Aside from Illumina's Q0 non-N bases, which are frankly astonishing - they should aim for more of those.

ADD REPLY • link 8.2 years ago by Brian Bushnell 20k

1

Thank you for your elaborate answer. For our specific use case, we are implementing a FASTQ validator in our analysis pipeline to check the validity of the input (and to decide whether the pipeline should continue or halt).

Because we assume the range is 0..41, our validation fails for FASTQ files containing Q42 Phred scores. The challenge now is to write rules that properly identify the quality encodings (Solexa, Sanger, Illumina 1.3/1.5/1.8+, plus the Q42 "spec").

I share your concerns: sequencing companies that do not conform to their own specifications make the job of software developers and researchers challenging. How can we tell that we are comparing the same information when, as in this case, the quality scale moves from 0..41 to 0..42 (with Q42 representing a relative value of 1.025 and setting a new ceiling)?
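On that last point, a quick sketch using the standard Phred definition Q = -10 * log10(P), where P is the estimated error probability. The scale is logarithmic, so moving from Q41 to Q42 changes the claimed error probability by a factor of about 1.26, not 1.025. The function name is only illustrative:

```python
def error_probability(q):
    """Estimated base-call error probability for a Phred score q."""
    return 10 ** (-q / 10)

for q in (40, 41, 42):
    print(q, f"{error_probability(q):.2e}")

# How much lower the claimed error rate is at Q42 than at Q41.
ratio = error_probability(41) / error_probability(42)
print(round(ratio, 2))  # 1.26
```

Whether the instrument's Q42 bases really reach that error rate is a separate calibration question that the encoding itself cannot answer.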

ADD REPLY • link 8.2 years ago by Wai Yi Leung ▴ 60

1

It is impossible to write a program that will always be able to correctly determine the quality-score encoding of fastq files. BBMap comes with a tool called testformat.sh that uses various heuristics to guess the quality encoding (and other things, like whether the reads are interleaved, whether they are fasta, fastq, or sam, etc), but it cannot be guaranteed to be correct, as the quality score ranges of different encodings overlap.

Sometimes you can be certain about the offset - if you encounter an N with a quality score of "!", it's ASCII-33. You still can't tell the specific software version, of course. Probably, if you scan far enough into a file, you will eventually encounter an N encoded in a way that makes the encoding certain. But, I've seen Illumina files with N's getting positive quality scores, so that's not certain either! They are rare, though. Illumina usually gives Ns a quality score of 0.
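A rough sketch of that kind of offset check, which is not how testformat.sh actually works. It assumes Solexa+64 scores can go as low as -5 (character ';', ASCII 59), so any character below that can only come from an ASCII-33 file; everything else stays a guess. `guess_offset` is a hypothetical name:

```python
def guess_offset(quality_strings):
    """Return (offset, certain) for an iterable of FASTQ quality strings.

    Assumption: the lowest character an ASCII-64 file can contain is ';'
    (ASCII 59, Solexa score -5). Anything lower proves ASCII-33.
    """
    codes = [ord(ch) for qual in quality_strings for ch in qual]
    if not codes:
        raise ValueError("no quality characters seen")
    lo, hi = min(codes), max(codes)
    if lo < 33 or hi > 126:
        raise ValueError("quality characters outside printable ASCII")
    if lo < 59:
        return 33, True
    # Everything is >= ';'. This is consistent with ASCII-64, but an
    # ASCII-33 file containing only high scores would look the same.
    return 64, False

print(guess_offset(["AAAF,KKAFKK###"]))  # (33, True)
print(guess_offset(["hhhhgggfffB"]))     # (64, False)
```

The second return value matters more than the first: a validator can act on a certain answer and fall back to a user-supplied encoding otherwise.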

BBMap's TestFormat tool only looks at the first two reads, so it's very fast. But if you want to increase confidence, you could read the whole file and calculate the frequencies of quality assignments, and hopefully encounter Ns which uniquely identify the file's quality encoding. Actually, I should add that capability as an option...
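Something along these lines would do the full-file scan. This is a sketch that assumes strict four-line FASTQ records (no wrapped sequences); it is not BBMap code, and `quality_histogram` is an invented name:

```python
import io
from collections import Counter

def quality_histogram(handle):
    """Count quality characters over all bases and over N bases only."""
    all_counts, n_counts = Counter(), Counter()
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        seq = handle.readline().rstrip("\r\n")
        handle.readline()  # the '+' separator line
        qual = handle.readline().rstrip("\r\n")
        if len(seq) != len(qual):
            raise ValueError(f"length mismatch in record {header.strip()}")
        all_counts.update(qual)
        n_counts.update(q for base, q in zip(seq, qual) if base in "Nn")
    return all_counts, n_counts

# Tiny in-memory example; a real run would pass an open file instead.
demo = io.StringIO("@r1\nACNT\n+\nAF#K\n@r2\nNNGG\n+\n##KK\n")
all_counts, n_counts = quality_histogram(demo)
print(all_counts.most_common())
print(n_counts)  # Counter({'#': 3})
```

If the N bases consistently carry '#' or '!', that points to ASCII-33; a 'B' or '@' on N bases would point to ASCII-64 instead, with the caveat above that Ns occasionally get positive scores.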

Or, if you have financial clout, you could call Illumina and tell them to start using standards.

ADD REPLY • link 8.1 years ago by Brian Bushnell 20k

score 4 · Answer 1 · 2016-02-29

4

8.2 years ago

Dan D 7.4k

It's not a bug. A different version of RTA is used on the HiSeqX line than on the HiSeq 2XXX series. These quality scores are simply encoded in a different Phred scale. Furthermore, the quality scores for the HiSeqX are "binned" so that you won't see the full range of the quality scale represented. I think this binning also occurs in the 3000/4000 line of sequencers.

1

Thank you for your reply. Technically, Phred scores can use the whole printable ASCII range (up to ASCII 126; 127 is not printable, I believe).

I would like to know which spec the new scale conforms to. I cannot find anything in the documentation showing that they defined a spec such as "Illumina 2.0" for this case.

ADD REPLY • link 8.2 years ago by Wai Yi Leung ▴ 60

1

AFAIK the scale has not changed. Extension of values beyond 40 has been allowed for some time (with the release of V3 chemistry for HiSeq in mid-2011, if I remember it right): http://www.illumina.com/science/education/sequencing-quality-scores.html

ADD REPLY • link 8.2 years ago by GenoMax 141k

0

The best reference I know of specific to HiSeqX is the HiSeqX User Guide, specifically Appendix B which starts on page 53. It's not really that detailed, unfortunately. I can tell you that right now the highest quality score bin is K.

ADD REPLY • link 8.2 years ago by Dan D 7.4k

0

I suspect Illumina writes its specifications based on observation of outputs, by a completely different team, with minimal communication. Considering how irrational this would be as a conscious design choice - it has huge disadvantages, and zero advantages, especially considering that new Illumina sequencers can't produce actual Q42 bases - I am willing to give them the benefit of the doubt and call it a bug.

ADD REPLY • link 8.1 years ago by Brian Bushnell 20k