Fastq Format Confusion
8.9 years ago

Hi!

In the article "The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants", I found this table:

Variant                  ASCII range   Offset   Score type   Score range
Sanger standard          33–126        33       PHRED        0 to 93
Solexa/early Illumina    59–126        64       Solexa       −5 to 62
Illumina 1.3+            64–126        64       PHRED        0 to 62
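
To make the offsets in the table concrete, here is a minimal sketch of how a quality string maps to scores: each character's ASCII code minus the variant's offset. The function name and the dictionary keys are labels made up for this illustration, not part of any tool.

```python
# Offsets as listed in the table; the keys are hypothetical labels for this sketch.
OFFSETS = {
    "sanger": 33,        # PHRED scores, ASCII 33-126
    "solexa": 64,        # Solexa scores, ASCII 59-126
    "illumina1.3": 64,   # PHRED scores, ASCII 64-126
}


def decode_quality(qual, variant):
    """Return the integer scores encoded by one FASTQ quality string."""
    offset = OFFSETS[variant]
    return [ord(ch) - offset for ch in qual]


# The same score of 40 is written as 'I' in Sanger and 'h' in Illumina 1.3+.
print(decode_quality("!I", "sanger"))       # lowest Sanger character and 'I'
print(decode_quality("@h", "illumina1.3"))  # lowest Illumina 1.3+ character and 'h'
print(decode_quality(";h", "solexa"))       # Solexa scores can be negative
```

Note that Solexa scores are on a different scale from PHRED scores, so subtracting the offset is only the first step for that variant.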


From the article (2010) it seemed that the consensus might be to use Illumina 1.3+ coding in the future. Then I came across this:

"Note that the latest Illumina CASAVA 1.8 pipeline (released June 2011), outputs in fastq-sanger rather than Illumina 1.3+. Thus Illumina 1.3+ and other Illumina scoring metrics are unlikely to be encountered if you are using Illumina sequencing data generated after July 2011" from this source

So, which is it? Can I rely on FastQC to display base qualities correctly? Based on what I have read, I would say no: a fastq file may contain only characters that are valid in more than one quality encoding, so the encodings cannot always be told apart. But there is no way to switch between the different formats in this program, is there?

Second, I have old 454 data. How can I determine its encoding? I have tried to google the common standard for 454, without much success.

Hope that you guys here are much more experienced. How do you deal with different fastq formats? Please, share your experience. Thanks a lot!

fastq format conversion quality scoring • 8.0k views

Just to add to the fastq bashing (not really helpful, I know): fastq is not a format in the strict sense, because it is lacking a proper definition allowing for deterministic parsing of the 'format'. I agree with Ido about the punishment.


8.9 years ago
Ido Tamir 5.2k

Yes, FastQC can reliably detect which fastq variant it is, unless the quality strings contain only the characters BCDEFGHI. But this is highly unlikely: it would mean the data is either all crap (Phred+64, Illumina before 1.8, Q2 to Q9) or all good (Phred+33, Illumina 1.8+, Q >= 33).
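
The idea can be sketched as a range check on the ASCII codes seen in the quality lines. This is only an illustration of the reasoning, not FastQC's actual code; the function name and return labels are made up, and the upper-bound test assumes Phred+33 data never exceeds Q41 ('J'), which is typical for Illumina output but not true for the full Sanger range.

```python
def guess_encoding(quality_strings):
    """Guess the FASTQ quality encoding from observed characters (heuristic)."""
    codes = [ord(ch) for q in quality_strings for ch in q]
    if not codes:
        raise ValueError("no quality characters to inspect")
    lo, hi = min(codes), max(codes)
    if lo < 59:
        # Characters below ';' exist only in the +33 encoding.
        return "phred33"
    if lo < 64:
        # ';' to '?' are Solexa scores -5 to -1 with offset 64.
        return "solexa64"
    if hi > ord("J"):
        # Assumption: +33 data tops out at 'J' (Q41), so anything higher is +64.
        return "phred64"
    # Everything sits in the overlap (e.g. only BCDEFGHI): cannot decide.
    return "ambiguous"


print(guess_encoding(["IIII#5"]))    # contains '#', only valid with offset 33
print(guess_encoding(["hhhhB"]))     # contains 'h', above the assumed +33 ceiling
print(guess_encoding(["BCDEFGHI"]))  # the undecidable case described above
```

The more reads are scanned, the less likely it is that every character falls in the overlapping range.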

http://en.wikipedia.org/wiki/FASTQ_format#Encoding provides a better view on the problem than your table.

Like others, I think unaligned BAM is a step in the right direction, but in any case I know which pipeline my old fastq files come from. The fastq variants are all Illumina's fault. They should get kicked somewhere for this. Hard. 454 uses Phred+33, I guess, and Binging it told me the same.

8.9 years ago
Chris Fields ★ 2.2k

@Noolean, the 2010 paper indicated that everyone should standardize on Sanger-based scoring (+33). This is the standard adopted by SRA and ENA (I believe), and it is the quality string encoding for SAM/BAM. Also, Illumina pipelines should be using Sanger-based encoding for FASTQ these days, which is likely what that July 2011 remark means (e.g. CASAVA updates now default to Sanger).
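
For older Illumina 1.3+ files, moving to the Sanger encoding is just a shift of each quality character from offset 64 to offset 33. A minimal sketch, assuming the input really is Phred+64 (the function name is made up for this example, and Solexa-scaled scores would need a score conversion as well, which is not shown):

```python
def phred64_to_phred33(qual):
    """Re-encode one Illumina 1.3+ (Phred+64) quality string as Sanger (Phred+33)."""
    converted = []
    for ch in qual:
        score = ord(ch) - 64
        if score < 0:
            # Negative scores mean the input is Solexa or already +33, not Phred+64.
            raise ValueError(f"character {ch!r} is not valid Phred+64")
        converted.append(chr(score + 33))
    return "".join(converted)


# 'h' (Q40) becomes 'I', 'B' (Q2) becomes '#', '@' (Q0) becomes '!'.
print(phred64_to_phred33("hB@"))
```

Running this on data that is already +33 will usually fail with the error above, which is a useful safety check.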

I likewise support a move to something analogous to BAM for sequencing data. However, I'm not sure how it makes sense to use BAM itself as a storage mechanism, primarily b/c the entire format and indexing algorithm are based on efficient storage and retrieval of data using an R-tree/binning-based index of reads mapped against reference sequences (the latter included in the BAM header). Has there been an in-depth discussion of this?

But maybe that's a point best left for another question...

8.9 years ago
Gabriel R. ★ 2.8k

How to deal with fastq? Simple: don't use it, use unaligned BAM. Avoid fastq whenever possible, although some people still release tools that only read fastq.


Isn't BAM outdated and replaced by CRAM http://www.ebi.ac.uk/ena/about/cram_toolkit ? More seriously, I'm not sure that a tool (a CASAVA alternative) exists which understands raw Illumina output and writes sequences and qualities in BAM format. So, somewhere in the workflow, one has to deal with fastq files anyway.


Just because somebody published something doesn't make something else outdated. I had enough problems getting my users used to BAM. The main reason for BAM is the metadata in the file. And yes, the tools exist (https://github.com/wtsi-npg/illumina2bam), and I only see fastq files because bowtie and other tools take them as input. But as an intermediate format fastq is O.K. (the encoding is then always +33).


Sorry to not have been clear enough, "more seriously" meant it was a joke. Of course BAM IS NOT outdated! Thanks for sharing the illumina2bam tool.