Question: Fastq Format Confusion
6
6.9 years ago by
State College, PA, USA
Biomonika (Noolean)3.1k wrote:

Hi!

In the article "The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants" I found this table:

Description              ASCII range    Offset    Score type    Score range
Sanger standard          33–126         33        PHRED         0 to 93
Solexa/early Illumina    59–126         64        Solexa        −5 to 62
Illumina 1.3+            64–126         64        PHRED         0 to 62
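
To make the table concrete: each quality character stores a score as its ASCII code minus the offset. Below is a minimal sketch of that rule, assuming only the offsets listed in the table. The dictionary keys and function names are made up for this illustration; `solexa_to_phred` uses the conversion formula given in the same article.

```python
import math

# Offsets from the table above. The keys are labels invented for this sketch.
OFFSETS = {"sanger": 33, "solexa": 64, "illumina1.3": 64}


def decode_quality(qual, variant):
    """Return the integer scores encoded in one FASTQ quality string."""
    offset = OFFSETS[variant]
    return [ord(ch) - offset for ch in qual]


def solexa_to_phred(q_solexa):
    """Convert a Solexa score to the equivalent PHRED score (a float)."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)
```

The same character means very different things under the two offsets: `decode_quality("I", "sanger")` gives a score of 40, while `decode_quality("I", "illumina1.3")` gives 9. That overlap is why the variant has to be known or guessed before the scores can be interpreted.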

From the article (2010) it seemed that the consensus might be to use Illumina 1.3+ coding in the future. Then I came across this:

"Note that the latest Illumina CASAVA 1.8 pipeline (released June 2011), outputs in fastq-sanger rather than Illumina 1.3+. Thus Illumina 1.3+ and other Illumina scoring metrics are unlikely to be encountered if you are using Illumina sequencing data generated after July 2011" from this source

So, which is it? Can I rely on FastQC to display base qualities correctly? Based on what I read, I would say no: the variants use overlapping ranges of characters, so a quality string alone does not always tell you which encoding was used. But there is no option to switch between the different formats in this program, is there?

Second, I have old 454 data. How can I determine its encoding? I have tried to google the common standard for 454, without much success.

I hope that you guys here are much more experienced than I am. How do you deal with the different FASTQ formats? Please share your experience. Thanks a lot!

ADD COMMENTlink modified 6.2 years ago by arkarachai.af10 • written 6.9 years ago by Biomonika (Noolean)3.1k
1

Just to add to the fastq bashing (not really helpful, I know): fastq is not a format in the strict sense, because it is lacking a proper definition allowing for deterministic parsing of the 'format'. I agree with Ido about the punishment.

ADD REPLYlink modified 6.4 years ago • written 6.4 years ago by Michael Dondrup46k

Agreed. fastq is horribad.

ADD REPLYlink written 6.4 years ago by Damian Kao15k
7
6.9 years ago by
Ido Tamir5.0k
Austria
Ido Tamir5.0k wrote:

Yes, FastQC can reliably detect which FASTQ format it is, unless the quality strings contain only the characters BCDEFGHI. But this is highly unlikely. It would mean that all the data is either crap (Phred+64, i.e. Illumina before 1.8, Q 2 to 9) or only good (Phred+33, i.e. Sanger or Illumina 1.8+, Q >= 33).
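
The detection described above can be sketched as a check on the lowest and highest quality characters seen. This is only a heuristic written for illustration, not FastQC's actual code: the function name is invented, and the upper threshold assumes that Phred+33 data never exceeds Q41 (character 'J'), which holds for typical Illumina 1.8+ output but is not guaranteed by the Sanger format itself.

```python
def guess_offset(quality_strings):
    """Guess the ASCII offset of FASTQ quality strings.

    Returns 33, 64, or None when the characters fit both offsets.
    """
    codes = [ord(ch) for qual in quality_strings for ch in qual]
    if not codes:
        raise ValueError("no quality characters given")
    lowest, highest = min(codes), max(codes)
    if lowest < 59:
        # Characters below ';' (59) cannot occur with an offset of 64,
        # not even for Solexa scores, which start at -5.
        return 33
    if highest > 74:
        # Assumption: offset-33 data stays at or below 'J' (Q41).
        return 64
    return None
```

Passing the quality lines of the first few thousand reads is usually enough, since real data almost always contains at least one low-quality base that settles the question.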

http://en.wikipedia.org/wiki/FASTQ_format#Encoding provides a better view on the problem than your table.

Like others, I think unaligned BAM is a step in the right direction, but even so I know which pipeline my old FASTQ files came from. The FASTQ variants are all Illumina's fault. They should get kicked somewhere for this. Hard. 454 uses Phred+33, I guess, and a Bing search told me the same.

ADD COMMENTlink written 6.9 years ago by Ido Tamir5.0k
3
6.9 years ago by
Chris Fields2.1k
University of Illinois Urbana-Champaign
Chris Fields2.1k wrote:

@Noolean, the 2010 paper indicated that everyone should standardize on Sanger-based scoring, +33. This is the standard adopted by SRA and ENA (I believe), and it is the quality string encoding for SAM/BAM. Also, Illumina pipelines should be using Sanger-based encoding for FASTQ these days, which is likely what that July 2011 remark refers to (i.e. CASAVA updates now default to Sanger).
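
For old files that still use the +64 offset, moving to the Sanger standard only requires shifting every quality character down by 31 (64 minus 33). A minimal sketch, assuming the input really is Illumina 1.3+ PHRED data (Solexa scores would need a score conversion as well, not just a shift); the function name is made up for this example.

```python
def phred64_to_phred33(qual):
    """Re-encode one Phred+64 quality string as Phred+33 (Sanger)."""
    if any(ord(ch) < 64 for ch in qual):
        # A character below '@' would decode to a negative PHRED score.
        raise ValueError("input is not valid Phred+64")
    return "".join(chr(ord(ch) - 31) for ch in qual)
```

For example, `phred64_to_phred33("hhB@")` returns `"II#!"`, which encodes the same scores (40, 40, 2, 0) with the offset of 33.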

I likewise support a move to something analogous to BAM for sequencing data. However, I'm not sure how it makes sense to use BAM itself as a storage mechanism, primarily b/c the entire format and indexing algorithm are based on efficient storage and retrieval of data using an R-tree/binning-based index of reads mapped against reference sequences (the latter included in the BAM header). Has there been an in-depth discussion of this?

But maybe that's a point best left for another question...

ADD COMMENTlink written 6.9 years ago by Chris Fields2.1k
1
6.9 years ago by
Gabriel R.2.6k
Center for Geogenetik Københavns Universitet
Gabriel R.2.6k wrote:

How to deal with FASTQ? Simple: don't use it, use unaligned BAM. Avoid it whenever possible, although some people still release tools that only read FASTQ.

ADD COMMENTlink modified 6.9 years ago • written 6.9 years ago by Gabriel R.2.6k
1

Isn't BAM outdated and replaced by CRAM http://www.ebi.ac.uk/ena/about/cram_toolkit ? More seriously, I'm not sure that a tool (a CASAVA alternative) exists that understands raw Illumina output and writes sequences and qualities in BAM format. So at some point in the workflow one has to deal with FASTQ files anyway.

ADD REPLYlink modified 6.9 years ago • written 6.9 years ago by Manu Prestat3.9k
1

Just because somebody published something doesn't make something else outdated. I had enough problems getting my users used to BAM. The main reason for it is the metadata in the file. And yes, the tools exist, and I only see FASTQ files because bowtie and other tools use them as input. https://github.com/wtsi-npg/illumina2bam. But as an intermediate format FASTQ is o.k. (the encoding is then always +33).

ADD REPLYlink written 6.9 years ago by Ido Tamir5.0k

Sorry to not have been clear enough, "more seriously" meant it was a joke. Of course BAM IS NOT outdated! Thanks for sharing the illumina2bam tool.

ADD REPLYlink written 6.9 years ago by Manu Prestat3.9k