Question: Fastq Format Redundancy
Irsan (7.2k) wrote, 6.9 years ago:

Each read in a FASTQ file occupies 4 consecutive lines (read ID, read sequence, qual ID, and quality string). What do you need the qual ID for? Isn't the read ID enough for identification? Besides, in most FASTQ files (if not all?) the qual ID is just "+".
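
To make the four-line layout concrete, here is a minimal Python sketch of a reader for strictly four-line FASTQ. The names `FastqRecord` and `read_fastq` are made up for this illustration and do not come from any library. It assumes every record is exactly four lines and keeps whatever follows the "+" so you can see whether the qual ID repeats the read ID or is empty.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str    # text after '@' on line 1
    sequence: str   # line 2
    plus_line: str  # text after '+' on line 3 (often empty)
    quality: str    # line 4


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a strictly four-line FASTQ stream (assumption)."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        if not header.strip():
            continue  # tolerate blank lines between records
        header = header.rstrip("\r\n")
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], seq, plus[1:], qual)


example = "@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+read2\n!!!!\n"
records = list(read_fastq(io.StringIO(example)))
for rec in records:
    print(rec.read_id, repr(rec.plus_line))
```

In the example input, the first record has a bare "+" and the second repeats the read ID after it; both forms parse the same way, which is why the repeated ID carries no extra information.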


Yup, it would be better to have something like:

Read_ID \t read_sequence \t qual_sequence

Reply written 6.9 years ago by Ashutosh Pandey (12k)
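
The three-column layout suggested in this reply can be sketched as a small converter. This is only an illustration under stated assumptions: `fastq_to_tsv` is a hypothetical helper name, the input is strictly four-line FASTQ, and read IDs contain no tab characters. Quality strings are safe in this respect, since they use printable ASCII characters (codes 33 to 126) and so never contain a tab.

```python
from itertools import islice
from typing import Iterable, Iterator


def fastq_to_tsv(lines: Iterable[str]) -> Iterator[str]:
    """Turn four-line FASTQ records into 'id<TAB>sequence<TAB>quality' rows."""
    it = iter(lines)
    while True:
        # Take the next four lines as one record.
        chunk = [line.rstrip("\r\n") for line in islice(it, 4)]
        if not chunk:
            return  # clean end of input
        if len(chunk) != 4:
            raise ValueError("truncated FASTQ record")
        header, seq, _plus, qual = chunk  # the '+' line is simply dropped
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got {header!r}")
        yield "\t".join((header[1:], seq, qual))


tsv_rows = list(fastq_to_tsv(["@r1 desc\n", "ACGT\n", "+\n", "II#I\n"]))
print(tsv_rows[0])
```

Nothing is lost in the conversion as long as the "+" line is empty or just repeats the read ID.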

I personally like that there are 4 lines, as I prefer dividing by 4 to dividing by 3. If I grab the first 100/1000/10,000 lines, I immediately know how many sequences there are. Or maybe I am just so used to 4 lines that I cannot step back and admit that 3 would be better :))

Reply written 6.9 years ago by Biomonika (Noolean) (3.1k)
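
The divide-by-4 arithmetic from this reply, written out as a tiny sketch. The helper name `count_reads` is made up here, and the calculation only holds for strictly four-line FASTQ with no blank lines.

```python
def count_reads(n_lines: int) -> int:
    """Number of reads in n_lines of strictly four-line FASTQ."""
    reads, leftover = divmod(n_lines, 4)
    if leftover:
        # A line count that is not a multiple of 4 means the last record is cut off.
        raise ValueError(f"{n_lines} lines is not a whole number of records")
    return reads


counts = [count_reads(n) for n in (100, 1000, 10000)]
print(counts)  # [25, 250, 2500]
```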

There's no reason I know of offhand that FASTQ files can't be completely replaced by BAM files. Put all your reads in there (marked unmapped), and you've got everything you need. Many aligners can use BAM as input these days.

Reply written 6.9 years ago by Chris Miller (21k), modified 6.9 years ago

Alas, there is always another catch.

The BAM SEQ column displays sequences as they align on the forward strand, so sequences aligned on the reverse strand need to be reverse complemented to obtain the actual data. In addition, hard clipping is a valid alignment representation, but it means that some of the original information is lost. And if we consider spliced alignments, getting back the original data is probably even more convoluted, especially since there is also read pairing to keep track of.

Reply written 6.9 years ago by Istvan Albert ♦♦ 85k, modified 6.9 years ago
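
The strand and hard-clipping points in this reply can be sketched with plain SAM fields. This is a minimal illustration, not how any particular tool does it: `original_read` and `hard_clipped_bases` are hypothetical helper names, the inputs are the FLAG, SEQ, QUAL and CIGAR columns as strings and integers, and only unambiguous bases plus N are complemented. The 0x10 flag bit marks a read stored reverse complemented.

```python
import re

REVERSE_FLAG = 0x10  # SAM flag bit: SEQ is stored reverse complemented
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def original_read(flag: int, seq: str, qual: str) -> tuple[str, str]:
    """Return (sequence, quality) in the orientation the sequencer produced."""
    if flag & REVERSE_FLAG:
        # Undo the reverse complement; qualities are only reversed.
        return seq.translate(_COMPLEMENT)[::-1], qual[::-1]
    return seq, qual


def hard_clipped_bases(cigar: str) -> int:
    """Count bases removed by hard clipping; these cannot be recovered from SEQ."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(length) for length, op in ops if op == "H")


restored = original_read(16, "AACG", "!#%I")
lost = hard_clipped_bases("5H20M3H")
print(restored, lost)
```

A nonzero result from `hard_clipped_bases` is the case where the record alone no longer contains the full read, which is the information loss described above.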

Picard SamToFastq will take care of the strandedness problem if you need to recreate the original fastqs. If all you're doing is storing raw reads, the hard clipping info won't be an issue either (some would argue that you shouldn't be doing hard clipping anyway). Yes, I agree that the spliced alignments would be a pain, but not intractable - just need one brave soul to write the tool so the rest of us can use it :)

ADD REPLYlink written 6.9 years ago by Chris Miller21k
Istvan Albert ♦♦ 85k (University Park, USA) wrote, 6.9 years ago:

As described in this paper:

The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucl. Acids Res. (2010)

"The FASTQ format was invented at the turn of the century at the Wellcome Trust Sanger Institute by Jim Mullikin, gradually disseminated, but never formally documented (Antony V. Cox, Sanger Institute, personal communication 2009)."

So we can't be all that surprised that the format has some unspecified characteristics.

This prompted me to look up more information on Jim Mullikin; it turns out he is a Director at the NIH Intramural Sequencing Center.

SES (8.4k, Vancouver, BC) wrote, 6.9 years ago:

I think it's there to serve as a delimiter. One of the most common issues we have to deal with is line endings getting mangled when files move between operating systems. When FASTA files get messed up because of this, you can at least tell where the sequence ends because the next header starts with the greater-than sign. It would be chaotic if not for that delimiter. Likewise, I think it would be more difficult to find where the sequence ends and the quality line starts without the '+' delimiter.
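
The delimiter role is easiest to see with line-wrapped records, which the original Sanger FASTQ permitted. Below is a minimal sketch (the function name `read_wrapped_fastq` is made up for this example) that reads sequence lines until it meets a line starting with '+', then reads quality lines until it has as many characters as the sequence. Counting characters matters because a quality line may itself begin with '@' or '+'. Both "\n" and "\r\n" line endings are stripped.

```python
from typing import Iterable, Iterator, Tuple


def read_wrapped_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from possibly line-wrapped FASTQ."""
    it = (line.rstrip("\r\n") for line in lines)
    for header in it:
        if not header:
            continue  # skip blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got {header!r}")
        seq_parts = []
        for line in it:
            if line.startswith("+"):
                break  # the '+' line marks the end of the sequence
            seq_parts.append(line)
        else:
            raise ValueError(f"no '+' line found for {header!r}")
        seq = "".join(seq_parts)
        qual = ""
        # Read quality lines by length, not by looking for the next '@'.
        while len(qual) < len(seq):
            line = next(it, None)
            if line is None:
                raise ValueError(f"quality string too short for {header!r}")
            qual += line
        if len(qual) != len(seq):
            raise ValueError(f"quality string too long for {header!r}")
        yield header[1:], seq, qual


wrapped = ["@r1\r\n", "ACGT\r\n", "AC\r\n", "+\r\n", "@III\r\n", "+I\r\n",
           "@r2\n", "GG\n", "+r2\n", "!!\n"]
parsed = list(read_wrapped_fastq(wrapped))
print(parsed)
```

Without the '+' line, the parser would have no way to tell the wrapped sequence line "AC" from the start of the quality string.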
