Fastq Format Redundancy
10.4 years ago
Irsan ★ 7.8k

Each read in a fastq file takes up 4 consecutive lines (read id, read sequence, qual id and qual string). What do you need the qual id for? Isn't the read id enough for identification? Besides, in most fastq files (if not all?) the qual id is just "+".
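To make the four-line layout concrete, here is a minimal Python sketch of a reader. It assumes strict four-line records (no wrapped sequence or quality lines), and the names `FastqRecord` and `read_fastq` are made up for illustration, not taken from any library. It accepts a bare "+" on the third line and, when a qual id is present, checks that it repeats the read id.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from FASTQ lines, assuming exactly four lines per read."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records or at the end
        try:
            sequence = next(it).rstrip("\r\n")
            separator = next(it).rstrip("\r\n")
            quality = next(it).rstrip("\r\n")
        except StopIteration:
            raise ValueError(f"truncated record for {header!r}") from None
        if not header.startswith("@"):
            raise ValueError(f"expected '@' at start of header, got {header!r}")
        if not separator.startswith("+"):
            raise ValueError(f"expected '+' line, got {separator!r}")
        # The qual id is optional; if it is present it should match the read id.
        if len(separator) > 1 and separator[1:] != header[1:]:
            raise ValueError(f"qual id does not match read id for {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch for {header!r}")
        yield FastqRecord(header[1:], sequence, quality)
```

Usage would be something like `for rec in read_fastq(open("reads.fastq")): ...`, where the file name is a placeholder.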

fastq • 2.4k views

Yup, it would be better to have something like:

Read_ID \t read_sequence \t qual_sequence
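As a rough sketch of that layout, the Python below flattens four-line records into one tab-separated line per read. It assumes strict four-line records and read ids that contain no tab characters; `fastq_to_tsv` is a hypothetical helper name, not an existing tool.

```python
from itertools import islice
from typing import Iterable, Iterator


def fastq_to_tsv(lines: Iterable[str]) -> Iterator[str]:
    """Turn four-line FASTQ records into 'id<TAB>sequence<TAB>quality' lines."""
    it = iter(lines)
    while True:
        # Take the next four lines as one record.
        chunk = [line.rstrip("\r\n") for line in islice(it, 4)]
        if not chunk:
            return  # input exhausted
        if len(chunk) != 4:
            raise ValueError("truncated record at end of input")
        header, sequence, separator, quality = chunk
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at {header!r}")
        # The qual id line is dropped; only the '@' header is kept as the id.
        yield "\t".join((header[1:], sequence, quality))
```

The trade-off is that the record boundary is then just the newline, so a single damaged line affects only one read.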


I personally like that there are 4 lines, as I prefer dividing by 4 to dividing by 3. If I grab the first 100/1,000/10,000 lines, I immediately know how many sequences there are. Or maybe I am just so used to 4 lines that I cannot step back and admit that 3 would be better :)


There's no reason I know of offhand that fastqs can't be completely replaced by BAM files. Put all your reads in there (marked unmapped), and you've got everything you need. Many aligners can take BAM as input these days.


Alas, there is always another catch.

The BAM SEQ column stores sequences as they align to the forward strand, so reads aligned to the reverse strand need to be reverse complemented to recover the original data. In addition, hard clipping is a valid alignment representation, but it means that some of the original information is lost. And with spliced alignments, getting back the original data is probably even more convoluted, especially since there is also read pairing to keep track of.
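For the strand issue alone, a minimal Python sketch might look like the following. It works on plain SAM text fields (FLAG, SEQ, QUAL, CIGAR) and uses the 0x10 flag bit that SAM defines for reverse-strand alignments. `original_read` is a hypothetical helper, it only complements A/C/G/T/N (other IUPAC codes are not handled), and it refuses hard-clipped records because the clipped bases cannot be recovered from that record.

```python
from typing import Tuple

# Complement table for plain bases only; other IUPAC codes pass through unchanged.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

REVERSE_STRAND = 0x10  # SAM FLAG bit: SEQ is stored reverse complemented


def original_read(flag: int, seq: str, qual: str, cigar: str = "*") -> Tuple[str, str]:
    """Return (sequence, quality) in the orientation the sequencer produced."""
    if "H" in cigar:
        # Hard-clipped bases are absent from SEQ, so the read is incomplete.
        raise ValueError("hard-clipped record: original read cannot be rebuilt")
    if flag & REVERSE_STRAND:
        # Undo the reverse complement; qualities are only reversed.
        return seq.translate(_COMPLEMENT)[::-1], qual[::-1]
    return seq, qual
```

This says nothing about pairing or about reads split across several alignment records, which are the harder parts mentioned above.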


Picard SamToFastq will take care of the strandedness problem if you need to recreate the original fastqs. If all you're doing is storing raw reads, the hard clipping info won't be an issue either (some would argue that you shouldn't be doing hard clipping anyway). Yes, I agree that the spliced alignments would be a pain, but not intractable - just need one brave soul to write the tool so the rest of us can use it :)

10.4 years ago

As described in this paper:

The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucl. Acids Res. (2010)

The FASTQ format was invented at the turn of the century at the Wellcome Trust Sanger Institute by Jim Mullikin, gradually disseminated, but never formally documented (Antony V. Cox, Sanger Institute, personal communication 2009).

So we can't be all that surprised that the format has some unspecified characteristics.

This prompted me to look up more information on Jim Mullikin; it turns out he is a Director at the NIH Intramural Sequencing Center.

10.4 years ago
SES 8.6k

I think it's there to serve as a delimiter. One of the most common issues we have to deal with is line endings getting mangled when files move between operating systems. When Fasta files get messed up because of this, you can at least tell where the sequence ends, because the next header starts with the greater-than sign. It would be chaotic if not for that delimiter. Likewise, I think it would be more difficult to find where the sequence ends and the quality line starts without this '+' delimiter.
