How Common Are Multi-Line Fastq Files?
1
4
Entering edit mode
13.2 years ago
ALchEmiXt ★ 1.9k

The general format of a fastq file is (sequence block):

@SEQHEADERandmorestuff AGCTGTGTATGTGTGTGSTGTSGTTACGTGTATCGATCGCTGCTA...       +SEQHEADERandmorestuff                                                 <--optional header but required + 00300QUALITYSCORESINILLUMINAORSANGERFORMAT.....

(next sequence block)

Since the + and the @ can also be used as quality scores they are not great regexp grep patterns ^+ or ^@. As well as that the headers seem to be used freely as well. Most files seem to stick to the 4 line (defined by \n) structure though and can be easily parsed or split. In the past I used to use a perl regexp patterns that turned out to be quite unique for the header but that is not failsafe:

$count++ if /^@[A-Za-z0-9_.:#\-\/]+\n/; # the hash is a valid char!

However the fastq format is quite loose defined also regarding the different used quality scoring systems :(. Even the "official wiki" tells that wrapped/MULTIPLE line fastq formats are valid.

My question: how common are these wrapped/multi-line files and which tools would generally generate such output? Or is it safe to assume for a new tool that it sticks to the 4 line convention?

fastq parsing • 7.7k views
ADD COMMENT
3
Entering edit mode

We have had this discussion earlier: Split Fastq Files Into Chunks Of 1M Reads , but after that I don't dare to give you an answer any more ;)

In short, both viewpoints are correct: given that 'official' definition, multiline fastq is valid. on the other hand, nobody has ever shown me a fastq file like this (of course I can make one). But how authoritative is the 'official' 'standard' anyway (this is not XML), given there are at least 3 standards for encoding the qualities. thus reading fastq is never safe...

ADD REPLY
1
Entering edit mode

And your question is maybe related to parsing fastq, so look here.

ADD REPLY
0
Entering edit mode

Your expression should be okay. If ^@ matches it should never be in a quality line.

ADD REPLY
0
Entering edit mode

Have looked over the earlier examples. Missed the one on sam tools though..... The '@' is a quality score so just using ^Q in this example will fail (or not?):

@ILLUMINAtesttest1:2:23
actgtgatcgatcgatcgatcgggtag
+ILLUMINAtesttest1:2:23
@!!!!(!!!!!!!????????BBBEEE
ADD REPLY
0
Entering edit mode

Why should it fail? Quality-Scores start with a '+' so ^@ will never match such lines.

ADD REPLY
0
Entering edit mode

I think you are right when the 3rd line is starting with + and the 4th with @ you should be able to detect that. But just counting by greplines starting with @ fails. The first position of the qualityscore LINE (not the HEADER) can be @ since its a valid quality score character. Then it fails and have to fallback on either previous lines to see context, within-block line number etc...grep (as well as many oher perl parsers preventing everything getting in memory) will not do multi line...

ADD REPLY
0
Entering edit mode

If you're using a perl script, why not have a regexp specifying that the header column must start with an @ or +, followed by less (or more) characters than you have in read length. Something like /^@d{1,read_length - 1}/ or /^@d{read_length + 1,}/. It may be likely that a quality line starts with an @ or +, but how likely is it to also be the exact same length as the header line?

ADD REPLY
0
Entering edit mode

Mitch I agree that is another good addition since readlength can be determined beforehand or in mixed files on basis of counts. Like the almost unique regexp I posted above. Anyway never 100% sure and the quick grep -c is still not possible. Furthermore, I can't oversee what happens in a wrapped sequence/quality score section...

ADD REPLY
0
Entering edit mode

Anymore tools that have this behaviour?

I marked the samtools as the answer but would like to keep the question open for more suggestions and as a future reference...

ADD REPLY
0
Entering edit mode

Are you asking because you're writing some sort of generic fastq parser? Then my suggestion would be to write a specific parser instead and convert your data into a sane format (BAM?) exactly once. BAM is much more versatile and very grepable (after on-the-fly conversion to SAM).

ADD REPLY
0
Entering edit mode

@Marvin. No and yes. Actually I was forking/paralleling some tools which need large fastqs to be split. And of course multi line fastqs make it troublesome. Anyway, I now make it just a prerequisite that its not multilined. Or investigate the seqtk stuff.

ADD REPLY
0
Entering edit mode

The question/problem has dissolved over time as of Ilumina data casave 1.8+ is in Sanger scoring (no grooming necessary) BUT MORE IMPORTANTLY the headers have changed! They now contain a SPACE which makes them UNIQUE and easy to recognize with regular expressions starting lines ^@ or ^+ and a space in the line.

An additional method would have been the use of quality scores above what is possible. Previously the quality scores could reach up to the ASCII h (104) now in Sanger only till J (74) so another option for parsing. See the pasted table below from the wiki; This option is dangerous though since we cannot be sure a header WILL contain one of these characters... not by default.

Old:[?] [?][?][?][?][?][?][?][?][?]@HWUSI-EAS100R:6:73:941:1973#0/1

NEW:[?] [?][?][?][?][?][?][?][?][?]@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG [?] [?]

Quality scoring ref table: [?] SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS..................................................... ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...................... ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII...................... .................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ...................... LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL.................................................... !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~ | | | | | | 33 59 64 73 104 126

S - Sanger Phred+33, raw reads typically (0, 40) X - Solexa Solexa+64, raw reads typically (-5, 40) I - Illumina 1.3+ Phred+64, raw reads typically (0, 40) J - Illumina 1.5+ Phred+64, raw reads typically (3, 40) with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator (bold) (Note: See discussion above). L - Illumina 1.8+ Phred+33, raw reads typically (0, 41) [?]

[?]

See the updated wiki http://en.wikipedia.org/wiki/FASTQ_format

ADD REPLY
8
Entering edit mode
13.2 years ago

SAMtools generates multi-line fastq when generating a consensus sequence, eg.

samtools pileup -cf ref.fa aln.bam | samtools.pl pileup2fq -D100 > cns.fastq

This is the basis of a question I asked a while ago.

ADD COMMENT
1
Entering edit mode

The could be an enpoint for some uses, but we use them transiently to get fasta for downstream analysis. you can use lh3's seqtk to process multi-line fastq. See his answer here.

ADD REPLY
0
Entering edit mode

So how does the postprocessing deal with this multi-line samtools output?

ADD REPLY
0
Entering edit mode

Sorry, I don't fully understand what you are asking?

ADD REPLY
0
Entering edit mode

@Casey. Sorry for not being clear; I mean the samtools pileup generates these multi-line fastq files. These fastq files aren't the endpoint right? So which tools are used subsequently since they seem to be able to deal with these wrapped fastq correctly!?

ADD REPLY
0
Entering edit mode

@Casey. Thats useful. Thanks!

ADD REPLY

Login before adding your answer.

Traffic: 1697 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6