The general format of a fastq file is (sequence block):
@SEQHEADERandmorestuff
AGCTGTGTATGTGTGTGSTGTSGTTACGTGTATCGATCGCTGCTA...
+SEQHEADERandmorestuff   <-- repeated header is optional, but the + is required
00300QUALITYSCORESINILLUMINAORSANGERFORMAT.....
(next sequence block)
Since + and @ are also valid quality score characters, a quality line can start with either of them, so ^\+ and ^@ are not reliable regexp/grep patterns for finding record boundaries. On top of that, the header content seems to be used quite freely. Most files do seem to stick to the 4 line (delimited by \n) structure though, and can then be easily parsed or split. In the past I used a perl regexp pattern that turned out to be fairly specific to the header lines, but it is not failsafe:
$count++ if /^@[A-Za-z0-9_.:#\-\/]+\n/; # the hash is a valid char!
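To make the ambiguity concrete, here is a minimal Python sketch (not the perl approach above) that counts records in two ways: by lines starting with @, and by strict 4 line blocks. The function names and the example reads are made up for illustration, and the second function simply assumes the file is not wrapped.

```python
import io

def count_by_at_prefix(handle):
    """Naive count: every line starting with '@' is treated as a header."""
    return sum(1 for line in handle if line.startswith("@"))

def count_by_four_lines(handle):
    """Count records assuming strict 4 line blocks (no wrapping)."""
    count = 0
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        seq = handle.readline().rstrip("\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\n")
        # Sanity checks on the markers and on sequence/quality lengths.
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"record {count + 1} is not a 4 line block")
        if len(seq) != len(qual):
            raise ValueError(f"record {count + 1}: sequence/quality length mismatch")
        count += 1
    return count

# Made-up data: the quality line of the second record starts with '@'.
EXAMPLE = (
    "@read1\n"
    "ACGT\n"
    "+\n"
    "IIII\n"
    "@read2\n"
    "GGCA\n"
    "+read2\n"
    "@+II\n"
)

print(count_by_at_prefix(io.StringIO(EXAMPLE)))   # 3 (one too many)
print(count_by_four_lines(io.StringIO(EXAMPLE)))  # 2
```

The block-based count is only as good as the 4 line assumption: on a wrapped file it would raise an error or, worse, miscount silently if the lengths happen to line up.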
However, the fastq format is quite loosely defined, also with regard to the different quality scoring systems in use :(. Even the "official wiki" says that wrapped/MULTIPLE line fastq files are valid.
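For reference, wrapped files can still be read without relying on line counts. A rough Python sketch, under the assumption that the sequence itself never starts a line with +: collect sequence lines until the + line, then collect quality lines until as many characters as the sequence have been read. The name read_fastq and the example reads are hypothetical.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality); sequence and quality may be wrapped."""
    line = handle.readline()
    while line:
        if not line.strip():
            line = handle.readline()  # skip blank lines between records
            continue
        if not line.startswith("@"):
            raise ValueError(f"expected '@' header, got: {line!r}")
        name = line[1:].rstrip("\r\n")
        # Sequence lines run until the '+' separator line.
        seq_parts = []
        line = handle.readline()
        while line and not line.startswith("+"):
            seq_parts.append(line.strip())
            line = handle.readline()
        if not line:
            raise ValueError(f"record {name!r} has no '+' line")
        seq = "".join(seq_parts)
        # Quality lines run until their total length matches the sequence,
        # so a quality line starting with '@' or '+' is not misread.
        qual_parts = []
        qual_len = 0
        while qual_len < len(seq):
            line = handle.readline()
            if not line:
                raise ValueError(f"record {name!r}: quality is truncated")
            chunk = line.rstrip("\r\n")
            qual_parts.append(chunk)
            qual_len += len(chunk)
        qual = "".join(qual_parts)
        if len(qual) != len(seq):
            raise ValueError(f"record {name!r}: quality is longer than sequence")
        yield name, seq, qual
        line = handle.readline()

# Made-up data: first record is wrapped, and quality lines start with '@' and '+'.
WRAPPED = (
    "@read1 wrapped\n"
    "ACGTAC\n"
    "GTAC\n"
    "+\n"
    "IIIIII\n"
    "@III\n"
    "@read2\n"
    "GGCA\n"
    "+read2\n"
    "+@II\n"
)

for record in read_fastq(io.StringIO(WRAPPED)):
    print(record)
```

This handles both the 4 line and the wrapped layout, at the cost of being slower and more code than a plain "every fourth line" split.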
My question: how common are these wrapped/multi-line files, and which tools typically generate such output? Or is it safe for a new tool to assume the 4 line convention?