The general format of a fastq file is (sequence block):
@SEQHEADERandmorestuff AGCTGTGTATGTGTGTGSTGTSGTTACGTGTATCGATCGCTGCTA... +SEQHEADERandmorestuff <--optional header but required + 00300QUALITYSCORESINILLUMINAORSANGERFORMAT.....
(next sequence block)
Since the + and the @ can also be used as quality scores they are not great regexp grep patterns ^+ or ^@. As well as that the headers seem to be used freely as well. Most files seem to stick to the 4 line (defined by \n) structure though and can be easily parsed or split. In the past I used to use a perl regexp patterns that turned out to be quite unique for the header but that is not failsafe:
$count++ if /^@[A-Za-z0-9_.:#\-\/]+\n/; # the hash is a valid char!
However the fastq format is quite loose defined also regarding the different used quality scoring systems :(. Even the "official wiki" tells that wrapped/MULTIPLE line fastq formats are valid.
My question: how common are these wrapped/multi-line files and which tools would generally generate such output? Or is it safe to assume for a new tool that it sticks to the 4 line convention?
We have had this discussion earlier: Split Fastq Files Into Chunks Of 1M Reads , but after that I don't dare to give you an answer any more ;)
In short, both viewpoints are correct: given that 'official' definition, multiline fastq is valid. on the other hand, nobody has ever shown me a fastq file like this (of course I can make one). But how authoritative is the 'official' 'standard' anyway (this is not XML), given there are at least 3 standards for encoding the qualities. thus reading fastq is never safe...
And your question is maybe related to parsing fastq, so look here.
Your expression should be okay. If
^@
matches it should never be in a quality line.Have looked over the earlier examples. Missed the one on sam tools though..... The '@' is a quality score so just using ^Q in this example will fail (or not?):
Why should it fail? Quality-Scores start with a '+' so
^@
will never match such lines.I think you are right when the 3rd line is starting with
+
and the 4th with@
you should be able to detect that. But just counting by greplines starting with@
fails. The first position of the qualityscore LINE (not the HEADER) can be@
since its a valid quality score character. Then it fails and have to fallback on either previous lines to see context, within-block line number etc...grep (as well as many oher perl parsers preventing everything getting in memory) will not do multi line...If you're using a perl script, why not have a regexp specifying that the header column must start with an
@
or+
, followed by less (or more) characters than you have in read length. Something like/^@d{1,read_length - 1}/
or/^@d{read_length + 1,}/
. It may be likely that a quality line starts with an@
or+
, but how likely is it to also be the exact same length as the header line?Mitch I agree that is another good addition since readlength can be determined beforehand or in mixed files on basis of counts. Like the almost unique regexp I posted above. Anyway never 100% sure and the quick
grep -c
is still not possible. Furthermore, I can't oversee what happens in a wrapped sequence/quality score section...Anymore tools that have this behaviour?
I marked the samtools as the answer but would like to keep the question open for more suggestions and as a future reference...
Are you asking because you're writing some sort of generic fastq parser? Then my suggestion would be to write a specific parser instead and convert your data into a sane format (BAM?) exactly once. BAM is much more versatile and very grepable (after on-the-fly conversion to SAM).
@Marvin. No and yes. Actually I was forking/paralleling some tools which need large fastqs to be split. And of course multi line fastqs make it troublesome. Anyway, I now make it just a prerequisite that its not multilined. Or investigate the seqtk stuff.
The question/problem has dissolved over time as of Ilumina data casave 1.8+ is in Sanger scoring (no grooming necessary) BUT MORE IMPORTANTLY the headers have changed! They now contain a SPACE which makes them UNIQUE and easy to recognize with regular expressions starting lines ^@ or ^+ and a space in the line.
An additional method would have been the use of quality scores above what is possible. Previously the quality scores could reach up to the ASCII h (104) now in Sanger only till J (74) so another option for parsing. See the pasted table below from the wiki; This option is dangerous though since we cannot be sure a header WILL contain one of these characters... not by default.
Old:[?] [?][?][?][?][?][?][?][?][?]@HWUSI-EAS100R:6:73:941:1973#0/1
NEW:[?] [?][?][?][?][?][?][?][?][?]@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG [?] [?]
Quality scoring ref table: [?] SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS..................................................... ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...................... ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII...................... .................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ...................... LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL.................................................... !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~ | | | | | | 33 59 64 73 104 126
S - Sanger Phred+33, raw reads typically (0, 40) X - Solexa Solexa+64, raw reads typically (-5, 40) I - Illumina 1.3+ Phred+64, raw reads typically (0, 40) J - Illumina 1.5+ Phred+64, raw reads typically (3, 40) with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator (bold) (Note: See discussion above). L - Illumina 1.8+ Phred+33, raw reads typically (0, 41) [?]
[?]
See the updated wiki http://en.wikipedia.org/wiki/FASTQ_format