How Common Are Multi-Line Fastq Files?
1
4
Entering edit mode
10.2 years ago
ALchEmiXt ★ 1.9k

The general format of a fastq file is (sequence block):

(next sequence block)

Since the + and the @ can also be used as quality scores they are not great regexp grep patterns ^+ or ^@. As well as that the headers seem to be used freely as well. Most files seem to stick to the 4 line (defined by \n) structure though and can be easily parsed or split. In the past I used to use a perl regexp patterns that turned out to be quite unique for the header but that is not failsafe:

$count++ if /^@[A-Za-z0-9_.:#\-\/]+\n/; # the hash is a valid char!  However the fastq format is quite loose defined also regarding the different used quality scoring systems :(. Even the "official wiki" tells that wrapped/MULTIPLE line fastq formats are valid. My question: how common are these wrapped/multi-line files and which tools would generally generate such output? Or is it safe to assume for a new tool that it sticks to the 4 line convention? fastq parsing • 5.5k views ADD COMMENT 2 Entering edit mode We have had this discussion earlier: Split Fastq Files Into Chunks Of 1M Reads , but after that I don't dare to give you an answer any more ;) In short, both viewpoints are correct: given that 'official' definition, multiline fastq is valid. on the other hand, nobody has ever shown me a fastq file like this (of course I can make one). But how authoritative is the 'official' 'standard' anyway (this is not XML), given there are at least 3 standards for encoding the qualities. thus reading fastq is never safe... ADD REPLY 1 Entering edit mode And your question is maybe related to parsing fastq, so look here. ADD REPLY 0 Entering edit mode Your expression should be okay. If ^@ matches it should never be in a quality line. ADD REPLY 0 Entering edit mode Have looked over the earlier examples. Missed the one on sam tools though..... The '@' is a quality score so just using ^Q in this example will fail (or not?): @ILLUMINAtesttest1:2:23 actgtgatcgatcgatcgatcgggtag +ILLUMINAtesttest1:2:23 @!!!!(!!!!!!!????????BBBEEE  ADD REPLY 0 Entering edit mode Why should it fail? Quality-Scores start with a '+' so ^@ will never match such lines. ADD REPLY 0 Entering edit mode I think you are right when the 3rd line is starting with + and the 4th with @ you should be able to detect that. But just counting by greplines starting with @ fails. The first position of the qualityscore LINE (not the HEADER) can be @ since its a valid quality score character. Then it fails and have to fallback on either previous lines to see context, within-block line number etc...grep (as well as many oher perl parsers preventing everything getting in memory) will not do multi line... ADD REPLY 0 Entering edit mode If you're using a perl script, why not have a regexp specifying that the header column must start with an @ or +, followed by less (or more) characters than you have in read length. Something like /^@d{1,read_length - 1}/ or /^@d{read_length + 1,}/. It may be likely that a quality line starts with an @ or +, but how likely is it to also be the exact same length as the header line? ADD REPLY 0 Entering edit mode Mitch I agree that is another good addition since readlength can be determined beforehand or in mixed files on basis of counts. Like the almost unique regexp I posted above. Anyway never 100% sure and the quick grep -c is still not possible. Furthermore, I can't oversee what happens in a wrapped sequence/quality score section... ADD REPLY 0 Entering edit mode Anymore tools that have this behaviour? I marked the samtools as the answer but would like to keep the question open for more suggestions and as a future reference... ADD REPLY 0 Entering edit mode Are you asking because you're writing some sort of generic fastq parser? Then my suggestion would be to write a specific parser instead and convert your data into a sane format (BAM?) exactly once. BAM is much more versatile and very grepable (after on-the-fly conversion to SAM). ADD REPLY 0 Entering edit mode @Marvin. No and yes. Actually I was forking/paralleling some tools which need large fastqs to be split. And of course multi line fastqs make it troublesome. Anyway, I now make it just a prerequisite that its not multilined. Or investigate the seqtk stuff. ADD REPLY 0 Entering edit mode The question/problem has dissolved over time as of Ilumina data casave 1.8+ is in Sanger scoring (no grooming necessary) BUT MORE IMPORTANTLY the headers have changed! They now contain a SPACE which makes them UNIQUE and easy to recognize with regular expressions starting lines ^@ or ^+ and a space in the line. An additional method would have been the use of quality scores above what is possible. Previously the quality scores could reach up to the ASCII h (104) now in Sanger only till J (74) so another option for parsing. See the pasted table below from the wiki; This option is dangerous though since we cannot be sure a header WILL contain one of these characters... not by default. Old:[?] [?][?][?][?][?][?][?][?][?]@HWUSI-EAS100R:6:73:941:1973#0/1 NEW:[?] [?][?][?][?][?][?][?][?][?]@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG [?] [?] Quality scoring ref table: [?] SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS..................................................... ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...................... ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII...................... .................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ...................... LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL.................................................... !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_abcdefghijklmnopqrstuvwxyz{|}~ | | | | | | 33 59 64 73 104 126

S - Sanger Phred+33, raw reads typically (0, 40) X - Solexa Solexa+64, raw reads typically (-5, 40) I - Illumina 1.3+ Phred+64, raw reads typically (0, 40) J - Illumina 1.5+ Phred+64, raw reads typically (3, 40) with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator (bold) (Note: See discussion above). L - Illumina 1.8+ Phred+33, raw reads typically (0, 41) [?]

[?]

See the updated wiki http://en.wikipedia.org/wiki/FASTQ_format

8
Entering edit mode
10.2 years ago

SAMtools generates multi-line fastq when generating a consensus sequence, eg.

samtools pileup -cf ref.fa aln.bam | samtools.pl pileup2fq -D100 > cns.fastq
`

This is the basis of a question I asked a while ago.

1
Entering edit mode

The could be an enpoint for some uses, but we use them transiently to get fasta for downstream analysis. you can use lh3's seqtk to process multi-line fastq. See his answer here.

0
Entering edit mode

So how does the postprocessing deal with this multi-line samtools output?

0
Entering edit mode

Sorry, I don't fully understand what you are asking?

0
Entering edit mode

@Casey. Sorry for not being clear; I mean the samtools pileup generates these multi-line fastq files. These fastq files aren't the endpoint right? So which tools are used subsequently since they seem to be able to deal with these wrapped fastq correctly!?

0
Entering edit mode

@Casey. Thats useful. Thanks!