Reasonable Assumptions About Fastq File Integrity
10.8 years ago

Can I assume that the sequence and quality strings in a FASTQ file will all be the same length, not only within a read but across every read in the file?

For example, here are a few reads from a sample file:

@IRIS:7:1:17:394#0/1
GTCAGGACAAGAAAGACAANTCCAATTNACATTATG
+IRIS:7:1:17:394#0/1
aaabaa]baaaaa_aab]D^^baYDW]abaa^
@IRIS:7:1:17:800#0/1
GGAAACACTACTTAGGCTTATAAGATCNGGTTGCGG
+IRIS:7:1:17:800#0/1
ababbaaabaaaaa]ba]aaaaYD\\_aXT
@IRIS:7:1:17:1757#0/1
TTTTCTCGACGATTTCCACTCCTGGTCNACGAATCC
+IRIS:7:1:17:1757#0/1
aaaaaaaaaaaaa_^a]][Z[DY^XYV^_Y
...


Can I assume the file (or read) is bad if a read has a shorter sequence and/or quality string than the others, e.g. the second read in this example:

@IRIS:7:1:17:394#0/1
GTCAGGACAAGAAAGACAANTCCAATTNACATTATG
+IRIS:7:1:17:394#0/1
aaabaa]baaaaa_aab]D^^baYDW]abaa^
@IRIS:7:1:17:800#0/1
GGAAACACTACTTAGGCTTATA
+IRIS:7:1:17:800#0/1
ababbaaabaaaaa]ba]
@IRIS:7:1:17:1757#0/1
TTTTCTCGACGATTTCCACTCCTGGTCNACGAATCC
+IRIS:7:1:17:1757#0/1
aaaaaaaaaaaaa_^a]][Z[DY^XYV^_Y
...


Or can a FASTQ file deliberately contain reads (and quality strings) of variable lengths?

10.8 years ago

The FASTQ standard requires that, for any record, the length of the sequence line (line 2) must match the length of the quality line (line 4).

While instruments usually produce identical sequence lengths for all records, this cannot be assumed for all FASTQ files. For example, quality trimming may have been applied, which can chop bases off the beginning or end of sequences.
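
If it helps, here is a rough sketch of how both conditions could be checked in one pass. It assumes plain four-line records (no wrapped sequence or quality lines), and `check_fastq` is just a name made up for this illustration, not part of any library:

```python
from collections import Counter
from itertools import islice


def check_fastq(lines):
    """Validate four-line FASTQ records and tally read lengths.

    Assumes each record is exactly four lines (no line wrapping).
    Returns (length_counts, bad_records), where bad_records holds the
    1-based numbers of records whose sequence and quality lengths differ.
    """
    length_counts = Counter()
    bad_records = []
    it = iter(lines)
    record_no = 0
    while True:
        # Pull the next four lines and strip line endings.
        record = [line.rstrip("\r\n") for line in islice(it, 4)]
        if not record:
            break
        record_no += 1
        if len(record) < 4:
            raise ValueError(f"record {record_no} is truncated")
        header, seq, plus, qual = record
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"record {record_no} is malformed")
        # The per-record rule: sequence and quality must match in length.
        if len(seq) != len(qual):
            bad_records.append(record_no)
        length_counts[len(seq)] += 1
    return length_counts, bad_records
```

Passing an open file handle (or any iterable of lines) gives back the two results. A non-empty `bad_records` points at genuinely broken records, whereas more than one key in `length_counts` only tells you the reads vary in length, which is legal.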


For example, Ion Torrent produces FASTQ files with reads of variable length.


Darn. I knew that the sequence and quality strings need to be of identical length, but I was hoping I could get away with assuming reads of the same length across the entire file. Thanks to you both for your answers.