Determine where an interleaved FASTQ record starts
11 weeks ago
ole.tange ★ 4.1k

FASTQ files have a record length of 4 lines. But even in the middle of a file you can determine where a record starts by looking for '@' and checking the lines around it (see https://stackoverflow.com/a/41707920/363028).
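To make the '@' trick concrete, here is a minimal Python sketch. It assumes strict four-line records (no wrapped sequence or quality lines), and `find_record_start` is a hypothetical helper name, not taken from the linked answer. A quality line may itself start with '@', so the check also requires a '+' separator two lines further down.

```python
def find_record_start(lines):
    """Return the index of the first line that begins a FASTQ record, or None.

    Assumption: every record is exactly four lines (header, sequence,
    '+' separator, quality), so nothing is wrapped across lines.
    """
    for i in range(len(lines) - 2):
        # A header starts with '@' and its '+' separator sits two lines below.
        # A quality line can also start with '@', but two lines below it is a
        # sequence line, which never starts with '+'.
        if lines[i].startswith("@") and lines[i + 2].startswith("+"):
            return i
    return None


# A chunk that begins in the middle of a record, on a quality line starting with '@'.
chunk = [
    "@@FF<EE<",
    "@read1 1:N:0:28",
    "ACGTACGT",
    "+",
    "#8BCCGGG",
]
start = find_record_start(chunk)
```

In this chunk the first line is skipped because no '+' follows two lines later, so `start` points at the real header on the second line.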

Can we do something similar with interleaved FASTQ-files?

Based on https://stackoverflow.com/a/68707816/363028: is there something that tells us where an interleaved FASTQ-record starts?

@M10991:61:000000000-A7EML:1:1101:14011:1001 1:N:0:28
NGCTCCTAGGTCGGCATGATGGGGGAAGGAGAGCATGGGAAGAAATGAGAGAGTAGCAA
+
#8BCCGGGGGFEFECFGGGGGGGGG@;FFGGGEG@FF<EE<@FFC,CEGCCGGFF<FGF
@M10991:61:000000000-A7EML:1:1101:14011:1001 2:N:0:28
NGCTCCTAGGTCGGCATGACGCTAGCTACGATCGACTACGCTAGCATCGAGAGTAGCAA
+
#8BCCGGGGGFEFECFGGGGGGGGG@;FFGGGEG@FF<EE<@FFC,CEGCCGGFF<FGF
@M10991:61:000000000-A7EML:1:1201:15411:3101 1:N:0:28
NGCTCCTAGGTCGGCATGATGGGGGAAGGAGAGCATGGGAAGAAATGAGAGAGTAGCAA
+
#8BCCGGGGGFEFECFGGGGGGGGG@;FFGGGEG@FF<EE<@FFC,CEGCCGGFF<FGF
@M10991:61:000000000-A7EML:1:1201:15411:3101 2:N:0:28
CGCTAGCTACGACTCGACGACAGCGAACACGCGATCGATCGGAAATGAGAGAGTAGCAA
+
#8BCCGGGGGFEFECFGGGGGGGGG@;FFGGGEG@FF<EE<@FFC,CEGCCGGFF<FGF


In the above example you can use the '@' trick combined with '.* 1:N' to determine that a sequence name belongs to an R1 read. But does this always work? And if not: is there something else that tells us whether a FASTQ record is R1 or R2?


That is correct for recent Illumina output, where R1 is denoted by 1 and R2 by 2. But you can also find Illumina read names for R1 and R2 like: @K00193:38:H3MYFBBXX:4:1101:10003:44458/1 (for R1) and @K00193:38:H3MYFBBXX:4:1101:10003:44458/2 (for R2). Some developers do not even mark the reads as 1 or 2. The logic then is that records come in blocks of 8 lines: within each 8-line block, the first 4 lines belong to R1 and the last 4 lines belong to R2.
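As a rough illustration of the two naming styles above, the following Python sketch tries both and falls back to comparing read names when neither marker is present. The helper names (`mate_number`, `read_name`, `starts_pair`) are hypothetical, the regular expressions cover only the two header styles shown in this thread, and the fallback assumes that both mates of a pair share one read name that no other pair uses.

```python
import re

# Newer Illumina style: "@name 1:N:0:28"; older style: "@name/1".
_NEW_STYLE = re.compile(r"^@\S+\s+([12]):[YN]:")
_OLD_STYLE = re.compile(r"^@\S+/([12])$")


def mate_number(header):
    """Return 1 or 2 if the header marks the mate, otherwise None."""
    header = header.rstrip()
    m = _NEW_STYLE.match(header) or _OLD_STYLE.match(header)
    return int(m.group(1)) if m else None


def read_name(header):
    """Return the read name without a trailing /1 or /2 and without the comment."""
    return re.sub(r"/[12]$", "", header.split()[0])


def starts_pair(header, next_header):
    """Guess whether `header` is the R1 of an interleaved pair.

    Uses the mate marker when there is one. Without a marker it assumes
    that R1 is directly followed by a record with the same read name.
    """
    mate = mate_number(header)
    if mate is not None:
        return mate == 1
    return read_name(header) == read_name(next_header)
```

For the example records in the question, `mate_number` gives 1 for the header ending in ` 1:N:0:28` and 2 for the one ending in ` 2:N:0:28`. The name comparison is only a heuristic: it cannot help if a file reuses read names across pairs.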

Please refer to the wiki entry on the FASTQ format definition for a general understanding.