I am confused by the structure of the reads in a FastQ file I downloaded via GEO. The data is stored under GEO accession GSE124872, sample GSM3557675. According to the SRA Run Selector, there is one FastQ file per sample, but the paper* and SRA both state that the data is paired-end, so I was expecting two FastQ files (r1 and r2).
When I downloaded and opened that FastQ file, the read headers caught my attention. The headers contain the sequence length, and that length is approximately 200 bp ("@SRR8426358.1 1 length=202"). This suggests to me that r1 and r2 are concatenated. However, I couldn't find any source to confirm this. Additionally, I don't see a specific stretch of nucleotides between the reads; I would expect a fixed sequence between r1 and r2. For clarity, I added one read to this post:
@SRR8426358.1 1 length=202
+SRR8426358.1 1 length=202
Do you think it is safe to just split every read in two, so that r1 contains the first 100 bases and r2 the remaining bases? I tried this on a subsample of the FastQ file and quantified r1 and r2 separately with salmon. r1 had a very low mapping rate (1%), while r2 had a normal mapping rate (68%) when mapping to the mouse genome. This seems to support the idea that r1 contains the barcode and UMI, while r2 contains the actual sequences.
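To make the question concrete, here is a minimal Python sketch of the kind of split I mean. It rests on assumptions I have not been able to verify: that each record is simply r1 followed directly by r2 with no spacer, that the file is plain four-line FastQ, and that the split point is a fixed position. The function name `split_concatenated_fastq`, the `r1_len` parameter, and the toy record are made up for illustration and are not taken from SRR8426358.

```python
import io
from typing import Iterator, TextIO, Tuple


def split_concatenated_fastq(handle: TextIO, r1_len: int) -> Iterator[Tuple[str, str]]:
    """Yield (r1, r2) FastQ records from four-line records in `handle`.

    Assumption: every record is r1 immediately followed by r2, and the
    first `r1_len` bases (and quality characters) belong to r1.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file (or an empty line)
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, content not needed
        qual = handle.readline().rstrip("\n")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        name = header.split()[0]  # e.g. '@SRR8426358.1'
        r1 = f"{name}/1\n{seq[:r1_len]}\n+\n{qual[:r1_len]}\n"
        r2 = f"{name}/2\n{seq[r1_len:]}\n+\n{qual[r1_len:]}\n"
        yield r1, r2


# Toy 12-base record (invented), split at position 6.
toy = io.StringIO("@toy.1 1 length=12\nACGTACGGTTCA\n+toy.1 1 length=12\nIIIIIIFFFFFF\n")
for r1, r2 in split_concatenated_fastq(toy, r1_len=6):
    print(r1, end="")
    print(r2, end="")
```

For the real file the split point would presumably be half of the reported length, but that is exactly the part I am unsure about.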
* "Single-cell libraries were sequenced in a 100 bp paired-end run on the Illumina HiSeq4000 using 0.2 nM denatured sample and 5% PhiX spike-in." ("An atlas of the aging lung mapped by single cell transcriptomics and deep tissue proteomics", Angelidis et al., 2019)
Thanks in advance. I hope someone can offer some insight into this data.