Question

Dropseq, GSE124872 and structure of the reads in the FastQ files

2.4 years ago

tmms ▴ 10

Hello

I am confused by the structure of the reads in a FastQ file I downloaded via GEO. The data are stored under GEO series GSE124872, sample GSM3557675. According to the SRA Run Selector, there is one FastQ file per sample, but the paper* and SRA state that the data are paired-end, so I was expecting two FastQ files (r1 and r2).

When I downloaded and opened the FastQ file, the read headers caught my attention. The headers contain the sequence length, which is approximately 200 bp ("@SRR8426358.1 1 length=202"). This suggests to me that r1 and r2 are concatenated. However, I couldn't find any source to confirm this. Additionally, I don't see a specific stretch of nucleotides between the reads; I would expect a fixed sequence between r1 and r2. For clarity, I have added one read to this post:

@SRR8426358.1 1 length=202
ATCAATGATCGGTCGTGACTTTTTTTTTTTTTTTTTTTTTTTTTAGTGAAATAAATTCTTTNTTTTTGTTAGAAGACTGATTTTTAAATGTCTTTATCATTGCAAGAAAGTGATAACTGCCTTTAACGATGGACTGAATCACTTGGNAAGCNTCAAGGGCACCTTTGCCAGCCTCAGTGAGCTCCACTGTGACAAGCTGCAT
+SRR8426358.1 1 length=202
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJF--A<F--<----7F--A<J#F-AJA-7-7-<-A-----7<JF---<<-<7FFF-<-7-F<-<A-FJJF<-7<FJ7--77-<-AJJ-A77--7FAJJ7-A7-FJ-#-7--#7-7F7JFFFA-<JFJ-F7-AJ---AAAJ<FF7-7-7FAFJF7--7FJF-A

Do you think it is safe to just split every read in two, so that r1 contains the first 100 bases and r2 the remaining bases? I tried this on a subsample of the FastQ file and quantified r1 and r2 with salmon. r1 had a very low mapping rate (1%), while r2 had a normal mapping rate (68%) when mapped against the mouse genome. This seems to support the idea that r1 contains the barcode and UMI, while r2 contains the actual sequences.
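A minimal sketch of that split, assuming both mates have the same length so that each sequence and quality line can be cut at its midpoint (101 bases for a 202-base line). The function name split_concatenated_fastq and the output file names are made up for illustration, and the input is assumed to be an uncompressed FastQ with exactly four lines per record:

```shell
# split_concatenated_fastq IN OUT_R1 OUT_R2
# Hypothetical helper: cuts every sequence and quality line at its midpoint.
# Assumes uncompressed FastQ, four lines per record, mates of equal length.
split_concatenated_fastq() {
  awk -v r1="$2" -v r2="$3" '
    {
      pos = NR % 4                      # 1 = header, 2 = sequence, 3 = "+", 0 = quality
      if (pos == 1 || pos == 3) {
        # header and separator lines are copied unchanged to both files
        print $0 > r1
        print $0 > r2
      } else {
        # sequence and quality lines: first half to r1, second half to r2
        half = int(length($0) / 2)
        print substr($0, 1, half) > r1
        print substr($0, half + 1) > r2
      }
    }
  ' "$1"
}
```

Usage would be along the lines of `split_concatenated_fastq SRR8426358.fastq r1.fastq r2.fastq`. The headers are copied as they are, so the length= field no longer matches the halves.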

* "Single-cell libraries were sequenced in a 100 bp paired-end run on the Illumina HiSeq4000 using 0.2 nM denatured sample and 5% PhiX spike-in." ("An atlas of the aging lung mapped by single cell transcriptomics and deep tissue proteomics", Angelidis et al., 2019)

Thanks in advance and hoping someone can offer some insight into this data.

scRNA-seq Dropseq • 1.4k views

updated 2.4 years ago by ATpoint 81k • written 2.4 years ago by tmms ▴ 10

Based on that single read alone, I'd be inclined to think that the first 20 bases are the cell barcode and UMI, and the stuff after the T's is the real RNA sequence.
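To check this across more reads, here is a rough sketch that cuts out the first 20 bases. It assumes the usual Drop-seq layout of a 12-base cell barcode followed by an 8-base UMI; that layout is an assumption and is not confirmed for this dataset. barcode_umi is a hypothetical helper name:

```shell
# Assumed Drop-seq layout: bases 1-12 = cell barcode, bases 13-20 = UMI.
CB_LEN=12
UMI_LEN=8

# barcode_umi FASTQ
# Prints "cell_barcode<TAB>UMI" for every read of an uncompressed FastQ file.
barcode_umi() {
  awk -v cb="$CB_LEN" -v umi="$UMI_LEN" '
    NR % 4 == 2 {                       # sequence lines only
      print substr($0, 1, cb) "\t" substr($0, cb + 1, umi)
    }
  ' "$1"
}
```

For the read posted above this would give ATCAATGATCGG as the barcode and TCGTGACT as the UMI, immediately followed by the poly-T stretch.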

2.4 years ago by swbarnes2 14k

score 3 · Accepted Answer · 2021-12-04

I think the 202 bp read is the result of not downloading the data properly. As you can see here https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR8426358 the run is indeed paired-end, with R1 and R2 stored separately. My assumption is that you ran fastq-dump without the --split-files flag? Without that flag, fastq-dump writes both mates of each spot as one concatenated read, which is why there is no fixed sequence between them. Adding the flag will correctly output two files, one for R1 and one for R2.

Yes, UMI/CB is probably R1 and cDNA is R2; at least this is how it goes with 10X Chromium data. I recommend Alevin (https://salmon.readthedocs.io/en/latest/alevin.html) for quantifying such data, as it has a dedicated --dropseq flag that automatically parses the relevant CB/UMI and cDNA from the reads. It is basically the single-cell module that builds on the Salmon selective alignment procedure, see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5600148/
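The two steps (re-downloading with separate mates, then quantifying with Alevin) might look roughly like this. The function names, the thread count, and the index and transcript-to-gene map arguments are placeholders and not taken from the dataset; you would have to build a salmon index and a transcript-to-gene map for mouse yourself:

```shell
# fetch_split_reads RUN_ACCESSION [OUTDIR]
# Downloads a run with the SRA Toolkit and writes one gzipped file per mate
# (<run>_1.fastq.gz and <run>_2.fastq.gz) into OUTDIR (default: current dir).
fetch_split_reads() {
  fastq-dump --split-files --gzip --outdir "${2:-.}" "$1"
}

# quantify_dropseq R1 R2 INDEX TX2GENE OUTDIR
# Runs Alevin in Drop-seq mode. INDEX (a salmon index) and TX2GENE
# (a transcript-to-gene map) are placeholders that have to be prepared first.
quantify_dropseq() {
  salmon alevin -l ISR -1 "$1" -2 "$2" --dropseq \
    -i "$3" --tgMap "$4" -p 8 -o "$5"
}
```

For example: `fetch_split_reads SRR8426358 fastq` followed by `quantify_dropseq fastq/SRR8426358_1.fastq.gz fastq/SRR8426358_2.fastq.gz <index> <tx2gene.tsv> alevin_out`.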

For downloading data, you can enter the accessions at sra-explorer.info to get direct fastq download links such as:

curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR842/008/SRR8426358/SRR8426358_1.fastq.gz -o SRR8426358_GSM3557675_old_Dropseq_1_Mus_musculus_RNA-Seq_1.fastq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR842/008/SRR8426358/SRR8426358_2.fastq.gz -o SRR8426358_GSM3557675_old_Dropseq_1_Mus_musculus_RNA-Seq_2.fastq.gz