Question: only one biological read present in fastq NCBI database for paired-end sequencing
5 weeks ago by
Matt10
Matt10 wrote:

Hi, I would like to extract the two biological reads of a paired-end single-cell RNA-seq run. For run SRR11772847 I tried the sra-toolkit command ./fastq-dump --skip-technical --split-3 SRR11772847. I expected two .fastq files, but I only get one, with reads of 98 bp (an extract is below). The file has 496352056 lines.

@SRR11772847.1.3 NB502129:188:HY73HBGX9:1:11101:13593:1050 length=98
NCGATCCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR11772847.1.3 NB502129:188:HY73HBGX9:1:11101:13593:1050 length=98
#AAAAEA###########################################################################################
@SRR11772847.2.3 NB502129:188:HY73HBGX9:1:11101:9270:1050 length=98
NCCATGTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR11772847.2.3 NB502129:188:HY73HBGX9:1:11101:9270:1050 length=98
#A/AA//###########################################################################################
@SRR11772847.3.3 NB502129:188:HY73HBGX9:1:11101:16784:1053 length=98
NAAAAGAATATCTGTCCTANNGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR11772847.3.3 NB502129:188:HY73HBGX9:1:11101:16784:1053 length=98
#A/AAEEEEEEAEEEEEAA##E############################################################################
@SRR11772847.4.3 NB502129:188:HY73HBGX9:1:11101:20118:1053 length=98
NAGGAGGATGAAGGCTTACNNGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR11772847.4.3 NB502129:188:HY73HBGX9:1:11101:20118:1053 length=98
#A6AAE/AE/E/E/EEA//##<############################################################################
@SRR11772847.5.3 NB502129:188:HY73HBGX9:1:11101:13559:1054 length=98
NTTTTAGTTGGTCTTCATCTNTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR11772847.5.3 NB502129:188:HY73HBGX9:1:11101:13559:1054 length=98
#AAAA/<E//EEEE6A/AE/#<############################################################################

I'm quite a beginner. Am I missing something, or is the data for this run incomplete? I have a similar problem with run SRR7049900.

I also tried ./fastq-dump -I --split-files SRR11772847, but I get 3 fastq files with reads of 8 bp, 26 bp and 98 bp. I expected another fastq file with 98 bp reads (read 2 of the paired-end sequencing), so I don't understand.

Thank you in advance for your help,

Matt

rna-seq • 183 views

5 weeks ago by
ATpoint46k
ATpoint46k wrote:

The output of --split-files is correct, see https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR11772847

This is 10X scRNA-seq. The 8 bp file is the index read, the 26 bp file is the cell barcode + UMI read, and the 98 bp file is the cDNA. The authors did not upload 98 bp for R1 because everything beyond the first 26 bp is meaningless in this kind of assay. Technically, this kind of assay is basically single-end: by design you get only one "biological" read for the cDNA, which is R2. The three files you obtained are the required input for CellRanger, the standard processing tool. Alternatives are STARsolo and lightweight quantifiers such as the alevin module of salmon.
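
To make the roles of the three files concrete, here is a minimal Python sketch. It assumes 10x v2 chemistry, where the 26 bp read holds a 16 bp cell barcode followed by a 10 bp UMI, and it uses the read lengths reported for this particular run (8, 26, 98) to label a file. It also assumes plain 4-line FASTQ records, as written by fastq-dump. The function names and the example sequence are hypothetical, made up for illustration; none of this is part of CellRanger or the sra-toolkit.

```python
import io
from collections import Counter

# Read lengths seen for SRR11772847 with --split-files (taken from this thread).
# Other runs or chemistries will have different lengths.
ROLE_BY_LENGTH = {
    8: "I1 (sample index)",
    26: "R1 (cell barcode + UMI)",
    98: "R2 (cDNA)",
}

# Assumption: 10x v2 layout of R1, a 16 bp cell barcode followed by a 10 bp UMI.
BARCODE_LEN = 16
UMI_LEN = 10


def read_lengths(handle, max_records=1000):
    """Count sequence lengths in the first max_records 4-line FASTQ records."""
    counts = Counter()
    for i, line in enumerate(handle):
        if i // 4 >= max_records:
            break
        if i % 4 == 1:  # the second line of each record is the sequence
            counts[len(line.strip())] += 1
    return counts


def guess_role(handle):
    """Label a FASTQ stream by its most common read length."""
    counts = read_lengths(handle)
    if not counts:
        return "empty"
    length = counts.most_common(1)[0][0]
    return ROLE_BY_LENGTH.get(length, f"unknown ({length} bp)")


def split_r1(seq):
    """Split a 26 bp v2 R1 read into (cell barcode, UMI)."""
    if len(seq) != BARCODE_LEN + UMI_LEN:
        raise ValueError(f"expected {BARCODE_LEN + UMI_LEN} bp, got {len(seq)}")
    return seq[:BARCODE_LEN], seq[BARCODE_LEN:]


# Made-up 26 bp record, for illustration only.
example = io.StringIO("@read1\nAAACCTGAGAAACCATTTTGGGCCCA\n+\n" + "E" * 26 + "\n")
print(guess_role(example))
print(split_r1("AAACCTGAGAAACCATTTTGGGCCCA"))
```

For real files you would pass an open file handle instead of the in-memory example, e.g. open(path) for plain FASTQ or gzip.open(path, "rt") for compressed files.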

See this scheme of a 10X v2 library:

[Figure: 10X v2 library scheme]

If you need further clarification please feel free to comment.


Thank you very much for this answer, it is very clear and very complete. I successfully used CellRanger with it.

written 5 weeks ago by Matt10

I don't want to take up too much of your time, but consider this run, https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR7049900, which was created with a 10x v2 library. In the NCBI database there is only one read of 110 bp, whereas the related paper states "Read 1 -26 cycles, i7 index-8cycles, i5 index : 0 cycles, Read 2 : 110 cycles".

Given your previous explanation and the paper, do you think data is missing from the NCBI record? Without the barcodes/UMIs we can't analyze this run with CellRanger.

written 29 days ago by Matt10

10x data in SRA is hit and miss. There is no standard protocol that submitters and/or SRA seem to follow. Your best bet is to look under the Data Access tab of the link you posted above and check whether the Original format section has BAM files available. In this case it looks like there is one. You can use the bamtofastq utility (LINK) provided by 10x to recreate the reads from this BAM.
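
As far as I know, this works because BAM files written by CellRanger keep the raw cell barcode and UMI of each read in the optional tags CR and UR (with their qualities in CY and UY), so the barcode read can be rebuilt from the alignment records. Here is a minimal Python sketch of that idea, working on a single SAM-formatted text line. The record is made up and the helper names are hypothetical; this is not how bamtofastq is implemented, only an illustration of where the information lives.

```python
def parse_tags(sam_line):
    """Return the optional tags of one SAM text line as a dict."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[11:]:  # the first 11 SAM columns are mandatory
        name, _type, value = field.split(":", 2)
        tags[name] = value
    return tags


def rebuild_barcode_read(sam_line):
    """Rebuild (sequence, quality) of the barcode + UMI read from CR/UR and CY/UY."""
    tags = parse_tags(sam_line)
    missing = [t for t in ("CR", "CY", "UR", "UY") if t not in tags]
    if missing:
        raise KeyError(f"record lacks tags: {', '.join(missing)}")
    return tags["CR"] + tags["UR"], tags["CY"] + tags["UY"]


# Made-up unaligned record, for illustration only.
record = "\t".join([
    "read1", "4", "*", "0", "0", "*", "*", "0", "0",
    "ACGTACGTAC", "EEEEEEEEEE",
    "CR:Z:AAACCTGAGAAACCAT", "CY:Z:" + "A" * 16,
    "UR:Z:TTTGGGCCCA", "UY:Z:" + "E" * 10,
])
seq, qual = rebuild_barcode_read(record)
print(seq, len(seq))
```

On a real file you would iterate over the records with a BAM reader rather than parse text by hand, and bamtofastq handles the whole job, including the cDNA read. If the submitted BAM lacks these tags, the barcodes cannot be recovered this way.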

written 29 days ago by GenoMax96k

Thank you for the trick, it seems to work! It is quite strange that the submitted data is not verified during the publication process ^^

written 24 days ago by Matt10