Question: fastq-dump returns I1 and R1 files instead of R1 and R2
requiem_data wrote (12 weeks ago):

I have downloaded the data corresponding to SRR8712342 by doing:

prefetch SRR8712342

I then tried to extract the FASTQ files with fastq-dump. Because this is a 10x scRNA-seq data set, I used the following options:

fastq-dump --split-files --outdir fastq --gzip --readids --read-filter pass --dumpbase --clip --defline-seq '@$ac.$si.$sg/$ri' --defline-qual '+' SRR8712342/SRR8712342.sra -I

I get two FASTQ files (SRR8712342_pass_2.fastq.gz and SRR8712342_pass_3.fastq.gz). File number 2 corresponds to what I think is usually called R1. Here are its first reads:

@SRR8712342.1.CACGCCTT/2
NTTTTGGGCCCCTACTCTATTCCTTTTATGCAAACCTCACAGAATTTTAACCAGAAAGGCCAGGCAGGATGGCTCACGCCTGTAATCACAGCGCTTTG
+
#AAAAEE<</<EEEEEEEEEEE/A<EEAEEA<///EEE/E/E/AEEEE<///E////A<E/<E<E/EE/EEAEE</A/A/E/<<//E/A/AE//////
@SRR8712342.2.CACGCCTT/2
NAAGAGGAACTGCTGGCCACGAGTACGGGGTGTGGCCATGAATCCTGTGGAGCATCCTTTTGGAGGTGGCAACCACCAGCACATCGGCAAGCCCTCCA
+
#AAAAEEEEEEEEEEEEEEEE<EEEEAEEEEEEEAEEEEE</EEEE<EEEAEEEEEEE<AEE//EEEEAE/<EEAEEEAE6EEEAAAEA/6EEEEEE/
@SRR8712342.3.CACGCCTT/2
NTGAAGATCATGCTGCCCTGGGACCCAACTGGTAAGATTGGCCCTAAGAAGCCCCTGCCTGACCACGTGAGCATTGTGGAACCCAAAGATGAGATACT

However, I think file number 3 contains the sample indices (the so-called I1 file, I believe) instead of the barcodes and UMIs (the R2 file). Here are its first reads:

@SRR8712342.sra.1 NB500934:132:HNVKHBGX2:1:11101:1446:1079 length=8
CACGCCTT
+SRR8712342.sra.1 NB500934:132:HNVKHBGX2:1:11101:1446:1079 length=8
AAAAAEAA
@SRR8712342.sra.2 NB500934:132:HNVKHBGX2:1:11101:7122:1079 length=8
CACGCCTT
+SRR8712342.sra.2 NB500934:132:HNVKHBGX2:1:11101:7122:1079 length=8
AAAAAEEE
@SRR8712342.sra.3 NB500934:132:HNVKHBGX2:1:11101:5641:1080 length=8
CACGCCTT
+SRR8712342.sra.3 NB500934:132:HNVKHBGX2:1:11101:5641:1080 length=8
AAAAAEEE
@SRR8712342.sra.4 NB500934:132:HNVKHBGX2:1:11101:17715:1080 length=8
CACGCCTT
+SRR8712342.sra.4 NB500934:132:HNVKHBGX2:1:11101:17715:1080 length=8
A//AA///
@SRR8712342.sra.5 NB500934:132:HNVKHBGX2:1:11101:22159:1080 length=8
CACGCCTT
+SRR8712342.sra.5 NB500934:132:HNVKHBGX2:1:11101:22159:1080 length=8
AAAAAEEE

I know the R1 reads are in there (see https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR8712342, ticking both technical and biological reads). How can I tell fastq-dump to retrieve the correct files?
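
The defline template `@$ac.$si.$sg/$ri` in the command above expands to accession, spot id, spot group (the barcode or sample index stored with the spot), and read number, which is why the headers in file 2 look like `@SRR8712342.1.CACGCCTT/2`. A small sketch that splits such a header back into its fields; `parse_defline` is a hypothetical helper, not part of sra-toolkit, and it assumes exactly this template:

```shell
# Split a defline of the form @<accession>.<spot id>.<spot group>/<read number>
# into labelled fields. Assumes the '@$ac.$si.$sg/$ri' template from the question.
parse_defline() {
    printf '%s\n' "$1" |
        sed -E 's#^@([^.]+)\.([0-9]+)\.([^/]*)/([0-9]+)$#accession=\1 spot=\2 group=\3 read=\4#'
}

parse_defline '@SRR8712342.1.CACGCCTT/2'
```

The spot group CACGCCTT in the file 2 headers is the same 8-base sequence that appears as the reads in file 3, which is consistent with file 3 holding the sample index.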

GenoMax wrote (12 weeks ago):

I am able to get three files using fastq-dump from sra-toolkit v2.10.5:

fastq-dump -F --split-files SRR8712342.sra

@NB500934:132:HNVKHBGX2:1:11101:1446:1079
CGAGCNCGTAAGGATTTTTCAGAATG
+NB500934:132:HNVKHBGX2:1:11101:1446:1079
AAAAA#AEEEEEEEEEEEEEEEEEEE

@NB500934:132:HNVKHBGX2:1:11101:1446:1079
NTTTTGGGCCCCTACTCTATTCCTTTTATGCAAACCTCACAGAATTTTAACCAGAAAGGCCAGGCAGGATGGCTCACGCCTGTAATCACAGCGCTTTG
+NB500934:132:HNVKHBGX2:1:11101:1446:1079
#AAAAEE<</<EEEEEEEEEEE/A<EEAEEA<///EEE/E/E/AEEEE<///E////A<E/<E<E/EE/EEAEE</A/A/E/<<//E/A/AE//////

@NB500934:132:HNVKHBGX2:1:11101:1446:1079
CACGCCTT
+NB500934:132:HNVKHBGX2:1:11101:1446:1079
AAAAAEAA
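
The three records above differ in read length (26, 98 and 8 bases), which gives a quick way to tell the files apart. A minimal sketch, assuming 10x Chromium v2 chemistry, where the 26-base read holds the 16-base cell barcode plus the 10-base UMI, the 98-base read is the cDNA insert, and the 8-base read is the sample index (I1). The function names `first_read_length` and `guess_10x_read` are hypothetical helpers, not part of sra-toolkit:

```shell
# Print the length of the first sequence in a FASTQ stream read from stdin.
first_read_length() {
    awk 'NR == 2 { print length($0); exit }'
}

# Map a read length to its likely role.
# Assumption: 10x Chromium v2 layout, using the lengths seen in this thread.
guess_10x_read() {
    case "$1" in
        26) echo "barcode+UMI" ;;
        98) echo "cDNA" ;;
        8)  echo "sample index (I1)" ;;
        *)  echo "unknown" ;;
    esac
}
```

For gzipped output, something like `gzip -dc fastq/SRR8712342_pass_3.fastq.gz | first_read_length` should report the length of the first read in that file, which can then be passed to `guess_10x_read`.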

Oh, interesting, thanks a lot! Do you happen to know which of my fastq-dump options is the culprit for getting only 2 files?

Reply written 12 weeks ago by requiem_data
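
One way to find the culprit is to rerun the original command on a small subset with one option removed at a time and see when the third file reappears. The sketch below is a dry run: it only prints candidate commands and does not run fastq-dump. The candidate list, the `-X 1000` spot limit, the `test_no_*` directory names, and the function name `print_bisect_commands` are assumptions for illustration; `--gzip` and the defline options are left out to keep each trial small. Note also that `-I` in the original command is the short form of `--readids`, so that option was given twice.

```shell
# Print one fastq-dump command per candidate option, each with that option left out.
# Nothing is executed; review the output, then run the lines you want.
# Assumption: the first 1000 spots (-X 1000) are enough to see how many files appear.
print_bisect_commands() {
    sra="$1"
    for skip in "--readids" "--read-filter pass" "--dumpbase" "--clip"; do
        cmd="fastq-dump -X 1000 --split-files"
        for opt in "--readids" "--read-filter pass" "--dumpbase" "--clip"; do
            [ "$opt" = "$skip" ] || cmd="$cmd $opt"
        done
        # One output directory per trial (hypothetical naming scheme).
        tag=$(printf '%s' "$skip" | sed 's/-//g; s/ /_/g')
        echo "$cmd --outdir test_no_$tag $sra"
    done
}

print_bisect_commands SRR8712342/SRR8712342.sra
```

After running the printed commands, the trial directory that contains three files instead of two points at the option that was suppressing the missing read.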

I got curious and looked up SRR8712342 in the ENA archive. It looks like there is only one FASTQ file there, even though the run is paired-end. All reads are 98 bp long and all read names end with "/2", which (sometimes) means read 2. Odd... I don't know whether this has anything to do with scRNA-seq, which I'm not familiar with...

Reply written 12 weeks ago by dariober
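
The observation that every read name ends with "/2" can be checked without reading the file by eye. A minimal sketch, assuming plain 4-line FASTQ records; `count_read_suffixes` is a hypothetical helper name:

```shell
# Tally the "/N" suffix on FASTQ header lines read from stdin.
# Headers are taken as every 4th line starting at line 1 (assumes 4-line records).
count_read_suffixes() {
    awk 'NR % 4 == 1 {
             name = $1    # ignore any description after the first space
             if (match(name, /\/[0-9]+$/))
                 n[substr(name, RSTART)]++
             else
                 n["none"]++
         }
         END { for (s in n) print s, n[s] }' | sort
}
```

For a gzipped download this could be used as `gzip -dc reads.fastq.gz | count_read_suffixes`, where `reads.fastq.gz` stands in for whatever file name the archive provides. A single `/2` line in the output would confirm that no read carries a `/1` suffix.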

10x data is starting to look like a dumpster fire on both NCBI and ENA. If the original BAMs are available, that is the safest way to go.

Note: ENA seems to be using the old default /2 designation, likely to indicate that this is R2, which is the actual read in 10x data.

Reply written 12 weeks ago by GenoMax