Question

understanding the output files of fasterq-dump --split-files

0

5.7 years ago

inbal.tzipermanl • 0

I am using fasterq-dump to download from SRA, with --split-files to split paired-end reads. As a result I get one or two files per run. When there are two files, one is named *_1.fastq and the other *_2.fastq, *_3.fastq, or *_4.fastq. I cannot find what these numbers mean.

the command I am using:

fasterq-dump --split-files -O /media/lab/fastq ERR016705

for example:

ERR016705 has 2 files: _1, _4
ERR015587 has 2 files: _1, _2

fasterq-dump • 8.9k views

ADD COMMENT • link updated 3.1 years ago by Ram 43k • written 5.7 years ago by inbal.tzipermanl • 0

0

I am also confused by this. The HowTo page says you can get 1 and 2.fastq files for paired reads, and a 3.fastq for unmated reads. But item 8 lists 1 and 2 and a plain .fastq. Is this plain .fastq also for unmated reads, or is it different from the 3.fastq? After reading this post, I'm not sure whether the .fastq file contains the unmated reads, or low-quality reads that should be ignored.

ADD REPLY • link 3.1 years ago by rturba ▴ 10

1

That option was likely used with older data. If you are looking at something recent, the chances of getting a third file should be small, unless the submitters supplied data from an index read as a separate file. In the case of single-cell data, 10x cellranger software produces a separate file for index reads when it is used for demultiplexing.
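
One rough way to check whether an extra file holds index reads is to look at read length, since index (barcode) reads are usually much shorter than the biological reads. A minimal sketch, assuming plain four-line FASTQ records; the function name first_read_length is made up for illustration and is not part of sra-tools:

```shell
# Print the length of the first read in each FASTQ file given as an argument.
# Assumes the standard layout where line 2 is the sequence of the first record.
first_read_length() {
    for f in "$@"; do
        len=$(awk 'NR == 2 { print length($0); exit }' "$f")
        printf '%s\t%s\n' "$f" "$len"
    done
}
```

Calling it as `first_read_length *.fastq` in the output directory lists one length per file. A file whose reads are only a handful of bases long is a candidate index file, although only the first record is inspected, so treat the result as a hint rather than proof.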

ADD REPLY • link 3.1 years ago by GenoMax 141k

0

For anyone reaching this post by search in the future: ERR016705 now shows just two fastq files at ENA.

ADD REPLY • link 3.1 years ago by GenoMax 141k

score 0 · Answer 1 · 2018-08-12

Have you looked at the headers of the fastq files? Even though the files themselves are named 1 and 4, the headers should tell you that these are the R1 and R2 files.
(Note: Illumina sequencing proceeds in the order Read 1 --> Index 1 --> Index 2 --> Read 2. Sometimes submitters dump the index sequences into individual files, and in that case the output files are named File 1 --> File 2 --> File 3 --> File 4 in that same order.)
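
A quick way to inspect the headers is to print the first line of each output file. This is only a sketch: the function name show_first_headers is hypothetical, and it assumes uncompressed FASTQ files in which the first line of the file is the header of the first record.

```shell
# Print the first header line of each FASTQ file given as an argument.
# In FASTQ, every record starts with a header line beginning with "@".
show_first_headers() {
    for f in "$@"; do
        printf '%s\t%s\n' "$f" "$(head -n 1 "$f")"
    done
}
```

For the run in the question this could be called as `show_first_headers /media/lab/fastq/ERR016705_*.fastq`. What the headers contain depends on how the data were submitted, so the read number may or may not be stated explicitly there.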