Question

How do I split a combined fastq file downloaded from SRA into separate _1.fq and _2.fq read pairs?

9.7 years ago

simonH ▴ 20

I've downloaded a FASTQ file from SRA (http://trace.ncbi.nlm.nih.gov/Traces/sra/) containing reads from a paired-end Illumina 101 bp RNA-seq experiment. The only problem is that it contains both mates of each pair interleaved in a single file, whereas I need two separate files: one with all the _1.fq reads and one with all the _2.fq reads.

Can anybody help? I'm aware of the fastq-dump tool within the SRA Toolkit, but I couldn't get it to work when I was originally downloading the data.

Many thanks in advance.

My fastq file looks like this:

$ head sra_data.fastq
@SRR1659960.1.1 1 length=101
NAGAAATGAATGAGCCTACAGATGATAGGATGTTTCATGTGGTGTATGCATCGGGGTAGTCCGAGTAACGTCGGGGCATTCCGGATAGGCCGAGAAAGTGT
+SRR1659960.1.1 1 length=101
#1=BDDDDDHFBFIEHHHHAG<HE@HGGE@HHFGHGGHHFHIHG@FFGGGHIIIIIFAC=F@GEGEECCDCECCBBBBCCCD>9599>C:@>5@9>?CCCD
@SRR1659960.1.2 1 length=101
CCCACTTCCACTATGTCCTATCAATAGGAGCTGTATTTGCCATCATAGGAGGCTTCATTCACTGATTTCCCCTATTCTCAGGCTACACCCTAGACCAAACC
+SRR1659960.1.2 1 length=101
<7?BD?DD<DFFABBEHEEFHII>C:BCDD?<C?FFC4E>@DEF>?FGHDFBBCG8??DGGIII:BF@C=FFC;C=D;@?EA76?DDBEC?>>ACCCABBB
@SRR1659960.2.1 2 length=101
NATAAAGTGTATGACAAATATACAAGGCTCCTAATATTGGTTTAACTTGGAGAAGTAGGTAAAGGAAGAAGGGNAAAGGAAATAGACAAAAAGACTACAGT

sequence RNA-Seq • 4.7k views

updated 9.7 years ago by pevsner ▴ 420 • written 9.7 years ago by simonH ▴ 20

9.7 years ago

h.mon 35k

fastq-dump --split-files

updated 5.6 years ago by Ram 45k • written 9.7 years ago by h.mon 35k

I've tried this, but I get the following error message:

$ sratoolkit.2.5.4-1-centos_linux64/bin/fastq-dump --split-files sra_data.fastq
2015-10-15T22:59:03 fastq-dump.2.5.4 err: item not found while constructing within virtual database module - the path 'sra_data.fastq' cannot be opened as database or table

updated 5.6 years ago by Ram 45k • written 9.7 years ago by simonH ▴ 20

9.7 years ago

pevsner ▴ 420

Try this:

fastq-dump --split-files SRR1659960

For a description of the --split-files argument try:

fastq-dump --help

You can track the progress of the download by checking the file sizes in your directory:

ls -lh SRR1659960_*
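
Once the download has finished, one way to sanity-check the result is to confirm that both mate files hold the same number of reads. The following is a minimal sketch, assuming four lines per FASTQ record and the default output names SRR1659960_1.fastq and SRR1659960_2.fastq; count_reads and check_pairs are hypothetical helper names, not part of the SRA Toolkit:

```shell
# Hypothetical helper: number of FASTQ records in a file.
# Assumes exactly 4 lines per record (no wrapped sequence lines).
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Hypothetical helper: compare the read counts of two mate files.
# Returns non-zero if the counts differ.
check_pairs() {
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    if [ "$n1" -eq "$n2" ]; then
        echo "OK: $n1 reads in each file"
    else
        echo "MISMATCH: $n1 vs $n2 reads" >&2
        return 1
    fi
}

# Usage once the download has finished (file names assumed):
# check_pairs SRR1659960_1.fastq SRR1659960_2.fastq
```

A mismatch would suggest an incomplete download or unpaired reads in the run, so it's worth checking before feeding the files to an aligner.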

updated 5.6 years ago by Ram 45k • written 9.7 years ago by pevsner ▴ 420

Thanks, I'm doing this now. I was hoping to find a way that didn't involve re-downloading the whole dataset (it's 49 GB in compressed form), but it looks like I'll have to. Cheers.

written 9.7 years ago by simonH ▴ 20

Ram · Accepted Answer · 2015-10-15

9.7 years ago

Brian Bushnell 20k

Use Reformat from the BBMap package:

reformat.sh in=sra_data.fastq out1=r1.fq out2=r2.fq interleaved
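
For reference, the same kind of split can be sketched with plain awk if BBMap isn't available. This is not how Reformat works internally; it's a minimal illustration that assumes exactly four lines per record and strictly alternating mates (as in the head output in the question). The function name deinterleave and the output names in the usage line are hypothetical:

```shell
# Hypothetical helper: split an interleaved FASTQ into two mate files.
# Assumes exactly 4 lines per record and strictly alternating mates,
# so every block of 8 lines holds one pair.
deinterleave() {
    awk -v o1="$2" -v o2="$3" '
        { r = (NR - 1) % 8 }          # 0-based position within the 8-line pair block
        r < 4 { print > o1; next }    # lines 1-4 of the block: first mate
        { print > o2 }                # lines 5-8 of the block: second mate
    ' "$1"
}

# Usage (file names assumed):
# deinterleave sra_data.fastq r1.fq r2.fq
```

Unlike Reformat, this sketch does no validation of read names, so it will silently produce mismatched files if any mate is missing from the input.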

updated 5.6 years ago by Ram 45k • written 9.7 years ago by Brian Bushnell 20k

Sorry, I missed this earlier. Thanks! I'm downloading BBMap now and will report back.