Question: Is There A Faster Replacement For Fastq-Dump (From The Sra-Toolkit)?
15
4.4 years ago by
Devon Ryan81k
Freiburg, Germany
Devon Ryan81k wrote:

I occasionally need to reprocess previously published datasets, often stored in the Short Read Archive. For the most part, I need the raw fastq files, so I use fastq-dump. While that gets the job done, it's annoyingly slow for what it's actually doing. Has anyone come across a program that can extract reads in fastq format from SRA files more quickly? While I could presumably write a faster program, I'd like to avoid reinventing the wheel if possible.

I should note that I'm aware that many datasets are available in fastq format via ENA, but unfortunately not all of them are.

sra fastq • 5.6k views
ADD COMMENTlink modified 4 months ago by ATpoint4.4k • written 4.4 years ago by Devon Ryan81k
4

Incidentally, fastq-dump has many other limitations: the way it handles the internet connection and its "security handshakes" (what those are I don't know) gets in the way.

It is the only bioinformatics program so far that does not work on Bash on Windows! Think about that for a second. For my book I had to find a simple replacement for it and came up with wonderdump, a replacement for the network access of fastq-dump. It uses plain curl for the download, and it does indeed work much faster than the regular fastq-dump.

http://data.biostarhandbook.com/scripts/wonderdump.sh
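
For readers who just want the idea, here is a minimal sketch of the "download with curl first, convert locally afterwards" approach. It is not the wonderdump script itself. The function name fetch_and_dump and the CURL and FASTQ_DUMP variables are made up for illustration, and the download URL has to be supplied by the caller because no NCBI path layout is assumed here.

```shell
# Tools are read from variables so they can be swapped out; the defaults are the real commands.
CURL=${CURL:-curl}
FASTQ_DUMP=${FASTQ_DUMP:-fastq-dump}

# Hypothetical helper: download one .sra file, then convert the local copy.
# Usage: fetch_and_dump URL [OUTDIR]
fetch_and_dump() {
    url=$1
    outdir=${2:-.}
    sra_file="$outdir/${url##*/}"      # keep the file name from the end of the URL
    mkdir -p "$outdir" || return 1
    # -f: fail on server errors, -L: follow redirects, -o: write to the named file
    "$CURL" -f -L -o "$sra_file" "$url" || return 1
    # Convert the local file; fastq-dump does not need the network for this step.
    "$FASTQ_DUMP" --outdir "$outdir" "$sra_file"
}
```

Called as fetch_and_dump followed by the URL of the .sra file and an output directory, it leaves both the .sra file and the fastq output in that directory.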

ADD REPLYlink modified 19 months ago • written 19 months ago by Istvan Albert ♦♦ 77k
2

Last I checked, Japan's SRA mirror had not yet moved over to the binary SRA format, so you can still grab fastqs from there. Might be a useful workaround.

ADD REPLYlink written 4.4 years ago by Ying W3.8k

Oh the amount of time noticing that would have saved me! :P Good on the Japanese for so far avoiding the SRA format annoyance!

ADD REPLYlink written 4.4 years ago by Devon Ryan81k
1

Honestly, I really doubt it, but I would agree that this binary SRA format is a major PITA.

ADD REPLYlink written 4.4 years ago by Istvan Albert ♦♦ 77k

I also doubt anyone has written a second method if this one works and is supported. Dumping from SRA isn't a task you have to do repeatedly, and therefore it is not a target for optimization. Try to use your fastest hard drives, assuming IO is the bottleneck. Many Linux systems have a RAM drive at /dev/shm that you should look into for vastly sped-up IO.
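
A small sketch of the /dev/shm suggestion, assuming a POSIX shell. The helper name pick_scratch_dir is made up for illustration. It only picks a directory; whether this actually speeds anything up depends on IO being the bottleneck in the first place.

```shell
# Hypothetical helper: print /dev/shm when it is a writable directory,
# otherwise fall back to $TMPDIR, and to /tmp if that is unset.
pick_scratch_dir() {
    if [ -d /dev/shm ] && [ -w /dev/shm ]; then
        echo /dev/shm
    else
        echo "${TMPDIR:-/tmp}"
    fi
}

# Example (not run here; SRR1509032 is an accession mentioned elsewhere in this thread):
#   fastq-dump --outdir "$(pick_scratch_dir)" SRR1509032
```

Keep in mind that files written to /dev/shm live in memory, so a large fastq file there competes with everything else for RAM.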

ADD REPLYlink written 4.4 years ago by karl.stamm3.2k
1

I/O is not the bottleneck. In fact, I can invoke an instance for every core on my workstation and still not max out I/O (or even come very close, for that matter).

ADD REPLYlink written 4.4 years ago by Devon Ryan81k

fastq-dump will never be a good choice. The download speed is not so fast, and there are always confusing problems, such as: 2016-11-12T08:33:35 fastq-dump.2.7 err: item not found while constructing within virtual database module - the path 'SRR1286321' cannot be opened as database or table. I prefer to use wget or curl.

ADD REPLYlink written 19 months ago by Shicheng Guo4.9k
1
4 months ago by
ATpoint4.4k
Germany
ATpoint4.4k wrote:

I found parallel-fastq-dump quite useful. It is a wrapper from Renan Valieris that uses the -N and -X options of fastq-dump to convert multiple chunks of the SRA file in parallel, merging the chunks into the final fastq after successful conversion. It requires python3 and worked well in my hands. It is easy to install with conda: conda install parallel-fastq-dump
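
To illustrate the chunking idea described above, here is a rough sketch in plain shell. It is not the code of parallel-fastq-dump. The function name chunk_ranges is made up, and the total number of spots in the run is assumed to be known already (for example from the run's metadata). The function only computes the spot ranges that would be handed to -N (first spot) and -X (last spot).

```shell
# Hypothetical helper: print one "first last" pair of 1-based spot IDs per chunk,
# covering spots 1..TOTAL without gaps or overlaps.
# Usage: chunk_ranges TOTAL_SPOTS N_CHUNKS
chunk_ranges() {
    total=$1
    chunks=$2
    if [ "$total" -lt 1 ] || [ "$chunks" -lt 1 ]; then
        return 1
    fi
    # Never create more chunks than there are spots.
    if [ "$chunks" -gt "$total" ]; then
        chunks=$total
    fi
    base=$((total / chunks))     # minimum number of spots per chunk
    extra=$((total % chunks))    # the first $extra chunks get one more spot
    first=1
    i=0
    while [ "$i" -lt "$chunks" ]; do
        size=$base
        if [ "$i" -lt "$extra" ]; then
            size=$((base + 1))
        fi
        last=$((first + size - 1))
        echo "$first $last"
        first=$((last + 1))
        i=$((i + 1))
    done
}

# Example (not run here): convert each chunk in the background, then wait for all of them.
#   chunk_ranges "$total_spots" 4 | {
#       while read -r first last; do
#           fastq-dump -N "$first" -X "$last" -O "chunk_$first" "$accession" &
#       done
#       wait
#   }
```

The per-chunk fastq files then still need to be concatenated in chunk order to get the final file.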

ADD COMMENTlink written 4 months ago by ATpoint4.4k
0
23 months ago by
endrebak640
endrebak640 wrote:

sam-dump seems a lot faster for me:

sam-dump ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP038/SRP038893/SRR1509032/SRR1509032.sra | head | grep -v '^@' | awk '{print "@"$1"\n"$10"\n+\n"$11}'

Or if your data is paired:

sam-dump <your_data> | grep -v '^@' | awk 'NR%2==1 {print "@"$1"\n"$10"\n+\n"$11}' > samplename_1.fastq 
sam-dump <your_data> | grep -v '^@' | awk 'NR%2==0 {print "@"$1"\n"$10"\n+\n"$11}' > samplename_2.fastq

( awk lines stolen from here: http://www.cureffi.org/2013/07/04/how-to-convert-sam-to-fastq-with-unix-command-line-tools/ )
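
The paired commands above read the data twice. A possible variation, sketched here under the same assumption that mates alternate line by line in the sam-dump output, is to split the records in a single awk pass. The function name sam_to_paired_fastq and the output naming are made up for illustration.

```shell
# Hypothetical helper: read SAM from stdin and write the odd-numbered records to
# PREFIX_1.fastq and the even-numbered records to PREFIX_2.fastq in one pass.
# Usage: sam-dump <your_data> | sam_to_paired_fastq samplename
sam_to_paired_fastq() {
    prefix=$1
    awk -F '\t' -v out1="${prefix}_1.fastq" -v out2="${prefix}_2.fastq" '
        /^@/ { next }    # skip SAM header lines
        {
            n++          # count alignment records only
            out = (n % 2 == 1) ? out1 : out2
            # SAM columns: 1 = read name, 10 = sequence, 11 = base qualities
            print "@" $1 "\n" $10 "\n+\n" $11 > out
        }
    '
}
```

If the mates are not strictly interleaved, this will mis-assign reads in the same way the two-command version would, so it is worth checking a few records first.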

ADD COMMENTlink written 23 months ago by endrebak640
0
4 months ago by
sutturka120
USA
sutturka120 wrote:

Please check my answer in this thread. It might be useful.

ADD COMMENTlink written 4 months ago by sutturka120