Efficient way to download SRA dataset
2.5 years ago

I want to download the SRP325386 dataset for a class project on deep sequencing analysis. It contains 199,898 samples.

I used the following command:

esearch -db sra -query SRP325386 | efetch -format runinfo | cut -d ',' -f 1 | grep SRR | xargs -n 1 -P 4 fastq-dump --split-3 --gzip --skip-technical --readids -W --read-filter pass 

After 12 hours, it's still downloading (5152 items thus far).

Did I use the wrong command? Is there a more efficient way to download datasets from SRA?


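You could try replacing fastq-dump with fasterq-dump, which is multi-threaded and usually faster. Note that it does not accept all of fastq-dump's options (for example, it has no --gzip, so you need to compress the output afterwards). Alternatively, have a look at nf-core fetchngs, which makes the download more convenient: it resolves the accessions, collects the metadata and manages the parallel downloads for you, whereas your script only runs four fastq-dump processes at a time via xargs -P 4.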
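To illustrate the fasterq-dump suggestion, here is a minimal sketch of a per-run download step: prefetch first, then convert, then compress. The helper names (run, fetch_run), the DRY_RUN switch, the thread count and the output directory are assumptions made for this example, not part of the original command, and fastq-dump options such as --readids or --read-filter pass are simply left out.

```shell
# Sketch only: fetch one run with prefetch, convert it with fasterq-dump,
# then gzip the result. Set DRY_RUN=1 to print the commands instead of
# running them (useful for checking the accession list first).

run() {
  # Hypothetical helper: echo the command in dry-run mode, otherwise run it.
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

fetch_run() {
  acc="$1"
  outdir="${2:-fastq}"
  # Download the .sra file first; conversion then works on the local copy.
  run prefetch "$acc"
  # 4 threads is an arbitrary choice; tune it to your machine.
  run fasterq-dump --split-3 --skip-technical --threads 4 --outdir "$outdir" "$acc"
  # fasterq-dump writes uncompressed FASTQ, so compress as a separate step.
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "gzip $outdir/${acc}*.fastq"
  else
    gzip "$outdir"/"$acc"*.fastq
  fi
}

# Process every accession given on the command line (none means no work).
for acc in "$@"; do
  fetch_run "$acc"
done
```

The function can be fed from the same accession pipeline as in the question, for example by piping the grep SRR output into `while read -r acc; do fetch_run "$acc"; done`.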


Thanks! Another trivial question: what constitutes an SRA "dataset"? Is it denoted with an "SRP" prefix?


SRA is the Sequence Read Archive operated by the NCBI. Its accessions form a hierarchy: studies/projects (SRP), samples (SRS, with SAMN being the corresponding BioSample accession), experiments (SRX) and runs (SRR). Typically, one will want to download all the data of a study/project, but arbitrary subsets are also possible.
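As a quick illustration of the prefixes, the following sketch maps an accession to its record type. The function name sra_kind and the returned labels are made up for this example; only the NCBI prefixes are covered, not the ENA/DDBJ equivalents.

```shell
# Sketch: classify an accession by its prefix.
sra_kind() {
  case "$1" in
    SRP*)   echo "study" ;;
    SRX*)   echo "experiment" ;;
    SRS*)   echo "sample" ;;
    SRR*)   echo "run" ;;
    SAMN*)  echo "biosample" ;;
    PRJNA*) echo "bioproject" ;;
    *)      echo "unknown" ;;
  esac
}

sra_kind SRP325386   # prints "study"
```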

fetchngs should be able to resolve whatever SRA ID it is given and download the accompanying data. Getting started takes a bit more work, but once you know how to run Nextflow and the nf-core pipelines, they are a huge time-saver, and the results usually follow current best practices.
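A minimal sketch of what a fetchngs run might look like, assuming Nextflow and Docker are installed. The file name ids.csv, the output directory, the docker profile and the RUN_FETCHNGS guard are choices made for this example; check the pipeline documentation for the options that fit your setup, especially for a study of this size.

```shell
# Write the ID list that fetchngs reads: one accession per line.
printf '%s\n' SRP325386 > ids.csv

# Launch the pipeline only when explicitly requested, so the ID file can be
# inspected first (RUN_FETCHNGS is a guard invented for this sketch).
if [ "${RUN_FETCHNGS:-0}" = "1" ]; then
  nextflow run nf-core/fetchngs -profile docker --input ids.csv --outdir fetchngs_results
fi
```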


You have a huge number of samples. Are you sure you actually want to download them all? I hope you have enough storage/bandwidth locally since you may run out of one of them before SRA does :-)
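Before committing to the full download, a rough estimate of size and time can help. The sketch below sums the size_MB column of the runinfo table and extrapolates the remaining time from the progress reported in the question. The function names are invented here, the extrapolation assumes a constant download rate, and size_MB refers to the archived .sra files, so the resulting FASTQ files may well be larger.

```shell
# Sketch: sum the size_MB column of a runinfo CSV read from stdin.
# The column is located by its header name; repeated header lines and
# empty values are skipped because they are not purely numeric.
total_size_mb() {
  awk -F',' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "size_MB") col = i; next }
    col && $col ~ /^[0-9]+$/ { sum += $col }
    END { print sum + 0 }
  '
}

# Sketch: linear extrapolation of the total download time in hours.
# Arguments: items finished so far, hours elapsed, total number of items.
eta_hours() {
  awk -v n_done="$1" -v hours="$2" -v total="$3" \
    'BEGIN { printf "%.1f\n", total * hours / n_done }'
}

# Usage with the pipeline from the question:
#   esearch -db sra -query SRP325386 | efetch -format runinfo | total_size_mb
eta_hours 5152 12 199898   # prints 465.6
```

At the reported rate of 5152 items in 12 hours, the whole study would take roughly 19 days, which may be a reason to download only a subset.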
