Question: Bulk download of entire BioProject SRA
7 months ago, Anand Rao (United States) wrote:

I am trying to download the entire dataset for a BioProject using esearch and efetch from the Entrez Utilities.

My command is based on the one posted by @Istvan Albert at "C: How to download raw sequence data from GEO/SRA", which is

esearch -db sra -query PRJNA40075 | efetch --format runinfo | cut -d ',' -f 1 | grep SRR | head -5 | xargs fastq-dump -X 10 --split-files

For the BioProject I am interested in, PRJNA269201, a slightly truncated version of the command, shown below, creates 144 empty files as expected:

esearch -db sra -query PRJNA269201  | efetch --format runinfo | cut -d ',' -f 1 | grep SRR | xargs touch

However, when I run the full command, it behaves differently from what I expected in both scenarios 1 and 2 below:

Scenario 1. On head-node of a cluster:

esearch -db sra -query PRJNA269201  | efetch --format runinfo | cut -d ',' -f 1 | grep SRR | head -2 | xargs fastq-dump --split-files

One file finished downloading, but it is 5.5G, which is much larger than the 1.2GB I expected based on the info at this link. Is this difference due to file compression? How can I download a much more compressed version, for both storage and downstream RNA-Seq analyses?

-rw-rw-r-- 1 aksrao aksrao 1.1G Jan 19 19:47 SRR1726554_1.fastq

-rw-rw-r-- 1 aksrao aksrao 5.5G Jan 19 19:44 SRR1726553_1.fastq

Scenario 2. When I submit this as a shell script (SLURM queue manager on an Ubuntu cluster), the STDERR stream captures the following error message:

2019-01-20T02:28:55 fastq-dump.2.8.2 err: param empty while validating argument list - expected accession

The same problem was reported on the original post by user @bandanaschapagain, but it does not appear to have been answered or resolved, so I am posting it afresh. Could someone please help me? Thank you!

6 months ago, arup wrote:

Download the RunInfo table and use parallel to download multiple files at once.

# The number after -j sets how many files are processed at once.
# Assumes column 5 of the tab-separated SraRunTable.txt holds the run accession.
parallel --verbose -j 20 prefetch {} ::: $(cut -f5 SraRunTable.txt) >> sra_download.log
parallel --verbose -j 20 fastq-dump --split-files {} ::: $(cut -f5 SraRunTable.txt) >> sra_dump.log

6 months ago, ATpoint replied:

I would always avoid using fastq-dump to load files directly from the SRA, as it tends to be unstable. It is better to download the SRA files to disk with prefetch and then run fastq-dump on them, assuming the data are not already available directly in FASTQ format at the ENA.
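
The prefetch-then-fastq-dump approach recommended above could be scripted roughly as follows. This is only a sketch: the function names run_cmd and fetch_then_dump and the DRY_RUN switch are made up for illustration, and it assumes fastq-dump resolves each accession to the copy that prefetch stored, which may depend on the sra-tools version and configuration.

```shell
# Hypothetical wrapper (not part of sra-tools): print or execute a command.
# Set DRY_RUN=1 to print the commands instead of running them.
run_cmd() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Hypothetical helper: for each accession, download first, then convert.
# $1: text file with one run accession per line.
fetch_then_dump() {
    while IFS= read -r acc; do
        [ -n "$acc" ] || continue                      # skip blank lines
        run_cmd prefetch "$acc" || return 1            # stop on a failed download
        run_cmd fastq-dump --split-files "$acc" || return 1
    done < "$1"
}

# Example (commented out: needs network access and sra-tools):
#   DRY_RUN=1; fetch_then_dump runs.txt   # preview the commands
#   DRY_RUN=0; fetch_then_dump runs.txt   # actually run them
```

Previewing with DRY_RUN=1 first is a cheap way to confirm that the accession list is what you expect before starting a long download.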
6 months ago, manuel.belmadani wrote:

I don't think you're doing anything wrong; the first run (SRR1726554) matches what's on SRA (1.1G). I downloaded SRR1726553 myself and also got a .fastq file of 5.6G. It could be that the SRA metadata is wrong; I would contact them and ask for more information (e-mail at

You can get compressed output by adding --gzip to your fastq-dump calls. Most aligners accept gzipped FASTQ files as input. My full fastq-dump command is:
fastq-dump $SRA_FILE --outdir $SRA_DIR --gzip --skip-technical --readids --dumpbase --split-files --clip
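
For files that were already dumped uncompressed (such as the 5.5G SRR1726553_1.fastq above), it may be simpler to compress them in place with gzip than to download them again. The sketch below adds a small read counter for checking that nothing was lost. The name count_reads is made up for illustration, and it assumes standard four-line FASTQ records with no wrapped sequence lines.

```shell
# Hypothetical helper: count the reads in a FASTQ file, gzipped or plain.
# Assumes exactly 4 lines per record.
count_reads() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;   # decompress to stdout, leave the file untouched
        *)    cat "$1" ;;
    esac | awk 'END { print NR / 4 }'
}

# Example (commented out: operates on your own files):
#   count_reads SRR1726553_1.fastq       # count before compressing
#   gzip SRR1726553_1.fastq              # replaces the file with SRR1726553_1.fastq.gz
#   count_reads SRR1726553_1.fastq.gz    # should print the same number
```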

I would recommend reading the Edwards lab fastq-dump article to learn more about some useful options.

For the error message in Scenario 2, I suspect your accession is not being passed correctly (based on the "expected accession" part). Maybe add some print calls to your shell script (for example, wrap the fastq-dump part in an echo and write it to a file) to see what command it is actually trying to execute. The error message seems rather rare, so it may be worth asking SRA support about that too.
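
One way to do that check is to write the accession list to a file and refuse to call fastq-dump when the list is empty. The sketch below is an illustration only: extract_runs and require_nonempty are made-up names, it assumes the run accession is in column 1 of the runinfo CSV (as in the original pipeline), and the -r flag assumes GNU xargs. An empty list inside the batch job is one possible cause of the error, not something confirmed in this thread.

```shell
# Hypothetical helpers; the names are illustrative, not part of EDirect or sra-tools.

# Read a runinfo CSV on stdin and print the SRR accessions from column 1.
# The header line and blank lines are dropped because they do not start with SRR.
extract_runs() {
    cut -d ',' -f 1 | grep '^SRR' || true
}

# Succeed only if the given file exists and is not empty.
require_nonempty() {
    if [ ! -s "$1" ]; then
        echo "no accessions in $1; check the esearch/efetch output" >&2
        return 1
    fi
}

# Intended use inside the batch script (commented out: needs network and sra-tools):
#   esearch -db sra -query PRJNA269201 | efetch --format runinfo > runinfo.csv
#   extract_runs < runinfo.csv > runs.txt
#   require_nonempty runs.txt && xargs -r fastq-dump --split-files < runs.txt
```

Keeping runinfo.csv and runs.txt on disk also leaves something to inspect after the job has finished.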

