Question: Bulk download of entire BioProject SRA
Anand Rao (United States) wrote, 7 months ago:

I am trying to download the entire dataset for a BioProject using esearch and efetch from the Entrez Utilities.

My command is based on the one posted by @Istvan Albert at C: How to download raw sequence data from GEO/SRA, which is:

esearch -db sra -query PRJNA40075 | efetch --format runinfo | cut -d ',' -f 1 | grep SRR | head -5 | xargs fastq-dump -X 10 --split-files

For the BioProject I am interested in, PRJNA269201, the slightly truncated command below creates 144 empty files, as expected:

esearch -db sra -query PRJNA269201  | efetch --format runinfo | cut -d ',' -f 1 | grep SRR | xargs touch

However, when I run the full command, it does not behave as I expected in either of the two scenarios below:

Scenario 1. On head-node of a cluster:

esearch -db sra -query PRJNA269201  | efetch --format runinfo | cut -d ',' -f 1 | grep SRR | head -2 | xargs fastq-dump --split-files

One file finished downloading, but it is 5.5G, which is far larger than the 1.2GB I expected based on the info at this link. Is this difference due to file compression? How can I download a much more compressed version, both for storage and for downstream RNA-Seq analyses?

-rw-rw-r-- 1 aksrao aksrao 1.1G Jan 19 19:47 SRR1726554_1.fastq

-rw-rw-r-- 1 aksrao aksrao 5.5G Jan 19 19:44 SRR1726553_1.fastq

Scenario 2. When I submit the same command as a shell script (SLURM queue management on an Ubuntu cluster), the STDERR stream captures the following error message:

2019-01-20T02:28:55 fastq-dump.2.8.2 err: param empty while validating argument list - expected accession

The same problem was reported on the original post by user @bandanaschapagain, but it does not appear to have been answered or resolved, so I am posting it afresh. Could someone please help me? Thank you!

arup (India) wrote, 6 months ago:

Download the RunInfo table and use parallel to download multiple files at once.

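#!/bin/bash
# The number after -j sets how many files are processed at once.
# cut -f5 takes column 5 of the tab-delimited SraRunTable.txt as the run accessions.
parallel --verbose -j 20 prefetch {} ::: $(cut -f5 SraRunTable.txt) >>sra_download.log
# parallel returns only after all of its jobs have finished, so no explicit wait is needed.
parallel --verbose -j 20 fastq-dump --split-files {} ::: $(cut -f5 SraRunTable.txt) >>sra_dump.log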
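
A fixed column number only works if the run accessions really sit in that column, and a header row would be passed along as if it were an accession. As a hedged alternative, the sketch below looks the column up by its header name. The function name run_column is made up for illustration, and it assumes the table has a header row with a column literally named "Run" and no quoted fields that contain the delimiter.

```shell
#!/bin/bash
# Print the values of the column whose header is "Run" from a delimited table.
# Assumptions: first line is a header, one column is named exactly "Run",
# and no field contains the delimiter inside quotes.
# Usage: run_column FILE [DELIMITER]   (delimiter defaults to a tab)
run_column() {
    local file=$1
    local delim=${2:-$(printf '\t')}
    awk -F"$delim" '
        # Header row: remember which column is called "Run", then skip the row.
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "Run") col = i; next }
        # Data rows: print that column, but only if the header was found.
        col     { print $col }
    ' "$file"
}
```

It could then replace the cut call, for example: parallel -j 20 prefetch {} ::: $(run_column SraRunTable.txt). For a comma-separated table, pass ',' as the second argument.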
ATpoint replied, 6 months ago:

I would always avoid using fastq-dump to load files directly from the SRA, as it tends to be unstable. If the data are not available directly in FASTQ format from the ENA, it is better to download the SRA files to disk with prefetch first and then run fastq-dump on them.
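
The prefetch-then-convert order described above can be sketched as a small loop. This is only an illustration: the function name fetch_then_dump and the DRY_RUN switch are made up here, and it assumes that fastq-dump, when given an accession, picks up the file prefetch has already stored locally (the default toolkit configuration).

```shell
#!/bin/bash
# Download each run with prefetch, then convert it with fastq-dump.
# Reads one accession per line from stdin; blank lines are skipped.
# Set DRY_RUN=1 to print the commands instead of running them.
fetch_then_dump() {
    local acc run=""
    if [ "${DRY_RUN:-0}" = 1 ]; then
        run="echo"
    fi
    while IFS= read -r acc; do
        [ -n "$acc" ] || continue
        # Convert only if the download succeeded; report any failure on stderr.
        $run prefetch "$acc" && $run fastq-dump --split-files --gzip "$acc" \
            || echo "FAILED: $acc" >&2
    done
}
```

With the accessions saved one per line in a file (runs.txt is a placeholder name), a dry run would be: DRY_RUN=1 fetch_then_dump < runs.txt. Processing runs one at a time is slower than the parallel version, but a failed download is reported for that accession alone instead of being lost in the log.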
manuel.belmadani (Canada) wrote, 6 months ago:

I don't think you're doing anything wrong; the first run (SRR1726554) matches what's on SRA (1.1G). I downloaded SRR1726553 myself and also got a .fastq file of 5.6G. It could be that the SRA metadata is wrong; I would contact them and ask for more information (e-mail at sra@ncbi.nlm.nih.gov).

You can get a compressed version by passing --gzip to fastq-dump. Most aligners accept gzipped FASTQ files as input. My full fastq-dump command is:
fastq-dump $SRA_FILE --outdir $SRA_DIR --gzip --skip-technical --readids --dumpbase --split-files --clip
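
For files that were already dumped uncompressed, such as the 5.5G one in the question, re-downloading is not necessary. A minimal sketch, using only standard gzip (the function name compress_fastq is made up for illustration):

```shell
#!/bin/bash
# Compress already-dumped FASTQ files in place and verify each archive.
# gzip replaces FILE with FILE.gz; gzip -t then checks the archive's integrity.
# Usage: compress_fastq FILE...
compress_fastq() {
    local f
    for f in "$@"; do
        gzip "$f" && gzip -t "$f.gz" || { echo "FAILED: $f" >&2; return 1; }
    done
}
```

For example: compress_fastq SRR1726553_1.fastq SRR1726554_1.fastq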

I would recommend reading the Edwards lab's fastq-dump article to learn more about some useful options.

As for the error message in Scenario 2, I suspect your accession is not being passed correctly (based on the "expected accession" part). Maybe add some print calls to your shell script, such as wrapping the fastq-dump part in an echo and writing it to a file, to see what command it is actually trying to execute. The error message seems rather rare, so it may be worth asking SRA support about that too.
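
One way to act on that suggestion is to check the accession list before it reaches fastq-dump. GNU xargs runs its command once even when its input is empty (unless -r is given), so if esearch or efetch produces nothing inside the batch job, fastq-dump may be started with no accession at all. That this is the cause of the error above is a guess, not something confirmed in this thread. The sketch below uses a made-up function name, valid_accessions, and assumes run accessions look like SRR, ERR or DRR followed by digits.

```shell
#!/bin/bash
# Validate a list of run accessions before handing it to fastq-dump.
# Reads candidate lines on stdin, prints those that look like run accessions
# (SRR/ERR/DRR followed by digits), and returns 1 if none are found.
valid_accessions() {
    local found
    found=$(grep -E '^[SED]RR[0-9]+$')
    if [ -z "$found" ]; then
        echo "no run accessions found: check that esearch/efetch ran and produced output" >&2
        return 1
    fi
    printf '%s\n' "$found"
}
```

In the job script it could sit in place of grep SRR, with the list written to a file first (runs.txt is a placeholder name): esearch -db sra -query PRJNA269201 | efetch --format runinfo | cut -d ',' -f 1 | valid_accessions > runs.txt && xargs fastq-dump --split-files < runs.txt. If the list is empty, the script stops with a message instead of calling fastq-dump, and runs.txt shows exactly what would have been passed.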
