Question: Bulk download of entire BioProject SRA
20 months ago by
Anand Rao320
United States
Anand Rao320 wrote:

I am trying to download the entire dataset for a BioProject using esearch and efetch from the Entrez Utilities.

My syntax is based on syntax posted by @Istvan Albert at C: How to download raw sequence data from GEO/SRA, which is

esearch -db sra -query PRJNA40075 | efetch --format runinfo | cut -d ',' -f 1 | grep SRR | head -5 | xargs fastq-dump -X 10 --split-files

For the BioProject I am interested in, PRJNA269201, the slightly truncated syntax shown below creates 144 empty files, as expected:

esearch -db sra -query PRJNA269201  | efetch --format runinfo | cut -d ',' -f 1 | grep SRR | xargs touch

However, when I try the full-length syntax, it behaves differently from what I expected under both scenarios 1 and 2 detailed below:

Scenario 1. On the head node of a cluster:

esearch -db sra -query PRJNA269201  | efetch --format runinfo | cut -d ',' -f 1 | grep SRR | head -2 | xargs fastq-dump --split-files

One file finished downloading, but it is 5.5G, far larger than the 1.2GB I expected based on the info at this link. Is this difference due to file compression? How can I download a more compressed version, both to save storage and for downstream RNA-Seq analyses?

-rw-rw-r-- 1 aksrao aksrao 1.1G Jan 19 19:47 SRR1726554_1.fastq

-rw-rw-r-- 1 aksrao aksrao 5.5G Jan 19 19:44 SRR1726553_1.fastq

Scenario 2. When I submit this as a shell script (SLURM queue management on an Ubuntu cluster), the STDERR stream captures the following error message:

2019-01-20T02:28:55 fastq-dump.2.8.2 err: param empty while validating argument list - expected accession

The same problem was reported on the original post by user @bandanaschapagain, but it does not appear to have been answered or resolved, so I am posting it afresh. Could someone please help me? Thank you!

ADD COMMENTlink modified 20 months ago by Arup Ghosh2.7k • written 20 months ago by Anand Rao320
20 months ago by
Arup Ghosh2.7k
India
Arup Ghosh2.7k wrote:

Download the RunInfo table and use parallel to download multiple files at once.

#!/bin/bash
# Change the number after -j to set how many files are processed at once.
parallel --verbose -j 20 prefetch {} ::: $(cut -f5 SraRunTable.txt ) >>sra_download.log
wait
parallel --verbose -j 20 fastq-dump --split-files {} ::: $(cut -f5 SraRunTable.txt ) >>sra_dump.log
wait
exit
ADD COMMENTlink written 20 months ago by Arup Ghosh2.7k

I would always avoid using fastq-dump to pull files directly from the SRA, as it tends to be unstable. It is better to download the SRA files to disk with prefetch and then run fastq-dump on them, provided the data are not available directly in fastq format from the ENA.
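
As a minimal sketch of that two-step approach for a single run: the helper name fetch_then_dump is hypothetical, and the sketch assumes prefetch and fastq-dump from the SRA Toolkit are on PATH and share the toolkit's default cache, so that fastq-dump picks up the file prefetch stored.

```shell
#!/bin/bash
# Sketch only: fetch_then_dump is a hypothetical helper name.
# Assumes prefetch and fastq-dump (SRA Toolkit) are on PATH and use the
# same default cache location.
fetch_then_dump() {
    local acc="$1"
    if [ -z "$acc" ]; then
        echo "usage: fetch_then_dump ACCESSION" >&2
        return 2
    fi
    # Convert only if the download step succeeded.
    prefetch "$acc" && fastq-dump --split-files --gzip "$acc"
}
```

It would be called as, for example, fetch_then_dump SRR1726554. Chaining with && means a failed download is not followed by a conversion attempt against the remote archive.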

ADD REPLYlink written 20 months ago by ATpoint38k

No need for wait. GNU Parallel does that for you.

ADD REPLYlink written 6 weeks ago by ole.tange3.9k
20 months ago by
Canada
manuel.belmadani1.2k wrote:

I don't think you're doing anything wrong; the first run (SRR1726554) matches what's on SRA (1.1G). I downloaded SRR1726553 myself and also got a .fastq file of 5.6G. It could be that the SRA metadata is wrong; I would contact them and ask for more information (e-mail at sra@ncbi.nlm.nih.gov).

You can get a compressed version by passing --gzip to fastq-dump. Most aligners will accept gzipped fastq files as input. My full fastq-dump command is:
fastq-dump $SRA_FILE --outdir $SRA_DIR --gzip --skip-technical --readids --dumpbase --split-files --clip

I would recommend reading the Edwards lab fastq-dump article to learn more about some useful options.

For your error message in Scenario 2, I suspect your accession is not getting passed correctly (based on the "expected accession" part). Maybe add some print calls to your shell script (for example, wrap the fastq-dump part in an echo and write it to a file) to see what command it is actually trying to execute. The error message seems rather rare, so it may be worth asking SRA support about that too.
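
The echo-based check suggested above could be sketched as follows. One possible explanation, which is an assumption and not confirmed in this thread, is that the esearch/efetch pipeline printed nothing inside the batch job; GNU xargs still runs its command once on empty input, which would start fastq-dump with no accession. The helper name print_dump_commands and the file name runs.txt are hypothetical.

```shell
#!/bin/bash
# Sketch only: print_dump_commands is a hypothetical helper name.
# It prints the fastq-dump command for each accession in a file
# (one accession per line) without running anything.
print_dump_commands() {
    local runs_file="$1"
    # Fail early if the accession list is missing or empty.
    if [ ! -s "$runs_file" ]; then
        echo "no accessions found in $runs_file" >&2
        return 1
    fi
    # -r (GNU xargs): do not run the command at all on empty input.
    xargs -r -n 1 echo fastq-dump --split-files < "$runs_file"
}
```

In the job script, the output of the esearch/efetch/cut/grep pipeline would first be saved to runs.txt, and print_dump_commands runs.txt would then show exactly which commands would be run. An empty runs.txt points at the search step (for example, tools missing from PATH or no network access on the compute node) and not at fastq-dump itself.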

ADD COMMENTlink modified 20 months ago • written 20 months ago by manuel.belmadani1.2k