Question: Bulk download of entire BioProject SRA
Anand Rao (United States) wrote, 4 weeks ago:

I am trying to download the entire dataset for a BioProject using esearch and efetch from the Entrez Utilities.

My command is based on the one posted by @Istvan Albert at C: How to download raw sequence data from GEO/SRA, which is:

esearch -db sra -query PRJNA40075 | efetch --format runinfo | cut -d ',' -f 1 | grep SRR | head -5 | xargs fastq-dump -X 10 --split-files

For the BioProject I am interested in, PRJNA269201, the slightly truncated command below creates 144 empty files, as expected:

esearch -db sra -query PRJNA269201  | efetch --format runinfo | cut -d ',' -f 1 | grep SRR | xargs touch

However, the full-length command does not behave as I expected in either of the two scenarios detailed below:

Scenario 1. On the head node of a cluster:

esearch -db sra -query PRJNA269201  | efetch --format runinfo | cut -d ',' -f 1 | grep SRR | head -2 | xargs fastq-dump --split-files

One file finished downloading, but it is 5.5G, far larger than the 1.2GB I expected based on the info at this link. Is the difference due to file compression? How can I download a more compressed version, both to save storage and for downstream RNA-Seq analyses?

-rw-rw-r-- 1 aksrao aksrao 1.1G Jan 19 19:47 SRR1726554_1.fastq

-rw-rw-r-- 1 aksrao aksrao 5.5G Jan 19 19:44 SRR1726553_1.fastq

Scenario 2. When I submit this as a shell script (SLURM queue management on an Ubuntu cluster), the STDERR stream captures the following error message:

2019-01-20T02:28:55 fastq-dump.2.8.2 err: param empty while validating argument list - expected accession

The same problem was reported on the original post by user @bandanaschapagain, but it does not appear to have been answered or resolved, so I am posting it afresh. Could someone please help me? Thank you!

arup (India) wrote, 28 days ago:

Download the RunInfo table and use parallel to download multiple files at once.

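#!/bin/bash
# Column 5 of SraRunTable.txt is assumed to hold the run accessions (SRR...);
# grep drops the header row. The number after -j sets how many jobs run at once.
# Step 1: download the .sra files.
parallel --verbose -j 20 prefetch {} ::: $(cut -f5 SraRunTable.txt | grep '^SRR') >>sra_download.log
# Step 2: convert the downloaded runs to FASTQ.
parallel --verbose -j 20 fastq-dump --split-files {} ::: $(cut -f5 SraRunTable.txt | grep '^SRR') >>sra_dump.log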
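
The script above depends on the run accessions sitting in column 5 of the run table. As a rough sketch that avoids hard-coding the position, the column can be looked up by its header instead. The function name run_column is hypothetical, and it assumes the header of that column is exactly "Run" (check the first line of your own table before relying on it):

```shell
# Print the values of the column whose header is exactly "Run".
# $1 = table file, $2 = field separator (defaults to a tab)
run_column() {
    local sep=${2:-'\t'}
    awk -F "$sep" '
        # Header row: remember which field is called "Run".
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "Run") col = i; next }
        # Data rows: print that field when the column was found.
        col && $col != "" { print $col }
    ' "$1"
}
```

With this in place, `$(run_column SraRunTable.txt)` could stand in for the `cut -f5` call, and `run_column runinfo.csv ,` would handle a comma-separated runinfo file.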

I would always avoid using fastq-dump to load files directly from the SRA, as it tends to be unstable. It is better to download the SRA files to disk with prefetch and then run fastq-dump on them, assuming the data are not available directly in fastq format from the ENA.

(reply written 28 days ago by ATpoint)
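
The two-step approach in the reply above might look like the sketch below. The function name fetch_then_dump and the RUN variable are hypothetical: setting RUN=echo prints the commands instead of running them, which is handy for checking a batch script before submitting it. It assumes fastq-dump can resolve the accession from wherever prefetch stored the .sra file, which depends on your SRA Toolkit configuration.

```shell
# Download one run with prefetch, then convert it to gzipped FASTQ.
# $1 = run accession, $2 = output directory (defaults to the current one)
# RUN is a hypothetical switch: RUN=echo prints the commands instead of running them.
fetch_then_dump() {
    local acc=$1 outdir=${2:-.}
    # Stop here if the download fails, so fastq-dump never sees a missing run.
    ${RUN:-} prefetch "$acc" || return 1
    ${RUN:-} fastq-dump --split-files --gzip --outdir "$outdir" "$acc"
}
```

For example, `RUN=echo; fetch_then_dump SRR1726553 fastq` only prints the two commands; unset RUN to run them for real.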
manuel.belmadani (Canada) wrote, 29 days ago:

I don't think you're doing anything wrong; the first run (SRR1726554) matches what's on SRA (1.1G). I downloaded SRR1726553 myself and also got a .fastq file of 5.6G. It could be that the SRA metadata is wrong; I would contact them and ask for more information (e-mail at sra@ncbi.nlm.nih.gov).

You can get a compressed version by passing --gzip to fastq-dump. Most aligners will accept gzipped fastq files as input. My full fastq-dump command is:
fastq-dump $SRA_FILE --outdir $SRA_DIR --gzip --skip-technical --readids --dumpbase --split-files --clip
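
For FASTQ files that were already dumped uncompressed, re-downloading is not the only option; they can be compressed in place. A minimal sketch, where compress_fastq is a hypothetical helper name and plain gzip is assumed to be available:

```shell
# Compress every uncompressed FASTQ file in a directory in place.
# $1 = directory (defaults to the current one)
compress_fastq() {
    local dir=${1:-.} f
    for f in "$dir"/*.fastq; do
        [ -e "$f" ] || continue   # the glob matched nothing
        gzip "$f"                 # replaces X.fastq with X.fastq.gz
    done
}
```

Calling `compress_fastq .` in the download directory would turn SRR1726553_1.fastq into SRR1726553_1.fastq.gz.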

I would recommend reading the Edwards lab's fastq-dump article to learn more about some useful options.

For your error message in Scenario 2, I suspect your accession is not getting passed correctly (based on the "expected accession" part). Maybe add some print calls to your shell script (for example, wrap the fastq-dump part in an echo and write the output to a file) to see what command it is actually trying to execute. The error message seems rather rare, so it may be worth asking SRA support about that too.
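
The echo-wrapping idea could be sketched as follows. The helper name preview_dump_commands is hypothetical; it reads accessions from stdin, prints the fastq-dump command it would run for each, and fails loudly when nothing arrives. One possible, unconfirmed explanation for the error is an empty accession list: GNU xargs runs its command once even on empty input unless -r is given, so fastq-dump would be called with no accession at all.

```shell
# Print the fastq-dump command for each accession read from stdin instead of
# running it, and return an error if no accession was received.
preview_dump_commands() {
    local count=0 acc
    while IFS= read -r acc; do
        [ -z "$acc" ] && continue          # skip blank lines
        echo "fastq-dump --split-files $acc"
        count=$((count + 1))
    done
    if [ "$count" -eq 0 ]; then
        echo "no accessions received on stdin" >&2
        return 1
    fi
}
```

In the batch script, replacing `xargs fastq-dump --split-files` with `preview_dump_commands > commands.txt` would show whether the esearch/efetch part produces any accessions on the compute node.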
