Question: How to use GNU parallel to download SRA files
Bioinfonext (Korea) wrote 8 months ago:

Hi,

How can I use GNU parallel to speed up downloading SRA files with the command below?

nohup /mnt//sratoolkit.2.8.2-1-centos_linux64/bin/fastq-dump --split-3 --gzip SRR1785709 SRR1785715 SRR1785721 SRR1785728 SRR1785734 SRR1785742 SRR1785744 >nohup.out &
Tags: rna-seq

If possible use EBI-ENA to get the fastq files directly.

Consider that you may be saturating incoming bandwidth on the network connection (once you get this to work). If you are on a shared machine/cluster that can cause issues for others.

(Reply by genomax, written 8 months ago)

@Pierre's parallel tutorial.

(Reply by genomax, written 8 months ago)
tiago211287 (USA) wrote 8 months ago:

"As the sizes of the datasets have increased, we have found that the traditional methods of FTP or HTTP do not have the performance characteristics needed to support this load of data. FTP performance degrades proportionally with the number of hops or switches the data must take to get to you. Aspera performance does not degrade with distance. Aspera is typically 10 times faster than FTP and reduces the chance of drops or time-outs in the middle of a transfer. Best-case transfer rates for ascp are ~600 Mbps, while typical rates are closer to 100-200 Mbps." (Aspera Transfer Guide)

GNU Parallel - Parallelize Serial Command Line Programs Without Changing Them

On the server/cluster:

Download Aspera Connect for Linux.

Run the installer script:

./aspera-connect-Version-linux-64.sh

Aspera will be installed under:

$HOME/.aspera/connect/bin/

Make a text file with the accession IDs. One way is to run cat with its output redirected to a new file, paste the IDs, and end the input with Ctrl+D:

cat > accessions.txt
SRR1346053
SRR1346054
SRR1346055
SRR1346056
SRR1346057
SRR1346058
SRR1346059
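
If you prefer not to paste interactively, a here-document does the same job from a script. This is only a sketch: it writes the same accessions.txt in the current directory, using the IDs listed above.

```shell
#!/bin/sh
# Write the accession list non-interactively (same IDs as above).
cat > accessions.txt <<'EOF'
SRR1346053
SRR1346054
SRR1346055
SRR1346056
SRR1346057
SRR1346058
SRR1346059
EOF

# Quick sanity check: number of accessions written.
wc -l < accessions.txt
```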

Use GNU parallel

parallel  --max-procs 1 --xapply ascp -v -k 1 -l50m -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh anonftp@ftp.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR134/{1}/{1}.sra $HOME/OUTPUT/FOLDER/ :::: accessions.txt

Explanation

--max-procs 1 -> downloads only one item at a time

-v -> verbose mode

-k 1 -> allows incomplete transfers to be resumed

-l50m -> limits the bandwidth to 50 Mbps (about 6 MB/s)

-i asperaweb_id_dsa.openssh -> the SSH private key file shipped with Aspera Connect, used to authenticate

{SRR|ERR|DRR} should be either ‘SRR’, ‘ERR’, or ‘DRR’ and should match the prefix of the target .sra file

path to the files: /sra/sra-instant/reads/ByRun/sra/{SRR|ERR|DRR}/<first 6 characters of accession>/<accession>/<accession>.sra
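
The command above hard-codes SRR134, so it only works for accessions that start with those six characters. As a rough sketch of the path rule just described, a small helper can build the remote path from any accession. The function name sra_path is hypothetical, and ERR1234567 is a made-up accession used only to show a different prefix.

```shell
#!/bin/sh
# Build the remote path for a run accession, following the layout:
# /sra/sra-instant/reads/ByRun/sra/<prefix>/<first 6 chars>/<accession>/<accession>.sra
sra_path() {
    acc=$1
    prefix=$(printf '%.3s' "$acc")   # SRR, ERR or DRR
    first6=$(printf '%.6s' "$acc")   # e.g. SRR134
    echo "/sra/sra-instant/reads/ByRun/sra/$prefix/$first6/$acc/$acc.sra"
}

sra_path SRR1346053
sra_path ERR1234567   # hypothetical accession, only to show another prefix
```

The printed path is what would follow anonftp@ftp.ncbi.nlm.nih.gov: in the ascp call.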

Convert all .sra files to raw FASTQ files:

find $PWD -name "*.sra" | parallel --max-procs N fastq-dump --split-files {1}

N = number of simultaneous instances (at most the number of cores available to you).
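
One way to choose N is to cap it at a fixed limit instead of taking every core. The sketch below assumes a limit of 4 (MAX_JOBS is a hypothetical value, agree on a real one with your admins) and only starts the conversion when both parallel and fastq-dump are installed.

```shell
#!/bin/sh
# Cap the number of simultaneous fastq-dump jobs.
MAX_JOBS=4   # hypothetical limit, not a recommendation

pick_jobs() {
    # $1 = cores available, $2 = upper limit; prints the smaller value (at least 1)
    n=$1
    if [ "$n" -gt "$2" ]; then n=$2; fi
    if [ "$n" -lt 1 ]; then n=1; fi
    echo "$n"
}

CORES=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
N=$(pick_jobs "${CORES:-1}" "$MAX_JOBS")
echo "would run $N fastq-dump job(s) at a time"

# Only run the conversion when both tools are available.
if command -v parallel >/dev/null 2>&1 && command -v fastq-dump >/dev/null 2>&1; then
    find "$PWD" -name "*.sra" | parallel --max-procs "$N" fastq-dump --split-files {}
fi
```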


While all this is great information, OP (Bioinfonext) should definitely talk with local cluster admins before doing this. It could put a lot of load on the head node (if run there) and/or gum up the network (such that no one else may be able to do anything).

(Reply by genomax, written 8 months ago)

You are totally right. I made that mistake myself when I was learning. One way to use this without upsetting coworkers is to limit the bandwidth to 50 Mbps or less ("-l50m") and always set parallel's --max-procs parameter to a low value.

Talking with the admin is a good idea.

(Reply by tiago211287, written 8 months ago)

If you're adding max-procs and setting it to a single thread - there's no point parallelising...? (Concerns about OP sucking up all the bandwidth aside).

(Reply by jrj.healey, written 8 months ago)

Actually, there is: with --max-procs 1 you avoid writing a shell loop.

(Reply by tiago211287, written 8 months ago)
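
For comparison, the loop that --max-procs 1 stands in for might look roughly like the sketch below. download_one is a hypothetical placeholder that only prints what it would fetch, and the temporary list is sample input so the snippet runs on its own; in practice you would read your real accessions.txt.

```shell
#!/bin/sh
# Plain-shell counterpart of "parallel --max-procs 1 ... :::: accessions.txt".
download_one() {
    # Hypothetical stand-in for the real ascp call.
    echo "would download $1"
}

# Sample input so the sketch is self-contained.
list=$(mktemp)
printf '%s\n' SRR1346053 SRR1346054 SRR1346055 > "$list"

while IFS= read -r acc; do
    [ -n "$acc" ] || continue      # skip blank lines
    download_one "$acc"
done < "$list"

rm -f "$list"
```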

IMO there is generally no point in trying to parallelize downloads. It will not magically increase your download bandwidth, nor the speed at which any decently configured server serves you files. Instead, it might lead to the server flagging you and banning your IP address.

(Reply by 5heikki, written 7 weeks ago)
sutturka (USA) wrote 8 months ago:

There is a parallel-fastq-dump utility available which might be useful. I have yet to test its performance and will update the answer soon.


Provided you have no I/O bottleneck, it is a very nice wrapper around fastq-dump. fasterq-dump is now also available in the current SRA Toolkit; I have not tested it yet.

(Reply by ATpoint, written 7 weeks ago)
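
As an untested sketch of what a fasterq-dump run over some of the accessions from the question could look like: fasterq-dump is multi-threaded on its own, so the runs are processed one after another. THREADS=4 and OUTDIR=fastq are assumptions, and the script only prints the commands unless DRY_RUN is set to 0. As far as I know fasterq-dump has no --gzip option, so compression would be a separate step; check fasterq-dump --help for your version.

```shell
#!/bin/sh
# Hypothetical settings: thread count and output directory are assumptions.
THREADS=4
OUTDIR=fastq
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, 0 = actually run them

dump_one() {
    # $1 = run accession
    if [ "$DRY_RUN" = 1 ]; then
        echo "fasterq-dump --split-3 --threads $THREADS --outdir $OUTDIR $1"
    else
        fasterq-dump --split-3 --threads "$THREADS" --outdir "$OUTDIR" "$1"
    fi
}

for acc in SRR1785709 SRR1785715 SRR1785721; do
    dump_one "$acc"
done
```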
Minstein wrote 7 weeks ago:

seq 5260 5274 | parallel -j 8 wget -P ~/GSE62129 ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR160/SRR160{}/SRR160{}.sra
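
To see what that one-liner will fetch before starting 8 simultaneous downloads, you can print the URLs first. The sketch below rebuilds them with a plain loop (list_urls is a hypothetical helper name); GNU parallel's --dry-run option, which prints the commands instead of running them, serves the same purpose.

```shell
#!/bin/sh
# Print the URLs that the seq | parallel one-liner above would hand to wget.
base=ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR160

list_urls() {
    # $1 = first numeric suffix, $2 = last numeric suffix
    for n in $(seq "$1" "$2"); do
        echo "$base/SRR160$n/SRR160$n.sra"
    done
}

list_urls 5260 5274
```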

Refer to: https://www.slashroot.in/how-run-multiple-commands-parallel-linux
