Question: How to use GNU parallel to download SRA files
Asked 4 months ago by Bioinfonext (Korea):

Hi,

How can I use GNU parallel to speed up downloading SRA files with the command below?

nohup /mnt//sratoolkit.2.8.2-1-centos_linux64/bin/fastq-dump --split-3 --gzip SRR1785709 SRR1785715 SRR1785721 SRR1785728 SRR1785734 SRR1785742 SRR1785744 >nohup.out &
Tags: rna-seq

If possible use EBI-ENA to get the fastq files directly.

Consider that you may be saturating incoming bandwidth on the network connection (once you get this to work). If you are on a shared machine/cluster that can cause issues for others.

(comment written 4 months ago by genomax)

See also @Pierre's parallel tutorial.

(comment written 4 months ago by genomax)
Answer by tiago211287 (USA), 4 months ago:

As the sizes of the datasets have increased, we have found that the traditional methods of FTP or HTTP do not have the performance characteristics needed to support this load of data. FTP performance degrades proportionally with the number of hops or switches the data must take to get to you. Aspera performance does not degrade with distance. Aspera is typically 10 times faster than FTP and reduces the chance of drops or time-outs in the middle of a transfer. Best-case transfer rates for ascp are ~ 600 Mbps, while typical rates are closer to 100-200 Mbps. (Source: Aspera Transfer Guide)

Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them

On the server/cluster:

Download Aspera Connect for Linux.

Run the installer shell script:

./aspera-connect-Version-linux-64.sh

The Aspera binaries will be installed under:

$HOME/.aspera/connect/bin/
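
If ascp is not found by name after installation, the directory may need to be added to PATH. A minimal sketch for the current shell session, assuming the default install location shown above (ASPERA_BIN is just an illustrative variable name):

```shell
# Assumed default install location of Aspera Connect (taken from the step above).
ASPERA_BIN="$HOME/.aspera/connect/bin"

# Prepend the directory to PATH only if it is not already listed.
case ":$PATH:" in
  *":$ASPERA_BIN:"*) ;;
  *) PATH="$ASPERA_BIN:$PATH"; export PATH ;;
esac

# Report whether ascp can now be found by name.
if command -v ascp >/dev/null 2>&1; then
  echo "ascp found at: $(command -v ascp)"
else
  echo "ascp not found; check that the installer finished" >&2
fi
```

To make this permanent, the same PATH line could go into your shell startup file.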

Create a text file with the accession IDs, one per line. One way is to run cat with its output redirected to a new file, paste the IDs, and end the input with Ctrl+D:

cat > accessions.txt
SRR1346053
SRR1346054
SRR1346055
SRR1346056
SRR1346057
SRR1346058
SRR1346059
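
In a script, the same file can be written without interactive input by using a here-document. A minimal sketch with the accessions listed above:

```shell
# Write the accession IDs to accessions.txt, one per line.
# The quoted 'EOF' delimiter keeps the shell from expanding anything inside.
cat > accessions.txt <<'EOF'
SRR1346053
SRR1346054
SRR1346055
SRR1346056
SRR1346057
SRR1346058
SRR1346059
EOF

# Sanity check: print how many IDs were written.
wc -l < accessions.txt
```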

Use GNU parallel

parallel  --max-procs 1 --xapply ascp -v -k 1 -l50m -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh anonftp@ftp.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR134/{1}/{1}.sra $HOME/OUTPUT/FOLDER/ :::: accessions.txt

Explanation

--max-procs 1 -> downloads only one item at a time

-v -> verbose mode

-k 1 -> allows you to restart incomplete transfers

-l50m -> limits the bandwidth to 50 Mbps (about 6 MB/s)

-i asperaweb_id_dsa.openssh -> the private key file used for public-key authentication

{SRR|ERR|DRR} should be either ‘SRR’, ‘ERR’, or ‘DRR’ and should match the prefix of the target .sra file

path to the files: /sra/sra-instant/reads/ByRun/sra/{SRR|ERR|DRR}/<first 6 characters of accession>/<accession>/<accession>.sra
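
The command above hard-codes SRR134, which only works while every accession shares those first six characters. The path layout can instead be derived from each accession. A minimal sketch, where sra_path is a hypothetical helper name and not part of Aspera or the SRA Toolkit:

```shell
# sra_path ACCESSION
# Prints the remote path for one run, following the layout described above:
# /sra/sra-instant/reads/ByRun/sra/<prefix>/<first 6 characters>/<accession>/<accession>.sra
sra_path() {
  acc=$1
  prefix=$(printf '%s' "$acc" | cut -c1-3)   # SRR, ERR or DRR
  first6=$(printf '%s' "$acc" | cut -c1-6)   # e.g. SRR134
  case $prefix in
    SRR|ERR|DRR) ;;
    *) echo "unexpected accession prefix: $acc" >&2; return 1 ;;
  esac
  printf '/sra/sra-instant/reads/ByRun/sra/%s/%s/%s/%s.sra\n' \
    "$prefix" "$first6" "$acc" "$acc"
}

sra_path SRR1346053
# prints /sra/sra-instant/reads/ByRun/sra/SRR/SRR134/SRR1346053/SRR1346053.sra
```

Each printed path can then be appended to anonftp@ftp.ncbi.nlm.nih.gov: as the source argument of ascp.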

Convert all .sra files to raw FASTQ files:

find "$PWD" -name "*.sra" | parallel --max-procs N fastq-dump --split-files {1}

N = number of simultaneous instances (at most the number of cores available to process requests).
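
One way to pick N is to derive it from the number of cores on the machine. A minimal sketch, assuming that half of the cores is an acceptable share on a shared machine (that fraction is an arbitrary choice, so adjust it to your local policy):

```shell
# Number of online cores; fall back to 1 if it cannot be determined.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Use half of the cores, but never fewer than one job.
jobs=$(( cores / 2 ))
if [ "$jobs" -lt 1 ]; then
  jobs=1
fi
echo "running up to $jobs fastq-dump instances at once"

# Run the conversion only when both tools are installed.
if command -v parallel >/dev/null 2>&1 && command -v fastq-dump >/dev/null 2>&1; then
  find "$PWD" -name "*.sra" | parallel --max-procs "$jobs" fastq-dump --split-files {}
fi
```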


While all this is great information, OP (Bioinfonext) should definitely talk with the local cluster admins before doing this. It could put a lot of load on the head node (if run there) and/or gum up the network (such that no one else may be able to do anything).

(comment written 4 months ago by genomax)

You are totally right. I made that mistake myself when I was learning. One way to use it without upsetting coworkers is to limit the bandwidth to 50 Mbps or less ("-l50m") and always set the --max-procs parameter in parallel to a low value.

Talking with the admin is a good idea.

(comment written 4 months ago by tiago211287)

If you're adding max-procs and setting it to a single thread - there's no point parallelising...? (Concerns about OP sucking up all the bandwidth aside).

(comment written 4 months ago by jrj.healey)

Actually, there is: with --max-procs 1, parallel does the job of a shell loop, so you don't have to write one.

(comment written 4 months ago by tiago211287)
Answer by sutturka (USA), 4 months ago:

There is a parallel-fastq-dump utility available which might be useful. I have yet to test its performance and will update the answer soon.
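
A minimal sketch of how it might be invoked, assuming the parallel-fastq-dump command is installed and that its --sra-id, --threads and --outdir options behave as described in its README. The accession, thread count and output directory are placeholders, and this is untested here:

```shell
# Placeholder values; adjust to your data and machine.
acc="SRR1346053"
threads=4
outdir="out"

# Run only if the tool is available on this machine.
if command -v parallel-fastq-dump >/dev/null 2>&1; then
  mkdir -p "$outdir"
  # Extra options such as --split-files and --gzip are passed on to fastq-dump.
  parallel-fastq-dump --sra-id "$acc" --threads "$threads" \
    --outdir "$outdir" --split-files --gzip
else
  echo "parallel-fastq-dump is not installed; skipping" >&2
fi
```

Note that this speeds up the conversion of a single run by splitting it across threads; it is not a bandwidth limiter, so the earlier caveats about shared machines still apply.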
