3
Entering edit mode
4.8 years ago
Bioinfonext ▴ 420

Hi,

How can I use GNU parallel to speed up downloading SRA files in the command below:

nohup /mnt//sratoolkit.2.8.2-1-centos_linux64/bin/fastq-dump --split-3 --gzip SRR1785709 SRR1785715 SRR1785721 SRR1785728 SRR1785734 SRR1785742 SRR1785744 >nohup.out &

RNA-Seq • 2.9k views
1
Entering edit mode

If possible use EBI-ENA to get the fastq files directly.

Consider that you may be saturating incoming bandwidth on the network connection (once you get this to work). If you are on a shared machine/cluster that can cause issues for others.
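A sketch of the ENA route, assuming ENA's documented fastq directory layout (vol1/fastq/<first 6 chars>/00<last digit>/<run>/ for 10-character run accessions); the accession is taken from the question, and nothing is actually downloaded here:

```shell
# Build the ENA fastq URL for one run accession (layout assumed, verify on ENA).
acc=SRR1785709
dir6=$(printf '%s' "$acc" | cut -c1-6)        # SRR178
sub="00$(printf '%s' "$acc" | tail -c 1)"     # 009
url="ftp://ftp.sra.ebi.ac.uk/vol1/fastq/${dir6}/${sub}/${acc}/${acc}_1.fastq.gz"
echo "$url"
# With a file of accessions, one per line, downloads can then run in parallel:
# cat accessions.txt | parallel -j 4 wget "<url built as above for {}>"
```

This skips fastq-dump entirely, since ENA serves gzipped fastq files directly.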

1
Entering edit mode

@Pierre's parallel tutorial.

3
Entering edit mode
4.8 years ago
tiago211287 ★ 1.4k

As the sizes of the datasets have increased, we have found that the traditional methods of FTP or HTTP do not have the performance characteristics needed to support this load of data. FTP performance degrades proportionally with the number of hops or switches the data must take to get to you. Aspera performance does not degrade with distance. Aspera is typically 10 times faster than FTP and reduces the chance of drops or time-outs in the middle of a transfer. Best-case transfer rates for ascp are ~ 600 Mbps, while typical rates are closer to 100-200 Mbps. (Aspera Transfer Guide)

Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them

Inside the server/cluster do:

execute the shell script:

./aspera-connect-Version-linux-64.sh


aspera will be put on the path:

$HOME/.aspera/connect/bin/

Make a text file with the accession IDs. One way is to cat into an empty file and paste; end cat with Ctrl+D:

cat > accessions.txt
SRR1346053
SRR1346054
SRR1346055
SRR1346056
SRR1346057
SRR1346058
SRR1346059

Use GNU parallel:

parallel --max-procs 1 --xapply ascp -v -k 1 -l50m -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh anonftp@ftp.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR134/{1}/{1}.sra $HOME/OUTPUT/FOLDER/ :::: accessions.txt

Explanation:

--max-procs 1 -> allows the download of only 1 item at a time
-v -> verbose mode
-k 1 -> allows you to restart incomplete transfers
-l50m -> limits the bandwidth to 50 Mbps (~5 MB/second)
-i asperaweb_id_dsa.openssh -> public key
{SRR|ERR|DRR} should be either 'SRR', 'ERR', or 'DRR' and should match the prefix of the target .sra file

Path to the files:

/sra/sra-instant/reads/ByRun/sra/{SRR|ERR|DRR}/<first 6 characters of accession>/<accession>/<accession>.sra

Transform all .sra files into raw fastq files:

find $PWD -name "*.sra" | parallel --max-procs N fastq-dump --split-files {1}


N = number of simultaneous instances (maximum number of cores used to process requests).
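The path rule above can be sketched in shell (accession taken from the accessions.txt example; the directory layout is the one stated in the answer, not re-verified here):

```shell
# Build the NCBI sra-instant path for a run accession:
# /sra/sra-instant/reads/ByRun/sra/<SRR|ERR|DRR>/<first 6 chars>/<acc>/<acc>.sra
acc=SRR1346053
prefix3=$(printf '%s' "$acc" | cut -c1-3)   # SRR, ERR, or DRR
prefix6=$(printf '%s' "$acc" | cut -c1-6)   # first 6 characters, e.g. SRR134
path="/sra/sra-instant/reads/ByRun/sra/${prefix3}/${prefix6}/${acc}/${acc}.sra"
echo "$path"
```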

2
Entering edit mode

While all this is great information, OP (Bioinfonext ) should definitely talk with local cluster admins before doing this. It could put a lot of load on the head node (if run there) and/or gum up the network (such that no one else may be able to do anything).

0
Entering edit mode

You are totally right. I made this myself when I was learning. A way to use it without upsetting coworkers is to limit the network bandwidth to 50 Mbps or less ("-l50m") and always set the --max-procs parameter in parallel to a low value.

Talking with the admin is a good idea.

0
Entering edit mode

If you're adding max-procs and setting it to a single thread - there's no point parallelising...? (Concerns about OP sucking up all the bandwidth aside).

0
Entering edit mode

Actually, there is: even with --max-procs 1, parallel saves you from writing a loop over the accessions.

0
Entering edit mode
4.8 years ago
sutturka ▴ 180

There is a parallel-fastq-dump utility available which might be useful. I have yet to test its performance and will update the answer soon.
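A minimal sketch of how parallel-fastq-dump is typically invoked per its README (flags assumed, not tested here); unrecognized flags are passed through to fastq-dump:

```shell
# Assumed parallel-fastq-dump invocation (pip install parallel-fastq-dump);
# the command is built as a string here so nothing is downloaded.
cmd="parallel-fastq-dump --sra-id SRR1785709 --threads 4 --outdir fastq/ --split-3 --gzip"
echo "$cmd"
# eval "$cmd"   # uncomment to run once the tool is installed
```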

0
Entering edit mode

Provided you have no I/O bottleneck, it is a very nice wrapper around fastq-dump. There is now also fasterq-dump available in the current SRA Toolkit. I have not tested it yet.
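A hedged sketch of a fasterq-dump call (flags assumed from the SRA Toolkit docs, not tested here); unlike fastq-dump it is multithreaded on its own, but it does not write gzipped output:

```shell
# Assumed fasterq-dump invocation (SRA Toolkit >= 2.9); built as a string,
# nothing is downloaded.
cmd="fasterq-dump SRR1785709 --threads 6 --outdir fastq/"
echo "$cmd"
# gzip fastq/SRR1785709*.fastq   # compress afterwards if needed
```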

0
Entering edit mode
4.3 years ago
Min Dai ▴ 160

seq 5260 5274 | parallel -j 8 wget -P ~/GSE62129 ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR160/SRR160{}/SRR160{}.sra
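To see what the seq/{} substitution above expands to, here is a sketch using a plain read loop with echo in place of parallel and wget, so nothing is downloaded:

```shell
# Print the URL each parallel job would fetch, substituting {} like the answer does.
urls=$(seq 5260 5262 | while read -r n; do
  echo "ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR160/SRR160${n}/SRR160${n}.sra"
done)
echo "$urls"
```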