parallel downloads from SRA with the SRA Toolkit, or other ways to speed up downloads
22 months ago
ptellier • 0

Is there a way to parallelize downloads from NCBI using the SRA Toolkit on an HPC cluster? I tried using GNU parallel, but I cannot tell whether the downloads are actually doing anything:

parallel -j 4 fasterq-dump --threads 4 --progress {} :::: /home/ptellier/scratch/phillip/data/escc_data/SRA_accessions.txt

Unlike a plain fasterq-dump --progress run, I don't see any progress output when the jobs run under parallel.
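For what it's worth, GNU parallel by default holds each job's output and only prints it when the job finishes, so an in-place progress bar will look silent. A more reliable way to confirm the jobs are alive is a job log plus watching the output file sizes. A sketch (the accession-list path is taken from the question; the guard is just so the snippet is safe to paste):

```shell
# Record each job's start time, runtime, and exit status in fasterq.log,
# and pass output through as it is produced (--line-buffer) instead of
# holding it until the job finishes (parallel's default behaviour).
ACC_LIST=/home/ptellier/scratch/phillip/data/escc_data/SRA_accessions.txt

if command -v parallel >/dev/null 2>&1 && [ -f "$ACC_LIST" ]; then
    parallel -j 4 --joblog fasterq.log --line-buffer \
        fasterq-dump --threads 4 --progress {} :::: "$ACC_LIST"
fi

# From another shell, confirm things are moving:
#   tail -f fasterq.log     # one line appears per completed accession
#   du -sh ./*.fastq        # output files should keep growing
```

Note that fasterq-dump's carriage-return progress bar may still not render cleanly under parallel; the job log and growing files are the dependable signal.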

So far, when I run the downloads in a for loop, each download runs at a few MB/s, and there is around 800 GB to download to the cluster:

for d in $SRA_DOWNLOADS
do
   echo "downloading $d from the Sequence Read Archive"
   fasterq-dump --threads 4 --progress "$d"
done

Is there anything else I can do to speed up these large downloads?

This is the data from SRA run selector that I was trying to access: https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA672851&o=acc_s%3Aa

SRAToolkit HPC Bash SRA • 1.2k views

Ultimately there is a bandwidth limit in play, and it could be in one or more places: your local HPC admins may throttle how much bandwidth you can use, your school or institution may throttle traffic as a whole, or a firewall may be inspecting every packet going in and out of your network. NCBI also likely limits how much bandwidth each IP address can use. Finally, if your storage system is not performant, disk writes may not keep up, slowing the downloads further.

Have you considered the possibility that you may already be getting the best speeds possible under the circumstances? With large downloads you will need to be patient; it may take a couple of days to complete them.
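One thing worth testing before concluding you are bandwidth-bound: NCBI's documented fast path is to prefetch the .sra file first (prefetch uses multiple connections and can resume interrupted transfers) and then run fasterq-dump against the local copy, rather than letting fasterq-dump stream and convert in one step. A sketch, reusing the accession list from the question (the function wrapper and the PATH guard are just for illustration):

```shell
# Two-step download: prefetch each run, then convert the local .sra to FASTQ.
fetch_all() {
    while read -r acc; do
        prefetch "$acc"                  # multi-connection, resumable download
        fasterq-dump --threads 4 "$acc"  # converts the local .sra copy
    done < "$1"
}

# Run only if the SRA Toolkit is on PATH:
if command -v prefetch >/dev/null 2>&1; then
    fetch_all SRA_accessions.txt
fi
```

Conversion from a prefetched file also tends to fail less often than streaming, which matters over an 800 GB transfer.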

You could also try: https://github.com/nf-core/fetchngs


This doesn't parallelize the download, but you can also use SRA Explorer (https://sra-explorer.info/#). The site can generate a bash script for you that downloads every sample in the project.

If your HPC cluster uses a job scheduler like Slurm, you can also submit a separate job for each accession so the downloads run in parallel.
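For example, a minimal Slurm array-job sketch; the script name, array size, time limit, and accession-list filename are all assumptions to adapt to your cluster:

```shell
#!/bin/bash
#SBATCH --job-name=sra-fetch
#SBATCH --cpus-per-task=4
#SBATCH --time=12:00:00
#SBATCH --array=1-80%8   # assumes 80 accessions; at most 8 download at once

# Pick the accession for this array index (one accession per line in the list).
ACC=$(sed -n "${SLURM_ARRAY_TASK_ID}p" SRA_accessions.txt)

fasterq-dump --threads "$SLURM_CPUS_PER_TASK" --progress "$ACC"
```

Submit with sbatch; the %8 cap keeps you from opening dozens of simultaneous connections to NCBI, which the answer above notes may be rate-limited per IP anyway.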
