Is there a way to parallelize downloads from NCBI using SRA Toolkit on an HPC cluster? I tried using GNU parallel, but I cannot actually tell whether the downloads are doing anything:
cat < /home/ptellier/scratch/phillip/data/escc_data/SRA_accessions.txt | parallel -j 4 fasterq-dump --threads 4 --progress {}
Unlike a regular fasterq-dump --progress run, I can't see any progress output when the downloads are parallelized.
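GNU parallel captures each job's terminal output by default, so the fasterq-dump progress bars never reach the screen. One workaround (a sketch; the logs/ directory and joblog filename are my own choices) is to redirect each job's output to its own file, record job status with --joblog, and tail whichever log you want to watch:

```shell
mkdir -p logs

# Run 4 downloads at a time; each accession's output goes to its own log,
# and parallel records start/end times and exit codes in fasterq.joblog.
parallel -j 4 --joblog fasterq.joblog \
    'fasterq-dump --threads 4 --progress {} > logs/{}.log 2>&1' \
    :::: /home/ptellier/scratch/phillip/data/escc_data/SRA_accessions.txt

# Watch a single download's progress, e.g.:
#   tail -f logs/<accession>.log
# Or check overall status:
#   cat fasterq.joblog
```

The joblog also makes it easy to see which accessions failed (non-zero exit codes) and rerun only those.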
So far, when I run the downloads in a for loop, each download runs at a few megabytes per second, and there is around 800 GB to download to the cluster:
for d in $SRA_DOWNLOADS
do
    echo "downloading $d from the Sequence Read Archive"
    fasterq-dump --threads 4 --progress "$d"
done
Is there anything else I can do to speed up these large downloads?
This is the data from SRA run selector that I was trying to access: https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA672851&o=acc_s%3Aa
Ultimately, there is a bandwidth limit in play, and it could sit in one or more places: your local HPC admins may be throttling how much bandwidth you can use, the limit could apply to your school/institution as a whole, or a firewall could be inspecting every packet going in and out of your network. Finally, NCBI likely limits how much bandwidth each IP address can use. And if you don't have a performant storage system, disk writes may not keep up, slowing the downloads further.
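One concrete change worth trying: run prefetch first to pull down the compressed .sra archives (prefetch can resume an interrupted transfer if you simply rerun it), then run fasterq-dump on the local copies. This separates the network-bound download from the CPU- and disk-bound fastq conversion. A sketch, assuming the same one-accession-per-line list file:

```shell
# Stage 1: fetch the .sra archives; rerunning prefetch resumes partial downloads.
prefetch --option-file SRA_accessions.txt

# Stage 2: convert to fastq locally; fasterq-dump picks up the prefetched runs
# from the toolkit's configured output directory.
while read -r acc; do
    fasterq-dump --threads 4 --progress "$acc"
done < SRA_accessions.txt
```

Stage 2 involves no network traffic, so a failed conversion can be retried without re-downloading anything.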
Have you considered the possibility that you may already be getting the best speed possible under the circumstances? With downloads this large you will need to be patient; it may take a couple of days for them to complete.
You could also try: https://github.com/nf-core/fetchngs
This is not a parallel-download option, but you can also use SRA Explorer (https://sra-explorer.info/#). The site can generate a bash script for you that downloads all samples in the study.
If your HPC cluster uses a job scheduler such as Slurm, you can also submit one job per accession so the downloads run in parallel.
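On Slurm, that idea is most naturally expressed as an array job, with one task per line of the accession list. A sketch (the job name, resource requests, and the 1-50 array range are placeholders; match the range to the number of lines in your list):

```shell
#!/bin/bash
#SBATCH --job-name=sra-dl
#SBATCH --array=1-50              # one array task per accession in the list
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=12:00:00

# Take the accession on the line matching this task's array index
# (SRA_accessions.txt has one accession per line).
ACC=$(sed -n "${SLURM_ARRAY_TASK_ID}p" SRA_accessions.txt)

fasterq-dump --threads "$SLURM_CPUS_PER_TASK" --progress "$ACC"
```

Submit it once with sbatch and the scheduler fans the accessions out across nodes, subject to whatever per-user job limits your cluster enforces; note that NCBI's per-IP bandwidth cap may still bound the aggregate speed.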