2.5 years ago
melissachua90
I want to download the SRP325386 dataset for a class deep sequencing analysis project. It contains 199898 samples.
I used the following command:
esearch -db sra -query SRP325386 | efetch -format runinfo | cut -d ',' -f 1 | grep SRR | xargs -n 1 -P 4 fastq-dump --split-3 --gzip --skip-technical --readids -W --read-filter pass
After 12 hours, it's still downloading (5152 items thus far).
Did I use the wrong command? Is there a more efficient way to download datasets from SRA?
You could try replacing fastq-dump with fasterq-dump. Alternatively, have a look at nf-core fetchngs, which makes the download even more convenient because it parallelizes the downloading for you, whereas your command is limited to the four fastq-dump processes that xargs starts at a time.
Thanks! Another trivial question: what constitutes an SRA "dataset"? Is it denoted by an "SRP" prefix?
SRA is the Sequence Read Archive operated by the NCBI. Its accessions come in several levels: studies/projects (SRP), experiments (SRX), runs (SRR) and samples (SRS, each linked to a BioSample accession such as SAMN). Typically, one will want to download all the data of a study/project, but arbitrary subsets are also possible.
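To see how these accession levels relate within one study, the runinfo table that efetch returns can be reduced to just the accession columns. This is a minimal sketch: accession_table is a made-up helper name, the column names (SRAStudy, Experiment, Sample, BioSample, Run) are assumed to match the runinfo header, and the naive comma split assumes no quoted field contains a comma.

```shell
# Hypothetical helper: read a runinfo CSV on stdin and print one tab-separated
# line per run: study, experiment, sample, BioSample, run.
# Columns are looked up by header name (assumed names, see above).
accession_table() {
    awk -F',' -v OFS='\t' '
        NR == 1 {
            # Remember the position of every column by its header name.
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        # Keep only data rows; this also skips any repeated header lines.
        $(col["Run"]) ~ /^[SED]RR/ {
            print $(col["SRAStudy"]), $(col["Experiment"]), $(col["Sample"]), $(col["BioSample"]), $(col["Run"])
        }
    '
}
```

It could be used as: esearch -db sra -query SRP325386 | efetch -format runinfo | accession_table | head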
fetchngs should be able to resolve whatever SRA ID it is provided with and download the accompanying data. Getting started takes a bit more work, but once you know how to run Nextflow and the nf-core pipelines, they are a huge time-saver, and you usually get results that follow best practices.
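Both suggestions above could look roughly like the sketch below. The function names, the fastq and fetchngs_results output directories, the thread count and the docker profile are all assumptions for illustration. Note that fasterq-dump does not have the --gzip option that fastq-dump offers, so compression is a separate step here, and the read-filter and --readids options from the original command are simply left out.

```shell
# Hypothetical helper: extract unique run accessions (SRR/ERR/DRR) from the
# first column of a runinfo CSV on stdin.
runs_from_runinfo() {
    cut -d ',' -f 1 | grep -E '^[SED]RR[0-9]+$' | sort -u
}

# Hypothetical helper: download one run. prefetch fetches the .sra file,
# fasterq-dump converts it to FASTQ, and gzip compresses the result.
download_run() {
    run="$1"
    prefetch "$run" &&
    fasterq-dump --split-3 --skip-technical --threads 4 --outdir fastq "$run" || return 1
    # Single-end runs give RUN.fastq, paired-end runs give RUN_1/RUN_2.fastq.
    for f in fastq/"$run".fastq fastq/"$run"_[12].fastq; do
        if [ -f "$f" ]; then gzip "$f"; fi
    done
}

# Hypothetical helper: hand a single accession (for example a study ID) to
# nf-core/fetchngs. The docker profile is an assumption; use whichever
# profile fits your system.
run_fetchngs() {
    printf '%s\n' "$1" > ids.csv
    nextflow run nf-core/fetchngs -profile docker --input ids.csv --outdir fetchngs_results
}
```

A possible use would be: esearch -db sra -query SRP325386 | efetch -format runinfo | runs_from_runinfo | while read -r run; do download_run "$run"; done, or alternatively run_fetchngs SRP325386.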
You have a huge number of samples. Are you sure you actually want to download them all? I hope you have enough storage/bandwidth locally since you may run out of one of them before SRA does :-)
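For a sense of scale: if the rate reported above (5152 runs in 12 hours) stayed constant, all 199898 runs would take roughly 19 days. The storage side can be estimated before downloading anything by summing the size column of the runinfo table. This is a sketch only: total_size_gb is a made-up name, the column is assumed to be called size_MB, and that figure describes the compressed .sra files, so the extracted FASTQ files will usually need more space.

```shell
# Hypothetical helper: read a runinfo CSV on stdin and report the number of
# runs and their summed size, taken from the (assumed) size_MB column.
total_size_gb() {
    awk -F',' '
        NR == 1 {
            # Locate the size column by its header name.
            for (i = 1; i <= NF; i++) if ($i == "size_MB") c = i
            next
        }
        # Count only rows with a numeric size; repeated headers are skipped.
        c && $c ~ /^[0-9]+$/ { mb += $c; n++ }
        END { printf "%d runs, %.1f GB\n", n, mb / 1024 }
    '
}
```

It could be used as: esearch -db sra -query SRP325386 | efetch -format runinfo | total_size_gb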