There are many ways to get FASTQ data out of the Short Read Archive (SRA). I have spent a few hours today investigating the various approaches. (Edit: see also the many alternatives posted as followups)
- if you want a subset of the reads, say 1000 reads, use
fastq-dump -X 1000 SRR14575325
- if you want the entire file use
fasterq-dump SRR14575325
- if you want to be in full control, find the URLs, then use curl to get the data
- if you feel lucky use
We will be downloading the file for run SRR14575325.
Note that some methods cache files and thus end up storing both the SRA and FASTQ files. For those tools, subsequent FASTQ conversions will be faster. I am cleaning the cache in my examples only to ensure that I measure the performance correctly.
Some examples require tools from sratools; to install them use:

# Currently installs version 2.9
conda install sra-tools
or visit the webpage and download the binaries. Currently, the latest version of sratools is 2.11.
Use fastq-dump directly:
# Clean the cache
rm -f ~/ncbi/public/sra/SRR14575325*

# Convert ten reads
time fastq-dump SRR14575325 -X 10   # 1 second

# Convert all reads
time fastq-dump SRR14575325         # 5 minutes
Total time 5 minutes.
fastq-dump stores the SRA file in a cache folder; on my system it is located in ~/ncbi/public/sra/. A subsequent fastq-dump on the same accession will take 1 minute. The principal advantage of fastq-dump over all other methods is that it supports the partial download of data.
fasterq-dump is the future replacement for fastq-dump. According to the documentation, it requires up to 10x as much disk space as the final file. In addition, it does not yet support downloading a subset of the data the way fastq-dump does:
# Clean your cache
rm -f ~/ncbi/public/sra/SRR14575325*

# Convert all reads
time fasterq-dump -f SRR14575325    # 1.1 minutes
Total time 1 minute.
fasterq-dump also stores the data in the cache. Subsequent runs take 30 seconds.
Download the SRA file directly
The challenge here is to find the proper URL. For example, the SRA file URL is in the 10th column of the output of:
# Find the URL
efetch -db sra -id SRR14575325 -format runinfo | cut -f 10 -d ,
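If the runinfo output is saved to a file, the URL column can be pulled out without re-querying. A sketch using a mock two-line CSV (the real output has many more columns; the column names and the example URL below are stand-ins for illustration):

```shell
# Mock runinfo CSV: header plus one data row, with the URL in column 10
printf 'Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path\n' > runinfo.csv
printf 'SRR14575325,,,,,,,,,https://example.org/SRR14575325.1\n' >> runinfo.csv

# Skip the header line, then extract the 10th comma-separated field
URL=$(tail -n +2 runinfo.csv | cut -f 10 -d ,)
echo $URL   # https://example.org/SRR14575325.1
```

The same `tail -n +2 | cut -f 10 -d ,` pipe works on the real efetch output.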
Download an SRA file locally and use that:
# Clean your cache
rm -f ~/ncbi/public/sra/SRR14575325*

# Download the SRA file
URL1=https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos3/sra-pub-run-25/SRR14575325/SRR14575325.1
time wget $URL1             # 52 seconds

# Convert all reads
fastq-dump SRR14575325.1    # 1 minute
Total time 2 minutes. As before, we end up with both an SRA and a FASTQ file.
The prefetch command downloads an SRA file, then stores it in a cache directory. The behavior of prefetch has changed: versions before 2.10 download files into the cache directory, while versions 2.10 and above download them into the current directory.
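Since the behavior depends on the version, a script may want to branch on it. A sketch of the version comparison, using a hypothetical hard-coded version string (in practice you would parse the output of prefetch --version):

```shell
# Hypothetical version string; parse it from `prefetch --version` in real use
VER=2.11.2
MAJOR=$(echo $VER | cut -d . -f 1)
MINOR=$(echo $VER | cut -d . -f 2)

# Versions 2.10 and above download into the current directory
if [ "$MAJOR" -gt 2 ] || { [ "$MAJOR" -eq 2 ] && [ "$MINOR" -ge 10 ]; }; then
    echo "current directory"
else
    echo "cache directory"
fi
```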
The new versions of prefetch do not operate seamlessly with fastq-dump anymore. For versions under 2.10 the two commands:
prefetch SRR14575325
fastq-dump SRR14575325
would both make use of the same files. Alas with the new version, you would need to run them like so:
prefetch SRR14575325
fastq-dump SRR14575325/SRR14575325.sra
¯\_(ツ)_/¯ ... all in the name of progress I guess. Just remember that commands and examples in training materials may not work correctly anymore. Some people claim that
prefetch can download fastq files with
prefetch --type fastq SRR14575325
but when I tried it I got:
2021-10-14T18:01:00 prefetch.2.11.2 err: name not found while resolving query within virtual file system module - failed to resolve accession 'SRR1972739' - no data ( 404 )
Getting these weird errors with sratools is not uncommon. Various fixes exist (Google for them), yet no solution seems reliable enough; see: https://github.com/ncbi/sra-tools/issues/35
If you get this error, try some of the fixes, or just pick a different method from the list.
But let's continue the journey; we ran the commands below with version 2.9:
# Clean the cache
rm -f ~/ncbi/public/sra/SRR14575325*

# Download the SRA file
time prefetch SRR14575325           # 57 seconds

# Convert ten reads
time fastq-dump SRR14575325 -X 10   # 0 seconds

# Convert all reads
time fastq-dump SRR14575325         # 1 minute
Total time of 2 minutes. The SRA file is stored in the cache.
Subsequent conversions with fastq-dump will take 1 minute since they use the cached file.
Download from EBI
To find the EBI link to an SRA file, use:
curl -X GET "https://www.ebi.ac.uk/ena/portal/api/filereport?accession=SRR14575325&fields=fastq_ftp&result=read_run"
run_accession	fastq_ftp
SRR14575325	ftp.sra.ebi.ac.uk/vol1/fastq/SRR145/025/SRR14575325/SRR14575325.fastq.gz
Let's use the EBI link:
URL=https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR145/025/SRR14575325/SRR14575325.fastq.gz
wget $URL
The download was slow, with an estimated time of 15 minutes; I did not wait for it to finish. The next day I tried again and the download was much faster, under a minute. Your mileage may vary.
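The filereport output can also be parsed programmatically rather than copying the link by hand. A sketch, where the printf merely simulates the tab-separated response shown above so the pipe can be demonstrated offline:

```shell
# Simulate the filereport response: a header line plus one tab-separated row
printf 'run_accession\tfastq_ftp\nSRR14575325\tftp.sra.ebi.ac.uk/vol1/fastq/SRR145/025/SRR14575325/SRR14575325.fastq.gz\n' > report.txt

# Skip the header, take the second column, and prepend the protocol
URL=$(tail -n +2 report.txt | cut -f 2 | sed 's|^|https://|')
echo $URL   # https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR145/025/SRR14575325/SRR14575325.fastq.gz
```

In real use you would pipe the curl command above into the same tail/cut/sed chain, then hand the result to wget.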
Question: I just need to make sure of something: you clean the cache here with
rm -f ~/ncbi/public/sra/SRR14575325*
even though you did not download the file before, just in case? Does it really affect the download time significantly?
It is generally a good idea to clean the cache from time to time. Often, we are using a large partition with much larger space/quota to download data for analysis, but with the default setting, the cached files end up in our home directories anyway and clutter it up. Either you configure the toolkit to use a different directory, or check regularly.
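Redirecting the cache to a larger partition can be done with the vdb-config tool from sratools. A sketch, where the target path is only an example and option names have shifted between sratools versions, so check your installed version's help first:

```shell
# Point the public cache at a roomier location (example path)
vdb-config --set /repository/user/main/public/root=/scratch/ncbi

# Or review and change the settings interactively
vdb-config -i
```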
Quick questions: Is it possible that if you are downloading 10+ gigabytes of RNA-seq data from the NCBI SRA archive using fastq-dump or fasterq-dump over a WiFi connection, you may not acquire the total data? Is there a command-line tool to check the integrity of the data? If we previously downloaded the same data, is it a smart move to clean the cache after deleting those files?
A WiFi connection for that much data sounds like a bad idea, but if you have great upstream connectivity (e.g. fiber) and a new WiFi 6 router then it may be ok ... as long as you are patient.
The vdb-validate program included in sratoolkit will allow you to validate the downloaded data.
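For a prefetched accession, running vdb-validate SRR14575325 checks the SRA file. For plain FASTQ files fetched over the network, a generic checksum comparison also works (the EBI filereport API can return a fastq_md5 field for this purpose). A minimal, self-contained sketch of the comparison, using a tiny stand-in file instead of a real download:

```shell
# Stand-in for a downloaded FASTQ file
printf '@read1\nACGT\n+\nIIII\n' > SRR_demo.fastq

# Checksum recorded at download time (computed on the spot for this demo)
EXPECTED=$(md5sum SRR_demo.fastq | cut -d ' ' -f 1)

# Later: recompute and compare, to detect a truncated or corrupted download
ACTUAL=$(md5sum SRR_demo.fastq | cut -d ' ' -f 1)
if [ "$EXPECTED" = "$ACTUAL" ]; then echo "OK"; else echo "MISMATCH"; fi
```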