I would like to share my twists and turns with SRA download. It took several interactions with NCBI staff to figure this out. Below are the steps.
Prerequisites: sra-toolkit and the Aspera plugin are installed. The instructions are specific to a Linux environment.
- Configure workspace
The workspace (the local cache) for downloading SRA data must be configured. Although sra-toolkit is installed centrally, this needs to be set manually for every user. Please follow this link and navigate to the section Configuring the Toolkit.
Assuming sra-toolkit is installed or loaded, run the following command and complete the setup as described in the link.
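The command referred to above does not appear in the post. sra-toolkit's configuration tool is vdb-config, and `vdb-config -i` opens its interactive screen, so that is presumably what was meant. Before running it, it can help to confirm that the tools are actually on your PATH. Here is a minimal sketch; `check_tools` is a hypothetical helper name, not part of sra-toolkit.

```shell
# Report which of the named tools are on PATH.
# Returns non-zero if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Typical use (left as a comment because vdb-config -i is interactive):
#   check_tools prefetch fastq-dump vdb-config && vdb-config -i
```

If anything is reported missing, load the sra-toolkit module (or fix your PATH) first, then set the cache location in the vdb-config screen.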
- Download SRA file
prefetch -X 200G SRR2095320 -a "/depot/bioinfo/apps/apps/aspera-connect-18.104.22.168545/bin/ascp|/depot/bioinfo/apps/apps/aspera-connect-22.214.171.124545/etc/asperaweb_id_dsa.putty"
where "-a" specifies the path to the aspera binary and the private key file, separated by "|", and "-X" raises the maximum download size (here to 200G). Prefetch will download the SRA data as well as all needed references to your local cache. This avoids sending multiple requests to NCBI servers and saves substantial time.
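Since the "-a" value is easy to mistype, here is a small sketch that builds it from two paths. `ascp_arg` and `fetch_run` are hypothetical helper names, and the install prefix in the usage comment is an assumption (a per-user Aspera Connect install); substitute the path used at your site.

```shell
# Join the ascp binary and its private key into the "binary|key"
# form that prefetch -a expects.
ascp_arg() {
    printf '%s|%s' "$1" "$2"
}

# Download one accession through Aspera.
#   $1 = accession, $2 = Aspera Connect install directory (assumed layout:
#   bin/ascp and etc/asperaweb_id_dsa.putty underneath it)
fetch_run() {
    accession=$1
    aspera_home=$2
    prefetch -X 200G \
        -a "$(ascp_arg "$aspera_home/bin/ascp" "$aspera_home/etc/asperaweb_id_dsa.putty")" \
        "$accession"
}

# Example (hypothetical prefix):
#   fetch_run SRR2095320 "$HOME/.aspera/connect"
```

Keeping the argument quoted matters, because an unquoted "|" would be interpreted by the shell as a pipe.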
- Demultiplex with sra-toolkit
fastq-dump -I --split-files ./SRR2095320.sra
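With --split-files, fastq-dump writes the two mates to SRR2095320_1.fastq and SRR2095320_2.fastq. A quick sanity check after the conversion is to confirm that both files contain the same number of reads. The sketch below assumes standard four-line FASTQ records; `count_reads` and `check_pair` are hypothetical helper names.

```shell
# Number of reads in a FASTQ file, assuming four lines per record.
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Compare the read counts of <prefix>_1.fastq and <prefix>_2.fastq.
check_pair() {
    r1=$(count_reads "$1_1.fastq")
    r2=$(count_reads "$1_2.fastq")
    if [ "$r1" -eq "$r2" ]; then
        echo "OK: $r1 read pairs"
    else
        echo "MISMATCH: R1=$r1 R2=$r2"
        return 1
    fi
}

# Example:
#   check_pair SRR2095320
```

On files of this size the line count takes a while, but it catches a truncated conversion before the data goes into downstream analysis.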
I found the above approach extremely fast. With it, the 44 GB SRR2095320.sra file was downloaded with prefetch and converted to FASTQ (151 GB R1 + 151 GB R2) in about 25 hours using 10 processors. Using only the standard fastq-dump without prefetching, more than 70 hours produced only 80 GB R1 + 80 GB R2 before the job failed because of a connection issue. That is about 12 GB of FASTQ per hour versus under 2.3 GB per hour.
12 months ago by sutturka (modified 12 months ago)