Question: wonderdump.sh and FTP site URL conventions for SRR identifiers
13 days ago, manuel.belmadani (Canada) wrote:

I've been using wonderdump.sh from the Biostars handbook for some time. I'm now curious about the part that builds the FTP site URL:

PATH1=${SRR:0:6}
PATH2=${SRR:0:10}
URL="ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/${PATH1}/${PATH2}/${SRR}.sra"

I've seen SRR IDs of length 9 and length 10, so in both cases PATH2 is effectively the full SRR ID.

Examples:
ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR000/SRR000001/SRR000001.sra
ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR213/SRR2138040/SRR2138040.sra

I'm curious to know why this is done explicitly like this. Is it established that the "PATH2" portion of the FTP URLs will always be at most 10 characters long, in anticipation of length-11 SRR IDs? That is, if an SRR ID longer than 10 characters ever comes into use, should the PATH2 part be the SRR ID truncated to 10 characters? If that's the case, could someone point me to a reference where this convention is described?

If that's not the case and it isn't part of any known specification, then wouldn't an 11-character identifier break wonderdump.sh?
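
To make the concern concrete, here is a minimal sketch that wraps the two substring expansions quoted above in a function and runs them on the two example IDs plus a made-up 11-character ID. The function name legacy_url and the ID SRR12345678 are hypothetical and used only for illustration; no such accession is claimed to exist.

```shell
#!/usr/bin/env bash
# Reproduces the URL construction quoted from wonderdump.sh.
# legacy_url is a hypothetical wrapper name, not part of the original script.
legacy_url() {
    local SRR="$1"
    local PATH1=${SRR:0:6}     # first 6 characters, e.g. SRR213
    local PATH2=${SRR:0:10}    # first 10 characters, truncates longer IDs
    echo "ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/${PATH1}/${PATH2}/${SRR}.sra"
}

legacy_url SRR000001      # 9 characters: PATH2 is the full ID
legacy_url SRR2138040     # 10 characters: PATH2 is the full ID
legacy_url SRR12345678    # hypothetical 11 characters: PATH2 becomes SRR1234567
```

For the 9- and 10-character IDs the directory name equals the accession, while for the 11-character ID the directory name and the file name no longer agree.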

Much appreciated!

Tags: wonderdump, sra, convention, ftp, ncbi
modified 11 days ago • written 13 days ago by manuel.belmadani (30)

What NCBI may or may not do in the future is speculative. But as of now there is a finite set of directories at the PATH1 level, running from SRR000 to SRR999 (some of the directories are still empty, so there is room for growth). If NCBI does start using longer IDs in PATH2, it would be a simple change to account for that.

As I recall, wonderdump.sh was specifically written to allow SRA downloads to work on the Linux subsystem on Windows 10.

In general, EBI-ENA should be your first stop to download fastq format sequence data. This avoids having to deal with SRA and its related inconveniences.

modified 13 days ago • written 13 days ago by genomax (50k)

Thanks for the reply!

I agree that what NCBI chooses to do is speculative. However, wonderdump.sh explicitly truncates the SRR ID at 10 characters to create PATH2, when currently there's no difference between doing this and using the full SRR ID. I'm assuming there's a good reason for this, possibly based on some convention, or else the script could just use the SRR ID.

You're right that the script used to mention something about being a workaround for Windows Bash, but it's been updated to:

"Wonderdump is a workaround to download SRA files directly when fastq-dump's internet connection does not work. Which can happen surprisingly frequently."

That is true, at least in my experience, and it is the reason why we're getting .sra files with this method.

I agree EBI-ENA is more straightforward. However, my requirement is to download datasets from SRA, so it's not up to me at this point. I'm maintaining a pipeline that uses wonderdump.sh, so I would like to future-proof it as much as possible.

modified 13 days ago • written 13 days ago by manuel.belmadani (30)

I don't think there is a published convention. Here are some examples of 9-character SRR IDs. Try to see what happens with these. Istvan may have used examples of 10-character SRR IDs in the handbook and thus in wonderdump.sh.

modified 13 days ago • written 13 days ago by genomax (50k)
11 days ago, manuel.belmadani (Canada) wrote:

I got a reply from an SRA curator:

The PATH2 is intended to be the full SRA Run accession, and is not restricted to a character limit.

So it appears that PATH2=${SRR:0:10} is not required by any convention, and it would indeed probably break if 11-character identifiers ever come into use. The rest of the e-mail, pasted below, also suggests that the FTP site may not be available in the future:

However, as the SRA database grows to a very large [size], this avenue for getting SRA Run files becomes more difficult to maintain. The SRA will provide support for the ByRun and ByStudy FTP paths to accessions for now, but our systems group predicts that it may not be able to support it at some point in the future and suggests using the SRA toolkit to access Runs (https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc).

For my purposes, I'll update wonderdump.sh to make sure it doesn't silently truncate PATH2 at 10 characters, or at least explicitly raise an error if it runs into an identifier longer than 10 characters.
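
A minimal sketch of what such an update could look like is below. It is not the actual wonderdump.sh change: build_url is a hypothetical helper, the accession pattern ("SRR" followed by at least six digits) is an assumption based only on the IDs seen in this thread, and keeping PATH1 as the first six characters for longer IDs is likewise an assumption. The only part taken from the curator's reply is that PATH2 is the full run accession.

```shell
#!/usr/bin/env bash
# Sketch of a URL builder that never truncates the accession.
# build_url is a hypothetical helper, not part of the original wonderdump.sh.
build_url() {
    local SRR="$1"

    # Assumption: a run accession is "SRR" followed by at least six digits.
    # Anything else is rejected loudly instead of producing a bad URL.
    if [[ ! "$SRR" =~ ^SRR[0-9]{6,}$ ]]; then
        echo "error: unexpected run accession: '${SRR}'" >&2
        return 1
    fi

    # Assumption: the first-level directory stays the first six characters.
    local PATH1=${SRR:0:6}
    # Per the curator's reply, the second-level directory is the full accession.
    local PATH2=${SRR}

    echo "ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/${PATH1}/${PATH2}/${SRR}.sra"
}

# Example call with one of the IDs from the question.
build_url SRR2138040
```

In a pipeline, the result could be captured with something like URL=$(build_url "$SRR") || exit 1, so that a malformed ID stops the run. A stricter variant could also fail when ${#SRR} is greater than 10, which would flag longer identifiers for manual review rather than guessing at the directory layout.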

written 11 days ago by manuel.belmadani (30)