I'm curious why this is done explicitly like this. Is it established that the "PATH2" portion of the FTP URLs will always be at most 10 characters long, in anticipation of 11-character SRR IDs? That is, if an SRR ID longer than 10 characters ever comes into use, should the PATH2 part be the SRR ID truncated to 10 characters? If so, could someone point me to a reference where this convention is described?
If that's not the case, and it isn't part of any known specification, wouldn't an 11-character identifier break wonderdump.sh?
What NCBI may or may not do in the future is speculative. But as of now there is a finite number of directories at the PATH1 level, and those run from SRR000 to SRR999 (some of the directories are still empty, so there is room for growth). If NCBI does start using longer IDs in PATH2, it would be a simple change to account for that.
As I recall, wonderdump.sh was specifically written to allow SRA downloads to work on the Linux subsystem on Windows 10.
In general, EBI-ENA should be your first stop to download fastq format sequence data. This avoids having to deal with SRA and its related inconveniences.
I agree that what NCBI chooses to do is speculative. However, wonderdump.sh explicitly truncates the SRR ID at 10 characters to create PATH2, even though there is currently no difference between doing this and using the full SRR ID. I'm assuming there's a good reason for this, possibly based on some convention; otherwise the script could just use the SRR ID as is.
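To make the point concrete, here is a minimal sketch of what the truncation does for IDs of different lengths. The accessions are made-up placeholders, and path2_truncated is a hypothetical helper that only wraps the ${SRR:0:10} expansion from the script:

```bash
#!/usr/bin/env bash
# Hypothetical helper: reproduces the PATH2=${SRR:0:10} expansion.
path2_truncated() {
    local srr="$1"
    printf '%s\n' "${srr:0:10}"
}

# Placeholder accessions of 9, 10 and 11 characters (not real runs).
for srr in SRR123456 SRR1234567 SRR12345678; do
    printf '%s -> %s\n' "$srr" "$(path2_truncated "$srr")"
done
```

For the 9- and 10-character IDs the result is the ID itself; only the 11-character one comes back shortened, to SRR1234567.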
You're right that the script used to mention something about being a workaround for Windows Bash, but it's been updated to:
Wonderdump is a workaround to download SRA files directly
when fastq-dump's internet connection does not work.
Which can happen surprisingly frequently.
Which is true at least in my experience and the reason why we're getting .sra files with this method.
I agree EBI-ENA is more straightforward. However, my requirement is to download datasets from SRA, so it's not up to me at this point. I'm maintaining a pipeline that uses wonderdump.sh, so I would like to future-proof it as much as possible.
I don't think there is a published convention. Here are some examples of 9-character SRR IDs. Try to see what happens with these. Istvan may have used examples of 10-character SRR IDs in the handbook and thus in wonderdump.sh.
The PATH2 is intended to be the full SRA Run accession, and is not
restricted to a character limit.
So it appears that doing PATH2=${SRR:0:10} is not necessary or required by any convention, and it would indeed probably break if 11-character identifiers ever come into use. The rest of the e-mail, pasted below, also suggests that the FTP site may not be available in the future:
However, as the SRA database grows to a very large, this avenue for
getting SRA Run files becomes more difficult to maintain. The SRA will
provide support for the ByRun and ByStudy FTP paths to accessions for
now, but our systems group predicts that it may not be able to support
it at some point in the future and suggests using the SRA toolkit to
access Runs
(https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc).
For my purposes, I'll update wonderdump.sh to make sure it doesn't silently truncate PATH2 at 10 characters, or at least explicitly raise an error if it runs into an identifier longer than 10 characters.
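Roughly what I have in mind is sketched below. This is not the actual wonderdump.sh code: BASE is a placeholder URL (substitute whatever your copy of the script uses), sra_url is a hypothetical helper name, and the "SRR followed by digits" check is my own assumption about what a valid ID looks like. PATH1 is still the first six characters (SRR000 to SRR999), and PATH2 is the full accession:

```bash
#!/usr/bin/env bash
# Placeholder base URL (assumption); replace with the one your script uses.
BASE="ftp://example.org/sra/ByRun/sra/SRR"

# Hypothetical helper: build the .sra URL without truncating PATH2.
sra_url() {
    local srr="$1"
    # Assumed ID format: "SRR" followed by one or more digits.
    if [[ ! "$srr" =~ ^SRR[0-9]+$ ]]; then
        echo "error: '$srr' does not look like an SRR accession" >&2
        return 1
    fi
    # Make longer-than-usual IDs visible instead of silently altering them.
    if (( ${#srr} > 10 )); then
        echo "warning: '$srr' is longer than 10 characters; using it untruncated" >&2
    fi
    local path1="${srr:0:6}"   # e.g. SRR123
    local path2="$srr"         # full accession, no truncation
    printf '%s/%s/%s/%s.sra\n' "$BASE" "$path1" "$path2" "$srr"
}

sra_url SRR1234567
```

If you would rather fail hard on IDs longer than 10 characters, change the warning branch to return 1 as well.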