I am trying to download all transcriptome shotgun (TSA) and whole genome shotgun (WGS) assemblies from the NCBI Trace archive for a given taxon (e.g. all arthropods). I have tried using eutils. An example query (Asellus aquaticus) I am using is:
((txid92525[Organism:exp]) AND ( "tsa master"[Properties] OR "wgs master"[Properties] ))
This will yield all TSA and WGS master entries in nuccore, e.g.:
./esearch -db nuccore -query '((txid92525[Organism:exp]) AND ( "tsa master"[Properties] OR "wgs master"[Properties] ))' | ./efetch -format docsum
Now, I need the link to the Trace entry, for each search result. If we look at an example entry on the web: https://www.ncbi.nlm.nih.gov/nuccore/GDKY00000000.1 , at the bottom of the page, there is a link like this: https://www.ncbi.nlm.nih.gov/Traces/wgs?val=GDKY01 with the trace identifier: GDKY01
TSA GDKY01000001-GDKY01021684
- I am unable to extract this link from the efetch result. How can I get the FTP URL?
- Is the id always the first 6 characters of the TSA ranges?
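To make the second question concrete, here is a minimal sketch of pulling the prefix out of a range string in the shell. It assumes accessions have the form "4 or 6 uppercase letters, a 2-digit assembly version, then the contig number"; with a six-letter prefix the identifier is 8 characters rather than 6, so matching on the pattern seems safer than cutting a fixed width. The function name wgs_prefix is made up for illustration.

```shell
# Hypothetical helper: derive the project prefix (letters + 2-digit version)
# from a contig accession or a range like "GDKY01000001-GDKY01021684".
# Assumption: 4 or 6 uppercase letters, 2 digits, then the contig number.
wgs_prefix() {
    # Keep the first accession of a range, then drop any ".1" version suffix.
    acc=${1%%-*}
    acc=${acc%%.*}
    prefix=$(printf '%s\n' "$acc" |
        sed -n -E 's/^([A-Z]{4}([A-Z]{2})?[0-9]{2})[0-9]+$/\1/p')
    if [ -z "$prefix" ]; then
        echo "unrecognised accession: $1" >&2
        return 1
    fi
    printf '%s\n' "$prefix"
}

wgs_prefix "GDKY01000001-GDKY01021684"   # prints GDKY01
```

Note that the master record itself (GDKY00000000.1) carries 00 in the version position, so feeding it to this function gives GDKY00, not GDKY01; the prefix has to come from a contig accession or the TSA range.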
The following solution works but is too slow, because it downloads each contig sequence separately even though a ready-made FASTA file exists on the FTP server:
./esearch -query 924393409 -db nuccore | ./elink -target nuccore -name nuccore_nuccore_mstr2mbr | ./efetch -format fasta
Thank you! Especially important is the hint that the Trace browser works with any of the transcript ids, not only the shortened ones. The same seems to be true for SRA toolkit, so one can maybe even use
fastq-dump GDKY01000001
to get the download. I just compared the output of
fastq-dump -F --fasta GDKY01000001
against the FTP download using diff, and they are identical. Might this be an easier and more reliable way than constructing the FTP URL in a script?
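If this holds in general, batching it could look roughly like the sketch below. It assumes one contig accession per assembly is enough to fetch the whole set (as observed for GDKY01000001 above) and that accessions arrive one per line on stdin. The function name dump_all and the RUN switch are made up for illustration; by default the commands are only printed, not executed.

```shell
# Hypothetical batch wrapper around the fastq-dump call shown above.
# Reads one accession per line from stdin. Dry run by default: set RUN=1
# to actually invoke fastq-dump (requires the SRA toolkit on PATH).
dump_all() {
    while IFS= read -r acc; do
        [ -n "$acc" ] || continue          # skip blank lines
        if [ "${RUN:-0}" = 1 ]; then
            fastq-dump -F --fasta "$acc"
        else
            printf 'fastq-dump -F --fasta %s\n' "$acc"
        fi
    done
}

printf '%s\n' GDKY01000001 | dump_all
```

Checking the printed commands first is cheap, and a failed download for one accession does not stop the loop.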