I am trying to download all transcriptome shotgun assembly (TSA) and whole genome shotgun (WGS) assemblies from the NCBI Trace archive for a given taxon (e.g. all arthropods). I have tried using eutils. An example query (for Asellus aquaticus) is:
((txid92525[Organism:exp]) AND ( "tsa master"[Properties] OR "wgs master"[Properties] ))
This returns all TSA and WGS master entries in nuccore, e.g.:
./esearch -db nuccore -query '((txid92525[Organism:exp]) AND ( "tsa master"[Properties] OR "wgs master"[Properties] ))' | ./efetch -format docsum
Now I need the link to the Trace entry for each search result. Take an example entry on the web: https://www.ncbi.nlm.nih.gov/nuccore/GDKY00000000.1 . At the bottom of the page there is a link like this: https://www.ncbi.nlm.nih.gov/Traces/wgs?val=GDKY01 with the Trace identifier GDKY01.
- I am unable to extract this link from the efetch result. How can I get the FTP URL?
- Is the id always the first 6 characters of the TSA ranges?
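To make the second question concrete, here is a minimal sketch of the derivation it implies, based only on the single example above (GDKY00000000.1 gives GDKY01). It assumes the identifier is the leading letters of the master accession followed by the accession version padded to two digits. That rule is a guess from one record, not something confirmed for every TSA/WGS project, and the function names trace_prefix and trace_url are hypothetical helpers, not part of EDirect.

```shell
# Hypothetical helper: derive the Trace identifier from a master accession.
# ASSUMPTION (from the GDKY example only): identifier = leading letters of
# the accession + the accession version zero-padded to two digits.
trace_prefix() {
    case $1 in
        *.*) ;;                                   # needs ACCESSION.VERSION
        *) echo "expected ACCESSION.VERSION, got: $1" >&2; return 1 ;;
    esac
    acc=${1%%.*}                                  # e.g. GDKY00000000
    ver=${1##*.}                                  # e.g. 1
    # Keep only the letters before the first digit, e.g. GDKY
    letters=$(printf '%s\n' "$acc" | sed 's/[0-9].*$//')
    printf '%s%02d\n' "$letters" "$ver"
}

# Hypothetical helper: build the Trace page URL in the form seen on the
# nuccore record page for the example entry.
trace_url() {
    prefix=$(trace_prefix "$1") || return 1
    printf 'https://www.ncbi.nlm.nih.gov/Traces/wgs?val=%s\n' "$prefix"
}
```

With the example entry, `trace_url GDKY00000000.1` prints the same link that appears at the bottom of the nuccore page. Whether this holds for every master record is exactly what the question above asks.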
The following solution works but is too slow, because it downloads each contig sequence separately even though a ready-made FASTA file is available on the FTP site:
./esearch -query 924393409 -db nuccore | ./elink -target nuccore -name nuccore_nuccore_mstr2mbr | ./efetch -format fasta