How can we download all TSA and WGS from NCBI Trace by taxon programmatically?
5.1 years ago

I am trying to download all transcriptome shotgun and whole genome shotgun assemblies from NCBI Trace archive given a taxon (e.g. all arthropods). I have tried using eutils. An example query (Asellus aquaticus) I am using is:

((txid92525[Organism:exp]) AND ( "tsa master"[Properties] OR "wgs master"[Properties] ))


This will yield all TSA and WGS master entries in nuccore, e.g.:

./esearch -db nuccore -query '((txid92525[Organism:exp]) AND ( "tsa master"[Properties] OR "wgs master"[Properties] ))' | ./efetch -format docsum


Now, I need the link to the Trace entry, for each search result. If we look at an example entry on the web: https://www.ncbi.nlm.nih.gov/nuccore/GDKY00000000.1 , at the bottom of the page, there is a link like this: https://www.ncbi.nlm.nih.gov/Traces/wgs?val=GDKY01 with the trace identifier: GDKY01

TSA         GDKY01000001-GDKY01021684

• I am unable to extract this link from the efetch result. How can I get the FTP URL?
• Is the id always the first 6 characters of the TSA ranges?
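On the second bullet, a minimal sketch that derives the identifier by pattern instead of cutting a fixed six characters. It assumes an accession is made of uppercase letters, a two-digit assembly version, and then the contig number; `wgs_prefix` is a hypothetical helper name, not part of eutils.

```shell
# Print the project prefix (letters + two-digit version) of a contig accession.
# Assumption: accession = uppercase letters, two version digits, contig number.
# Prints nothing if the input does not match that shape.
wgs_prefix() {
  printf '%s\n' "$1" | sed -n 's/^\([A-Z][A-Z]*[0-9][0-9]\)[0-9][0-9]*$/\1/p'
}

wgs_prefix GDKY01000001
```

Because the letter run is matched rather than counted, the same helper would also cope with a longer letter prefix, should one occur.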

The following solution works but is too slow, because it downloads each contig sequence separately while there is a ready fasta file on ftp:

 ./esearch  -query 924393409 -db nuccore | ./elink -target nuccore -name nuccore_nuccore_mstr2mbr | ./efetch -format fasta

5.1 years ago

I'll give it a try even though I don't know eutils well; hopefully this gives you some clues towards an answer.

You can change the efetch format to GenBank (-format gb) so that you get the same record as on the web page, without the HTML.

./esearch -db nuccore -query '((txid92525[Organism:exp]) AND ( "tsa master"[Properties] OR "wgs master"[Properties] ))' | ./efetch -format gb


Near the end of each record there is a TSA line (WGS for genome masters) giving the first and last accession numbers of the contigs, which can easily be extracted with command-line tools.

TSA         GDKY01000001-GDKY01021684


These are the start and end accession numbers. They have to be queried in the Trace database, whose base URL is https://www.ncbi.nlm.nih.gov/Traces/wgs/?val=

So your queries should look something like https://www.ncbi.nlm.nih.gov/Traces/wgs/?val=GDKY01000001
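As a rough sketch of that extraction step: the function below reads GenBank flat-file text on standard input and prints one Trace URL per range line. It assumes the range line has the form shown above (the tag TSA or WGS in the first column, then START-END); `trace_urls` is a made-up helper name.

```shell
# Read GenBank flat-file text on stdin; print a Trace URL for the first
# accession of every TSA or WGS range line.
# Assumption: the range line looks like "TSA         GDKY01000001-GDKY01021684".
trace_urls() {
  awk '$1 == "TSA" || $1 == "WGS" {
         split($2, parts, "-")   # parts[1] = start accession, parts[2] = end
         print "https://www.ncbi.nlm.nih.gov/Traces/wgs/?val=" parts[1]
       }'
}
```

It could be fed directly from the pipeline, e.g. `... | ./efetch -format gb | trace_urls`.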

It seems to me that even querying with TSA-start suffices, but you can check further. HTH!


Thank you! Especially important is the hint that the Trace browser works with any of the transcript ids, not only the shortened ones. The same seems to be true for SRA toolkit, so one can maybe even use fastq-dump GDKY01000001 to get the download.


I just compared the output of fastq-dump -F --fasta GDKY01000001 by diff with the ftp download and they are identical. Maybe an easier and more reliable way than to construct the ftp URL in a script?

5.1 years ago

Here is a shell script to automate this:

#!/bin/sh

set -eu
TAX=$1
# Fetch all TSA and WGS master records for the taxon; save the XML result for further reference.
RESULT=$(esearch -db nuccore -query '((txid'${TAX}'[Organism:exp]) AND ( "tsa master"[Properties] OR "wgs master"[Properties] ))' | \
  efetch -format xml | tee ${TAX}.esearch.xml)
# Extract the seq-id name field from each TSA or WGS master.
ID=$(echo "$RESULT" | xtract -pattern Seq-entry -element Textseq-id_name)
for I in $ID ; do
  echo "Downloading $I ..."
  if [ -e "$I.fasta" ]
  then
    echo " skipping because file exists."
    continue # skip if the file has been downloaded already
  fi
  # Use SRA toolkit fastq-dump with the option to make a fasta file and the standard header.
  fastq-dump --fasta -F "$I"
done


The script depends on e-utils (esearch, efetch, xtract) and the SRA toolkit (fastq-dump), which need to be installed and in PATH. Call:

  ./fetchAllTsaByTaxon.sh Taxid
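
Since the script only works when all four tools are in PATH, a small pre-flight check may save a confusing failure halfway through. This is just a sketch; `missing_tools` is a hypothetical helper, not something shipped with e-utils or the SRA toolkit.

```shell
# Print the name of every listed command that cannot be found in PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Report anything the download script would need but cannot find.
missing_tools esearch efetch xtract fastq-dump
```

If the last line prints nothing, all dependencies were found.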