How can we download all TSA and WGS from NCBI Trace by taxon programmatically?
7.5 years ago
Michael 54k

I am trying to download all transcriptome shotgun and whole genome shotgun assemblies from NCBI Trace archive given a taxon (e.g. all arthropods). I have tried using eutils. An example query (Asellus aquaticus) I am using is:

((txid92525[Organism:exp]) AND ( "tsa master"[Properties] OR "wgs master"[Properties] ))

This will yield all TSA and WGS master entries in nuccore, e.g.:

./esearch -db nuccore -query '((txid92525[Organism:exp]) AND ( "tsa master"[Properties] OR "wgs master"[Properties] ))' | ./efetch -format docsum

Now, I need the link to the Trace entry, for each search result. If we look at an example entry on the web: https://www.ncbi.nlm.nih.gov/nuccore/GDKY00000000.1 , at the bottom of the page, there is a link like this: https://www.ncbi.nlm.nih.gov/Traces/wgs?val=GDKY01 with the trace identifier: GDKY01

TSA         GDKY01000001-GDKY01021684
  • I am unable to extract this link from the efetch result. How can I get the FTP URL?
  • Is the id always the first 6 characters of the TSA ranges?

The following solution works but is too slow, because it downloads each contig sequence separately, while a ready-made FASTA file is available on the FTP server:

 ./esearch  -query 924393409 -db nuccore | ./elink -target nuccore -name nuccore_nuccore_mstr2mbr | ./efetch -format fasta
NCBI trace eutils efetch Assembly • 4.3k views
7.5 years ago

I'll give it a try even though I don't know eutils well; hopefully this gives you some clues toward an answer.

You can change the efetch format to GenBank (-format gb), so that you get the same content as on the web page, without the HTML.

./esearch -db nuccore -query '((txid92525[Organism:exp]) AND ( "tsa master"[Properties] OR "wgs master"[Properties] ))' | ./efetch -format gb

At the end of the record there is a TSA line giving the first and last accession numbers of the assembly's sequences, which can easily be extracted with command-line tools.

TSA         GDKY01000001-GDKY01021684
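One way to pull that line out on the command line is a short awk filter. This is only a sketch: the function name master_ranges is made up for illustration, and it assumes the range sits on a line that starts with TSA or WGS in the flat file, as in the record shown above.

```shell
# Hypothetical helper: print the accession range(s) found on TSA/WGS lines
# of GenBank flat-file text read from stdin.
# Assumption: the range is the second whitespace-separated field of a line
# that begins with "TSA " or "WGS " in column 1.
master_ranges() {
  awk '/^(TSA|WGS) +/ { print $2 }'
}

# Example on a hand-written fragment (not a real download):
printf 'LOCUS       GDKY01000000\nTSA         GDKY01000001-GDKY01021684\n//\n' | master_ranges
```

In a real pipeline the input would be the output of efetch -format gb instead of the printf fragment.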

These are the first and last accession numbers. They have to be queried in the Trace database, whose base URL is https://www.ncbi.nlm.nih.gov/Traces/wgs/?val=

So your queries should look something like https://www.ncbi.nlm.nih.gov/Traces/wgs/?val=GDKY01000001
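Building such a URL from an extracted range needs nothing more than POSIX parameter expansion. A minimal sketch, where range_start and trace_url are hypothetical helper names and the base URL is the one quoted above:

```shell
# Hypothetical helper: first accession of a range such as
# GDKY01000001-GDKY01021684 (everything before the first "-").
range_start() {
  printf '%s\n' "${1%%-*}"
}

# Hypothetical helper: Trace WGS browser URL for an accession,
# built from the base URL mentioned in this answer.
trace_url() {
  printf 'https://www.ncbi.nlm.nih.gov/Traces/wgs/?val=%s\n' "$1"
}

# Example with the range from the record above:
trace_url "$(range_start GDKY01000001-GDKY01021684)"
```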

It seems to me that querying with the first TSA accession alone suffices, but you can check further. HTH!


Thank you! Especially important is the hint that the Trace browser works with any of the transcript ids, not only the shortened ones. The same seems to be true for SRA toolkit, so one can maybe even use fastq-dump GDKY01000001 to get the download.


I just compared the output of fastq-dump -F --fasta GDKY01000001 with the FTP download using diff, and they are identical. This may be an easier and more reliable way than constructing the FTP URL in a script.
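For anyone who wants to repeat that check, a comparison along these lines should do once both files are available as uncompressed FASTA. This is a sketch only: same_file is a hypothetical helper, and the demonstration uses two throwaway files instead of real downloads.

```shell
# Hypothetical helper: succeed only if the two files are byte-identical.
same_file() {
  cmp -s "$1" "$2"
}

# Self-contained demonstration on two throwaway files (no download involved):
tmp=$(mktemp -d)
printf '>seq1\nACGT\n' > "$tmp/a.fasta"
printf '>seq1\nACGT\n' > "$tmp/b.fasta"
if same_file "$tmp/a.fasta" "$tmp/b.fasta"; then
  echo "identical"
else
  echo "different"
fi
rm -rf "$tmp"
```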

7.5 years ago
Michael 54k

Here is a shell script to automate this:

#!/bin/sh

set -eu
TAX=$1
# Fetch all TSA/WGS master records for the taxon; tee saves the XML result for further reference
RESULT=$(esearch -db nuccore -query '((txid'"${TAX}"'[Organism:exp]) AND ( "tsa master"[Properties] OR "wgs master"[Properties] ))' | \
  efetch -format xml | tee "${TAX}.esearch.xml")
# Extract the seq-id name field from each TSA or WGS master record
ID=$(printf '%s\n' "$RESULT" | xtract -pattern Seq-entry -element Textseq-id_name)
for I in $ID ; do
  echo "Downloading $I ..."
  if [ -e "$I.fasta" ]
  then
    echo " skipping because file exists."
    continue # skip if the file has been downloaded already
  fi
  # Use SRA toolkit fastq-dump with the options to make a FASTA file (--fasta) and the standard header (-F)
  fastq-dump -F --fasta "$I"
done

The script depends on the Entrez Direct command-line tools (esearch, efetch, xtract) and the SRA toolkit, which need to be installed and on the PATH. Call it like this:

  ./fetchAllTsaByTaxon.sh Taxid