Missing GenBank identifier to NCBI taxonomy mapping
4.5 years ago

I am having trouble mapping GenBank identifiers to their NCBI taxonomy IDs. I used the accession2taxid files (ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/) for the mapping. However, of my 400 million GenBank identifiers, only 150 million can be mapped to a taxonomy ID. For example, the identifier QOUH01147937 is not in any of the mapping files.

$ zgrep QOUH01147937 dead_nucl.accession2taxid.gz  dead_wgs.accession2taxid.gz nucl_gb.accession2taxid.gz  nucl_wgs.accession2taxid.gz
$

At the same time eutils seems to be able to map the accession. How does eutils get it right and how can I locally replicate its lookup?

$ wget -O -  "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=QOUH01147937&rettype=fasta&retmode=xml"  2> /dev/null


<!DOCTYPE TSeqSet PUBLIC "-//NCBI//NCBI TSeq/EN" "https://www.ncbi.nlm.nih.gov/dtd/NCBI_TSeq.dtd">
<TSeqSet>
<TSeq>
  <TSeq_seqtype value="nucleotide"/>
  <TSeq_accver>QOUH01147937.1</TSeq_accver>
  <TSeq_sid>gnl|WGS:QOUH01|Supernova_9898_2</TSeq_sid>
  <TSeq_taxid>181123</TSeq_taxid>
…
</TSeqSet>
4.5 years ago
h.mon 35k

The particular accession you are looking for (QOUH01147937) is part of a whole genome shotgun sequencing project assembly. The GenBank page of the project has this comment:

COMMENT     The Austropuccinia psidii whole genome shotgun (WGS) project has
            the project accession QOUH00000000.  This version of the project
            (01) has the accession number QOUH01000000, and consists of
            sequences QOUH01000001-QOUH01147937.

Searching for the _base_ (?) accession returns the correct information:

grep "QOUH00000000" nucl_wgs.accession2taxid

QOUH00000000 QOUH00000000.1 181123 1511945653

This genome assembly alone has ~150k contigs. I suspect the millions of missing accessions suffer from the same problem: a couple of thousand assemblies of similar size would be enough to explain the ~250 million missing records (400 million minus the 150 million that mapped).
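
If that is the cause, a contig accession can be turned into its project accession by zeroing every digit, which is what the QOUH01147937 to QOUH00000000 example above suggests. Below is a minimal sketch of that idea. The function name `wgs_master` is made up for illustration, and the pattern for what counts as a WGS-style accession (4 or 6 letters followed only by digits) is my assumption about the format, so check it against your own identifiers before relying on it.

```shell
# Hypothetical helper: derive the WGS project ("base") accession from a contig
# accession by replacing every digit with 0, e.g. QOUH01147937 -> QOUH00000000.
# Assumption: WGS accessions are 4 letters + 8-10 digits or 6 letters + 9-11 digits.
wgs_master() {
    acc=${1%%.*}    # drop a trailing version suffix such as ".1"
    if printf '%s\n' "$acc" | grep -Eq '^([A-Z]{4}[0-9]{8,10}|[A-Z]{6}[0-9]{9,11})$'; then
        printf '%s\n' "$acc" | sed 's/[0-9]/0/g'
    else
        # Not WGS-shaped: return the accession unchanged
        printf '%s\n' "$acc"
    fi
}

# Usage, assuming an uncompressed nucl_wgs.accession2taxid in the current directory:
#   grep -w "$(wgs_master QOUH01147937)" nucl_wgs.accession2taxid
```

Accessions that do not look like WGS contigs pass through unchanged, so the same call can be applied to a mixed list.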


Thanks a lot for your answer! This explains why I could not grep for the accession. However, I still do not know how to efficiently map the accessions to a taxonomy ID. I guess I need to query the project accession instead, but that is a different problem.


I was able to put something together with the help of EDirect_EUtils_API_Cookbook:

esearch -db nuccore -query "QOUH01147937"  \
  | efetch -format docsum \
  | xtract -pattern DocumentSummary -element TaxId
181123
  

However, epost (for batch queries) works with the _base_ accession, but not with the accession of an individual contig:

echo "QOUH00000000" \
   | epost -db nuccore -format acc \
   | efetch -format docsum \
   | xtract -pattern DocumentSummary -element TaxId
181123
  
echo "QOUH01147937" \
   | epost -db nuccore -format acc \
   | efetch -format docsum \
   | xtract -pattern DocumentSummary -element TaxId
ERROR in count output: Empty result - nothing to do
URL: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nuccore&query_key=1&WebEnv=NCID_1_157744410_130.14.18.97_9001_1571189278_99587227_0MetA0_S_MegaStore&retmax=0&usehistory=y&edirect=6.10&tool=entrez-direct-count&email=program2@biotec01.esalq.usp.br+
Db value not found in summary input
  

I am pinging Pierre Lindenbaum, genomax and vkkodali because they are EDirect wizards and may have suggestions. Another option would be to open an issue at the EDirect_EUtils_API_Cookbook repository.
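
In the meantime, a purely local lookup might be enough for the WGS contigs. The sketch below is untested at scale and rests on a few assumptions: the mapping table is tab-separated with the accession in column 1 and the taxid in column 3, project records are the rows whose accession digits are all zeros, and any accession with at least 4 letters followed by at least 8 digits is a WGS contig. The function name `map_wgs_taxids` is made up for illustration.

```shell
# Hypothetical helper: map_wgs_taxids ACCESSION_LIST MAPPING_TABLE
# Prints "accession<TAB>taxid" per input accession, or "NA" when nothing matches.
map_wgs_taxids() {
    awk -F'\t' '
        # First file (mapping table): keep only project records (all-zero digits),
        # so the in-memory table stays small.
        NR == FNR { if ($1 ~ /^[A-Z]+0+$/) taxid[$1] = $3; next }
        # Second file (accession list): one accession per line.
        {
            key = $1
            sub(/\.[0-9]+$/, "", key)       # drop a version suffix such as ".1"
            # Assumed WGS shape: >= 4 letters followed by >= 8 digits
            if (key ~ /^[A-Z][A-Z][A-Z][A-Z][A-Z]*[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]+$/)
                gsub(/[0-9]/, "0", key)     # contig accession -> project accession
            print $1 "\t" ((key in taxid) ? taxid[key] : "NA")
        }
    ' "$2" "$1"
}

# Usage, assuming accessions.txt holds one accession per line:
#   map_wgs_taxids accessions.txt nucl_wgs.accession2taxid > accession_taxid.tsv
```

Only the project records are held in memory while the accession list is streamed, so this should cope with a very long list. Accessions reported as NA would still need the regular accession2taxid lookup or EUtils.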
