Question: Missing GenBank identifier to NCBI taxonomy mapping
7 weeks ago by
martinsteinegger wrote:

I am having trouble mapping GenBank identifiers to their NCBI taxonomy identifiers. I used the accession2taxid files (ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/) for the mapping. However, out of my 400 million GenBank identifiers, only 150 million can be mapped to a taxonomy ID. For example, the identifier QOUH01147937 is not in any of the mapping files.

$ zgrep QOUH01147937 dead_nucl.accession2taxid.gz  dead_wgs.accession2taxid.gz nucl_gb.accession2taxid.gz  nucl_wgs.accession2taxid.gz
$

At the same time eutils seems to be able to map the accession. How does eutils get it right and how can I locally replicate its lookup?

$ wget -O -  "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=QOUH01147937&rettype=fasta&retmode=xml"  2> /dev/null


<!DOCTYPE TSeqSet PUBLIC "-//NCBI//NCBI TSeq/EN" "https://www.ncbi.nlm.nih.gov/dtd/NCBI_TSeq.dtd">
<TSeqSet>
<TSeq>
  <TSeq_seqtype value="nucleotide"/>
  <TSeq_accver>QOUH01147937.1</TSeq_accver>
  <TSeq_sid>gnl|WGS:QOUH01|Supernova_9898_2</TSeq_sid>
  <TSeq_taxid>181123</TSeq_taxid>
…
</TSeqSet>
7 weeks ago by
h.mon (Brazil) wrote:

The particular accession you are looking for (QOUH01147937) is part of a whole genome shotgun (WGS) sequencing project assembly. The GenBank page of the project has this comment:

COMMENT     The Austropuccinia psidii whole genome shotgun (WGS) project has
            the project accession QOUH00000000.  This version of the project
            (01) has the accession number QOUH01000000, and consists of
            sequences QOUH01000001-QOUH01147937.

Searching for the _base_ (?) accession returns the correct information:

grep "QOUH00000000" nucl_wgs.accession2taxid

QOUH00000000 QOUH00000000.1 181123 1511945653

This genome assembly has ~150k contigs. I suspect the millions of missing accessions suffer from the same problem: your numbers leave 250 million records unmapped (400 million minus the 150 million that mapped), and a couple of thousand similar assemblies would be enough to explain them.
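The lookup above suggests a purely local workaround: derive the project accession from the contig accession, then grep for that instead. The sketch below is only an illustration. The function name `wgs_master` is made up, and it assumes the accession is a letter prefix followed only by digits (as in QOUH01147937), so zeroing every digit yields the project accession.

```shell
#!/bin/sh
# Hypothetical helper: turn a WGS contig accession into its project accession.
# Assumption: the accession is a letter prefix followed only by digits,
# optionally with a ".N" version suffix (e.g. QOUH01147937.1).
wgs_master() {
  # ${1%%.*} drops the version suffix; sed replaces every digit with 0.
  printf '%s\n' "${1%%.*}" | sed 's/[0-9]/0/g'
}

wgs_master QOUH01147937    # prints QOUH00000000
```

The result can be passed to the same grep against nucl_wgs.accession2taxid shown above. Whether every contig of a project really shares the taxid of the project record is an assumption worth spot-checking against eutils.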


Thanks a lot for your answer! This explains why I could not grep for the accession. However, I still do not know how to efficiently map the accessions to a taxonomy ID. I guess I need to query the project accession instead, but that is a different problem.
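One possible way to do this in bulk without eutils is a single awk pass: keep only the project records from the table (those whose digits are all zero), then map each query accession through its zeroed-out form. This is a rough sketch, not a tested pipeline. The function name `map_wgs_taxids` and the file names are hypothetical, the table is assumed to be uncompressed and tab-separated (accession, accession.version, taxid, gi), and all contigs are assumed to share the taxid of their project record.

```shell
#!/bin/sh
# Hypothetical helper. $1: accession2taxid table (uncompressed, tab-separated),
# $2: file with one query accession per line.
map_wgs_taxids() {
  awk -F '\t' '
    # First file (table): remember only project records, i.e. a letter
    # prefix followed by nothing but zeros. The header line never matches.
    NR == FNR { if ($1 ~ /^[A-Z_]+0+$/) taxid[$1] = $3; next }
    # Second file (queries): strip any ".N" suffix, zero out all digits,
    # and look the result up; print NA when no project record exists.
    {
      master = $1
      sub(/\..*$/, "", master)
      gsub(/[0-9]/, "0", master)
      print $1 "\t" ((master in taxid) ? taxid[master] : "NA")
    }
  ' "$1" "$2"
}

# Tiny demonstration using the record quoted in the answer and one
# made-up accession (ZZZZ01000001) that has no project record.
tmp=$(mktemp -d)
printf 'accession\taccession.version\ttaxid\tgi\n' > "$tmp/table.tsv"
printf 'QOUH00000000\tQOUH00000000.1\t181123\t1511945653\n' >> "$tmp/table.tsv"
printf 'QOUH01147937\nZZZZ01000001\n' > "$tmp/queries.txt"
map_wgs_taxids "$tmp/table.tsv" "$tmp/queries.txt"
rm -rf "$tmp"
```

Because only project records are held in memory, the array stays small even though the table itself is large; the query file is streamed line by line.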

written 7 weeks ago by martinsteinegger

I was able to put something together with the help of EDirect_EUtils_API_Cookbook:

esearch -db nuccore -query "QOUH01147937"  \
  | efetch -format docsum \
  | xtract -pattern DocumentSummary -element TaxId
181123
  

However, epost (for batch queries) works with the _base_ accession, but not with the accession of an individual contig from a versioned project:

echo "QOUH00000000" \
   | epost -db nuccore -format acc \
   | efetch -format docsum \
   | xtract -pattern DocumentSummary -element TaxId
181123
  
echo "QOUH01147937" \
   | epost -db nuccore -format acc \
   | efetch -format docsum \
   | xtract -pattern DocumentSummary -element TaxId
ERROR in count output: Empty result - nothing to do
URL: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nuccore&query_key=1&WebEnv=NCID_1_157744410_130.14.18.97_9001_1571189278_99587227_0MetA0_S_MegaStore&retmax=0&usehistory=y&edirect=6.10&tool=entrez-direct-count&email=program2@biotec01.esalq.usp.br+
Db value not found in summary input
  

I am pinging Pierre Lindenbaum, genomax and vkkodali because they are EDirect wizards and may have suggestions. Another option would be to open an issue at the EDirect_EUtils_API_Cookbook repository.

written 7 weeks ago by h.mon