Question

Missing GenBank identifier to NCBI taxonomy mapping

4.5 years ago

martinsteinegger ▴ 40

I am having trouble mapping GenBank identifiers to their NCBI taxonomy identifiers. I used the accession2taxid files (ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/) for the mapping. However, out of my 400 million GenBank identifiers, only 150 million can be mapped to a taxonomy ID. For example, the identifier QOUH01147937 is not in any of the mapping files.

$ zgrep QOUH01147937 dead_nucl.accession2taxid.gz  dead_wgs.accession2taxid.gz nucl_gb.accession2taxid.gz  nucl_wgs.accession2taxid.gz
$

At the same time eutils seems to be able to map the accession. How does eutils get it right and how can I locally replicate its lookup?

$ wget -O -  "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=QOUH01147937&rettype=fasta&retmode=xml"  2> /dev/null


<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE TSeqSet PUBLIC "-//NCBI//NCBI TSeq/EN" "https://www.ncbi.nlm.nih.gov/dtd/NCBI_TSeq.dtd">
<TSeqSet>
<TSeq>
  <TSeq_seqtype value="nucleotide"/>
  <TSeq_accver>QOUH01147937.1</TSeq_accver>
  <TSeq_sid>gnl|WGS:QOUH01|Supernova_9898_2</TSeq_sid>
  <TSeq_taxid>181123</TSeq_taxid>
…
</TSeqSet>

ncbi taxonomy genbank • 1.6k views

ADD COMMENT • link updated 4.5 years ago by h.mon 35k • written 4.5 years ago by martinsteinegger ▴ 40

score 1 · Answer 1 · 2019-10-15

4.5 years ago

h.mon 35k

The particular accession you are looking for (QOUH01147937) is part of a whole genome shotgun (WGS) sequencing project assembly. The GenBank page of the project has this comment:

COMMENT     The Austropuccinia psidii whole genome shotgun (WGS) project has
            the project accession QOUH00000000.  This version of the project
            (01) has the accession number QOUH01000000, and consists of
            sequences QOUH01000001-QOUH01147937.

Searching for the _base_ (?) accession returns the correct information:

grep "QOUH00000000" nucl_wgs.accession2taxid

QOUH00000000 QOUH00000000.1 181123 1511945653

This genome assembly has ~150k contigs. I suspect the millions of missing accessions suffer from the same problem: it would take only a couple of thousand similar assemblies to account for the roughly 250 million missing records (400 million minus the 150 million you could map).
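A local lookup along these lines could be sketched as follows. This is only a sketch under assumptions: it assumes that zeroing every digit of a contig accession yields the project accession (as QOUH01147937 and QOUH00000000 suggest), that the project record carries the right taxid for all of its contigs, and that the mapping file is tab-separated with the accession in column 1 and the taxid in column 3. The function name `map_via_wgs_master` is made up for illustration.

```shell
# Map WGS contig accessions (one per line on stdin) to taxids by falling
# back to the project accession found in an uncompressed accession2taxid file.
# Hypothetical helper; $1 is the path to e.g. nucl_wgs.accession2taxid.
map_via_wgs_master() {
  awk -F'\t' '
    # First input (the mapping file): keep only project-level records,
    # i.e. accessions made of letters followed by zeros, to limit memory use.
    NR == FNR {
      if ($1 ~ /^[A-Z]+0+$/) taxid[$1] = $3
      next
    }
    # Second input (stdin): derive the project accession from each query by
    # dropping any ".version" suffix and replacing every digit with 0.
    {
      master = $1
      sub(/\.[0-9]+$/, "", master)
      gsub(/[0-9]/, "0", master)
      print $1 "\t" ((master in taxid) ? taxid[master] : "NA")
    }
  ' "$1" -
}
```

With the uncompressed file in the working directory, `printf 'QOUH01147937\n' | map_via_wgs_master nucl_wgs.accession2taxid` should print the accession followed by 181123, given the record shown above. Accessions whose project record is absent come back as NA and would still need another lookup route.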


Thanks a lot for your answer! This explains why I could not grep for the accession. However, I still do not know how to efficiently map the accessions to a taxonomy ID. I guess I need to query the project accession, but that is a different problem.

ADD REPLY • link 4.5 years ago by martinsteinegger ▴ 40

I was able to put something together with the help of EDirect_EUtils_API_Cookbook:

esearch -db nuccore -query "QOUH01147937"  \
  | efetch -format docsum \
  | xtract -pattern DocumentSummary -element TaxId

However, epost (for batch queries) works with the _base_ accession, but doesn't work with the version accessions:

echo "QOUH00000000" \
   | epost -db nuccore -format acc \
   | efetch -format docsum \
   | xtract -pattern DocumentSummary -element TaxId

echo "QOUH01147937" \
   | epost -db nuccore -format acc \
   | efetch -format docsum \
   | xtract -pattern DocumentSummary -element TaxId

ERROR in count output: Empty result - nothing to do
URL: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nuccore&query_key=1&WebEnv=NCID_1_157744410_130.14.18.97_9001_1571189278_99587227_0MetA0_S_MegaStore&retmax=0&usehistory=y&edirect=6.10&tool=entrez-direct-count&email=program2@biotec01.esalq.usp.br+
Db value not found in summary input
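Since epost accepts the _base_ accession, one possible workaround is to collapse the contig accessions to their unique project accessions first and post only those. The sketch below assumes the input contains only WGS contig accessions and that zeroing every digit gives the project accession, as in the QOUH example; `unique_wgs_masters` is a made-up helper name.

```shell
# Reduce a list of WGS contig accessions (stdin, one per line) to the
# unique project accessions: drop any ".version" suffix, replace every
# digit with 0, then de-duplicate.
unique_wgs_masters() {
  sed -E 's/\.[0-9]+$//; s/[0-9]/0/g' | sort -u
}

# Possible use with EDirect (needs network access; accessions.txt is a
# placeholder file name, and Caption is assumed to be the docsum field
# holding the accession, so check it against real docsum output):
#   unique_wgs_masters < accessions.txt \
#     | epost -db nuccore -format acc \
#     | efetch -format docsum \
#     | xtract -pattern DocumentSummary -element Caption TaxId
```

A few thousand project accessions are far fewer requests than hundreds of millions of contig accessions, and the resulting project-to-taxid table can then be joined back to the contigs locally.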

I am pinging Pierre Lindenbaum, genomax and vkkodali because they are EDirect wizards and may have suggestions. Another option would be to open an issue at the EDirect_EUtils_API_Cookbook repository.

ADD REPLY • link 4.5 years ago by h.mon 35k