I have a problem mapping GenBank identifiers to their NCBI taxonomical identifiers. I used the accession2taxid files (ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/) for the mapping. However, out of my 400 million GenBank identifiers, only 150 million can be mapped to a taxonomical ID. For example, the identifier QOUH01147937 is not in any mapping file.
$ zgrep QOUH01147937 dead_nucl.accession2taxid.gz dead_wgs.accession2taxid.gz nucl_gb.accession2taxid.gz nucl_wgs.accession2taxid.gz
$
At the same time eutils seems to be able to map the accession. How does eutils get it right and how can I locally replicate its lookup?
$ wget -O - "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=QOUH01147937&rettype=fasta&retmode=xml" 2> /dev/null
<!DOCTYPE TSeqSet PUBLIC "-//NCBI//NCBI TSeq/EN" "https://www.ncbi.nlm.nih.gov/dtd/NCBI_TSeq.dtd">
<TSeqSet>
<TSeq>
<TSeq_seqtype value="nucleotide"/>
<TSeq_accver>QOUH01147937.1</TSeq_accver>
<TSeq_sid>gnl|WGS:QOUH01|Supernova_9898_2</TSeq_sid>
<TSeq_taxid>181123</TSeq_taxid>
…
</TSeqSet>
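For a handful of accessions, the two relevant fields can be pulled out of that XML with standard tools. This is only a minimal sketch: the function name tseq_to_tsv is made up for illustration, and it assumes each element sits on its own line and that every record has both TSeq_accver and TSeq_taxid, in that order.

```shell
# Hypothetical helper: read efetch TSeq XML on stdin and print
# "accession.version<TAB>taxid", one record per line.
# Assumes one element per line and accver before taxid in every record.
tseq_to_tsv() {
    sed -n \
        -e 's:.*<TSeq_accver>\(.*\)</TSeq_accver>.*:\1:p' \
        -e 's:.*<TSeq_taxid>\(.*\)</TSeq_taxid>.*:\1:p' |
        paste - -   # join each accver line with the taxid line that follows
}
```

It would be used by piping the wget output above into tseq_to_tsv. The efetch id parameter also accepts a comma-separated list, but this does not scale to hundreds of millions of accessions.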
Thanks a lot for your answer! This explains why I could not grep for the accession. However, I still do not know how to efficiently map the accessions to a taxonomical identifier. I guess I need to query the project accession, but this is a different problem.
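One way to shrink the problem, sketched here under the assumption that all contigs of a WGS project share the taxid of the project's master record: collapse each contig accession to its master accession by zeroing the serial digits (QOUH01147937 becomes QOUH01000000), then look up only the much smaller set of masters. The function name wgs_master is hypothetical, and whether the master accession is present in the accession2taxid files still needs to be checked.

```shell
# Hypothetical helper: derive the WGS master (project) accession from a
# contig accession. WGS accessions are a 4- or 6-letter prefix, a 2-digit
# assembly version, then 6 or more serial digits.
wgs_master() {
    acc=${1%%.*}                                   # drop a ".1"-style version suffix
    printf '%s\n' "$acc" |
        grep -Eq '^([A-Z]{4}|[A-Z]{6})[0-9]{8,}$' || return 1   # not WGS-shaped
    letters=$(printf '%s\n' "$acc" | sed 's/[0-9].*$//')        # project prefix
    digits=${acc#"$letters"}                       # assembly version + serial
    serial=${digits#??}                            # serial digits only
    ver=${digits%"$serial"}                        # 2-digit assembly version
    zeros=$(printf '%s\n' "$serial" | sed 's/[0-9]/0/g')        # same length, all zeros
    printf '%s%s%s\n' "$letters" "$ver" "$zeros"
}
```

Running the unmapped accessions through this and then through sort -u should leave one line per project, which is a far more realistic input for either zgrep against the mapping files or a remote query.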
I was able to put something together with the help of EDirect_EUtils_API_Cookbook:
However, epost (for batch queries) works with the _base_ accession, but doesn't work with the versioned accessions:

I am pinging Pierre Lindenbaum, genomax and vkkodali because they are EDirect wizards and may have suggestions. Another option would be to open an issue at the EDirect_EUtils_API_Cookbook repository.