Question: The source of NCBI NR library
9 months ago, huangjs2017 wrote:

I need to obtain taxonomy information (taxon IDs) for the NCBI NR library by protein accession number. I found two useful files, prot.accession2taxid.gz and pdb.accession2taxid.gz, at https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/. However, for some accession numbers I still cannot find any taxonomy information. These accession numbers mainly fall into the following categories:

  1. NCBI shows "Record removed", e.g. "AYN07615.1". Why do removed records appear in the NR library?

  2. Accession numbers from sources I do not recognize, for example pir||S69889 and prf||1403304A.

  3. Accession numbers from PDB that cannot be found in pdb.accession2taxid.gz, for example 6F1U_FF.

How can I obtain taxonomy information for these special accession numbers?
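For reference, the lookup described above can be sketched roughly as follows. This assumes the tab-separated layout of the accession2taxid files (one header line, then the columns accession, accession.version, taxid, gi); the function name `lookup_taxids` is made up for illustration. It streams the file line by line, since the decompressed file is very large.

```python
import gzip


def lookup_taxids(path, wanted):
    """Return {accession.version: taxid} for the wanted IDs found in the file.

    Assumes columns: accession, accession.version, taxid, gi (tab-separated),
    with a single header line at the top.
    """
    wanted = set(wanted)
    found = {}
    with gzip.open(path, "rt") as handle:
        next(handle, None)  # skip the header line
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # ignore malformed lines
            acc_ver = fields[1]
            if acc_ver in wanted:
                found[acc_ver] = int(fields[2])
                if len(found) == len(wanted):
                    break  # every requested ID resolved, stop early
    return found
```

Any requested ID that is missing from the returned dictionary is one of the "special" cases listed above and has to be resolved some other way.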

next-gen assembly sequence • 315 views

Which version of BLAST/nr are you using (a local copy?)? Or are you simply looking for the taxonomy of each protein?

written 9 months ago by lieven.sterck

I downloaded the NR library from https://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz.

And yes, I am simply looking for the taxonomy of each protein. But I cannot obtain it for every protein from the headers in the NR FASTA file, because of some non-standard naming and possibly duplicated taxon names (one taxon name can map to multiple taxon IDs).
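To work from accession numbers rather than organism names, the identifiers can be pulled out of each header first. A small sketch, assuming the usual nr.gz convention that identical sequences are merged into one record whose header entries are separated by the Ctrl-A character (`\x01`), each entry starting with its identifier; `accessions_from_defline` is a hypothetical helper name.

```python
def accessions_from_defline(defline):
    """Return the leading identifier of every entry in a merged nr header.

    Assumes entries are separated by Ctrl-A (\\x01) and that each entry
    begins with its identifier, followed by whitespace and a description.
    """
    entries = defline.lstrip(">").split("\x01")
    return [entry.split(None, 1)[0] for entry in entries if entry.strip()]
```

The organism names in square brackets are ignored on purpose, which sidesteps the ambiguous-name problem.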

written 9 months ago by huangjs2017

The 'removed' record might be because the version you can download is always a little behind the online version (normally you can check when a record was removed, and I would not be surprised if that date is after the time you downloaded nr from NCBI).

PIR and PRF are not unknown resources; lesser known, OK. Normally both (or at least PIR) are nowadays included in UniProt.

For the PDB one you have to search for 6F1U, I think (the _FF denotes the chain).
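A rough sketch of how the three kinds of identifiers from the question could be told apart before looking them up. The function name and the returned dictionary layout are made up for illustration, and the PDB pattern assumes classic four-character PDB IDs that start with a digit.

```python
import re

# Assumption: classic PDB IDs are 4 characters starting with a digit,
# followed by an underscore and a chain label.
PDB_CHAIN = re.compile(r"^([0-9][A-Za-z0-9]{3})_([A-Za-z0-9]+)$")


def classify_identifier(identifier):
    """Split an nr identifier into its source database and bare ID."""
    if "|" in identifier:
        # Bar-delimited form such as pir||S69889 or prf||1403304A.
        database, _, rest = identifier.partition("|")
        return {"source": database, "id": rest.strip("|")}
    match = PDB_CHAIN.match(identifier)
    if match:
        # PDB entry plus chain, e.g. 6F1U_FF -> entry 6F1U, chain FF.
        return {"source": "pdb", "id": match.group(1), "chain": match.group(2)}
    return {"source": "accession", "id": identifier}
```

With this, 6F1U_FF gives the entry 6F1U to search for, and the pir/prf identifiers give a bare ID to try against UniProt.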

written 9 months ago by lieven.sterck