Question: Exhaustive up-to-date protein database?? nr?
5.7 years ago by
purkantiramya20 wrote:

Hello all,

I am in a fix. For my work, I need an exhaustive and up-to-date database of
all protein sequences, especially covering ALL eukaryotes sequenced to date.
I therefore downloaded the latest version of the nr database from NCBI's FTP site.
However, I find that it does not contain all the putative protein
sequences listed in the individual genome databases.

For discussion purposes, let us consider Cyanophora paradoxa (taxonomy
ID: 2762). According to the website, it has 32,167 protein-coding
sequences. However, only 731 GI IDs correspond to this
species in the latest nr database (25 May 2014 version). The complete
Cyanophora paradoxa genome was published in 2012 (Price DC et al., 2012).
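A rough way to reproduce this kind of count is to scan the headers of the downloaded FASTA file. The sketch below assumes the nr-style convention in which each header carries the organism name in square brackets and merged entries share one header line separated by Ctrl-A characters; the function name and the demo records are made up for illustration. It counts records that mention the species at least once, not individual identifiers.

```python
import io

def count_species_records(handle, species):
    """Count FASTA records whose header line mentions `[species]`."""
    tag = f"[{species}]"
    count = 0
    for line in handle:
        # Only header lines start with ">"; sequence lines are skipped.
        if line.startswith(">") and tag in line:
            count += 1
    return count

# Tiny in-memory stand-in for a real nr file (hypothetical records).
demo = io.StringIO(
    ">seq1 hypothetical protein [Cyanophora paradoxa]\n"
    "MKV\n"
    ">seq2 kinase [Homo sapiens]\x01seq3 kinase [Cyanophora paradoxa]\n"
    "MAA\n"
    ">seq4 actin [Homo sapiens]\n"
    "MDD\n"
)
print(count_species_records(demo, "Cyanophora paradoxa"))
```

For a real download, the same function can be given an open file handle instead of the in-memory demo.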

Thus, the only option seems to be to download the protein sequences from the individual genome projects, merge identical entries into a single entry, and then append the result to the nr database to get an exhaustive set of proteins. However, before I begin this mammoth task, I wanted to ask whether there is a simpler solution to this problem.
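For reference, the "merge identical entries" step could look roughly like the sketch below. It is only an illustration: the helper names and the demo records are invented, identity is taken to mean an exact match after upper-casing and dropping a trailing `*` stop symbol, and all sequences are held in memory, which would not scale to the full nr without changes.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def merge_identical(sources):
    """Collapse identical sequences, keeping every header that pointed at them."""
    merged = {}  # sequence -> list of headers, in first-seen order
    for lines in sources:
        for header, seq in read_fasta(lines):
            key = seq.upper().rstrip("*")  # ignore case and a trailing stop symbol
            merged.setdefault(key, []).append(header)
    return merged

# Hypothetical records from two genome project downloads.
project_a = [">a1 protein X", "MKVL", "AA*"]
project_b = [">b7 protein X copy", "mkvlaa", ">b8 other", "MTT"]

for seq, headers in merge_identical([project_a, project_b]).items():
    print(">" + " | ".join(headers))
    print(seq)
```

Keeping all headers for a merged sequence preserves the link back to each source project, similar in spirit to how nr lists several identifiers per non-redundant entry.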

Any help or suggestions are very welcome.

Thanks a lot for your time,

nr database protein database • 2.7k views
modified 5.7 years ago by hpmcwill1.1k • written 5.7 years ago by purkantiramya20

I would suggest UniProt and TrEMBL.

BTW, nice to meet you here Ramya :)

modified 16 days ago by RamRS25k • written 5.7 years ago by Bharat Iyengar270

Thanks for the suggestion, Bharat. I agree that UniProt + TrEMBL ~= UniParc is the closest I can get. However, even that seems to list only 'complete proteomes' and not proteomes for draft genomes. For any other species, I will have to go to the individual genome pages.

modified 5.7 years ago • written 5.7 years ago by purkantiramya20

You may find "What Are The Proteomics Data Repositories?" useful.

modified 5.7 years ago • written 5.7 years ago by Bharat Iyengar270
5.7 years ago by
United Kingdom
hpmcwill1.1k wrote:

Well, the current NCBI nr contains 40,337,612 sequences originating from GenBank CDS translations (excluding those from environmental samples and WGS projects), UniProtKB/Swiss-Prot, PDB and PRF.

UniProt's UniParc database has more coverage, currently containing 63,875,797 unique protein sequences, including the CDS translations from the INSDC databases (DDBJ, EMBL-Bank and GenBank) which are excluded from NCBI nr, and additional sequences from various other sources.

You could also have a look at SIMAP, which contains additional protein sequences from metagenomics experiments.

However, looking at the paper (Cyanophora paradoxa genome elucidates origin of photosynthesis in algae and plants) and searching the nucleotide databases, it appears that the only submission has been to the Sequence Read Archive:

The corresponding assembly and associated feature annotations do not appear to have been submitted yet, presumably because this is currently a draft genome. Thus, the protein sequences, which would be derived through translation of INSDC CDS features, do not appear in NCBI nr or UniProtKB.

In general you should be able to start from a non-identical sequence archive such as UniParc and add unique protein sequences from other sources, assuming you can obtain them with appropriate annotations for your purposes.
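As a minimal sketch of that approach (not a description of how UniParc itself works): load the base archive, remember a digest of every sequence, and keep only those records from another source whose digest has not been seen. The function names and demo identifiers below are hypothetical, and "unique" is assumed to mean an exact, case-insensitive sequence match.

```python
import hashlib

def fasta_records(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif line:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)

def digest(seq):
    """Fixed-size fingerprint, so full sequences need not be kept in memory."""
    return hashlib.sha256(seq.upper().encode("ascii")).digest()

def novel_records(base_lines, extra_lines):
    """Yield records from extra_lines whose sequence is absent from the base."""
    seen = {digest(seq) for _, seq in fasta_records(base_lines)}
    for header, seq in fasta_records(extra_lines):
        d = digest(seq)
        if d not in seen:
            seen.add(d)  # also drops duplicates within the extra source
            yield header, seq

# Hypothetical base archive and genome-project download.
base = [">UPI_example_1", "MKVLAA", ">UPI_example_2", "MTT"]
extra = [">cp_0001", "MKVLAA", ">cp_0002", "MGGS", ">cp_0003", "mggs"]

for header, seq in novel_records(base, extra):
    print(f">{header}\n{seq}")
```

Storing digests rather than sequences keeps the memory footprint roughly constant per entry; the annotations for the newly added records still have to come from the original source.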

modified 16 days ago by RamRS25k • written 5.7 years ago by hpmcwill1.1k

Thanks a lot @hpmcwill. It's definitely a nudge in the right direction. I browsed through the UniParc dataset and it does give me "complete proteomes" for species whose genomes have been completely sequenced. What I am still missing is proteomes for draft genomes. Specifically, in the case of Cyanophora paradoxa, I could find its corresponding proteome (32,167 sequences) at the link ''. However, I was wondering whether there is a database where such 'incomplete proteomes' are listed together, so that I do not have to visit each individual genome page. For now I shall follow your suggestion, take the UniParc database as my base and build upon it. Thank you very much.

written 5.7 years ago by purkantiramya20


There is no such "draft proteome" flag in UniProtKB, only "complete proteome" and "reference proteome", defined here.

However, @hpmcwill has a good point here: the 32,167 protein sequences you are referring to do not seem (to me) to have been published in a general database (INSDC, RefSeq, Ensembl, ...), which means they won't be in UniParc or UniProtKB either.

For the list of databases UniParc gets data from, see the data source section.

So I'm afraid the only place you can find those sequences, until they are submitted, is where you found them.

Sorry for not helping more

modified 16 days ago by RamRS25k • written 5.7 years ago by Ben0

Dear Ben, thanks for the comment. As everyone here agrees, for a specific individual genome there is no way but to go to its specific page. It seems that the protein annotations of many newly sequenced genomes are not submitted to the central repositories.

written 5.7 years ago by purkantiramya20

You may want to contact the providers of the missing individual genomes and ask when they plan to submit their annotated sequence data to the major resources. It may be that this step in getting the data to as many users as possible has been forgotten in the day-to-day work of doing research on the data, and they just need to be reminded.

written 5.7 years ago by hpmcwill1.1k