I am in a fix. For my work, I need an exhaustive and up-to-date database of
all protein sequences, especially one covering ALL eukaryotes sequenced to date.
I therefore downloaded the latest version of the nr database from NCBI's FTP site.
However, I find that it does not contain all the putative protein
sequences listed in the individual genome databases.
For the sake of discussion, consider Cyanophora paradoxa (taxonomy
ID: 2762). According to the website
http://cyanophora.rutgers.edu/cyanophora/blast.php, it has 32,167
protein-coding sequences. However, only 731 GI numbers correspond to this
species in the latest nr database (25 May 2014 version), even though the
complete Cyanophora paradoxa genome was published in 2012 (Price DC et al., 2012).
Thus, the only option I can see is to download the protein sequences from the individual genome projects and merge identical sequences into a single entry. I would then append them to the nr database to get an exhaustive set of proteins. However, before I begin this mammoth task, I thought I would ask whether there is a simpler solution to this problem.
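To make the merging step concrete, here is a minimal sketch in Python (standard library only) of what I have in mind. The function names (read_fasta, merge_identical, write_fasta) are my own and purely illustrative. Two things are assumptions on my part: that sequences count as identical after upper-casing and dropping a trailing stop symbol "*", and that joining the headers with Ctrl-A (\x01) is a reasonable way to mimic how nr concatenates the deflines of identical sequences.

```python
from collections import OrderedDict


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_identical(records, sep="\x01"):
    """Collapse records that share a sequence into one entry.

    Assumption: sequences are compared after upper-casing and removing
    a trailing '*'. Headers of identical sequences are joined with `sep`.
    """
    merged = OrderedDict()
    for header, seq in records:
        key = seq.upper().rstrip("*")
        merged.setdefault(key, []).append(header)
    return [(sep.join(headers), seq) for seq, headers in merged.items()]


def write_fasta(records, handle, width=60):
    """Write (header, sequence) pairs as FASTA, wrapping at `width` residues."""
    for header, seq in records:
        handle.write(">" + header + "\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")
```

In practice I would open each downloaded proteome file, chain the records from read_fasta through merge_identical, and write the result with write_fasta. This keeps every unique sequence in memory, so it is probably only workable for the per-genome files and not for something the size of nr itself.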
Any help or suggestions would be greatly appreciated.
Thanks a lot for your time,