Question

Are old versions of NCBI's nr stored somewhere?

3

10.0 years ago

5heikki 11k

Hello,

I'd like to study how NCBI's non-redundant protein database (nr) has developed over the years. However, I have yet to find a way to download anything but the latest release from the NCBI FTP site. Are the old versions lost for good from the public domain?

ncbi blast nr • 7.3k views

ADD COMMENT • link updated 7.4 years ago by natasha.sernova ★ 4.0k • written 10.0 years ago by 5heikki 11k

0

I think I could live with protein subsets of GenBank releases, but I haven't exactly figured out from where to download those either.

ADD REPLY • link 10.0 years ago by 5heikki 11k

0

You can try asking the folks at NCBI if they have archived versions they could give you access to... see http://www.ncbi.nlm.nih.gov/About/glance/contact_info.html for details of how to contact them.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 10.0 years ago by hpmcwill ★ 1.2k

2

8.2 years ago

lukaskoz ▴ 30

Yes, NCBI should archive old nr releases, e.g. every month; they are crucial for bioinformatics work. In the meantime, you can use my copy (far from perfect, but better than nothing):

ftp://genesilico.pl/lukaskoz/biological_databases/

ADD COMMENT • link 8.2 years ago by lukaskoz ▴ 30

0

This is a blessing! You are my hero.

ADD REPLY • link 3.5 years ago by igor • 0

1

10.0 years ago

Neilfws 49k

I don't believe that NCBI archives old database versions.

The best I can suggest is that you start from GenPept, extract the sequence and the submission date, then bin sequences by a suitable date interval and derive your own non-redundant set using e.g. CD-HIT. It would be a lot of work.
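
As a rough illustration of the binning step, here is a minimal sketch that reads GenPept flat-file text and prints a year for each record. `locus_years` is a hypothetical helper name, not an NCBI tool. It uses the date at the end of each LOCUS line, which is the record's last-modification date and so only approximates the submission date; treat the result as a starting point rather than a faithful reconstruction.

```shell
#!/bin/bash
# locus_years: hypothetical helper (not part of any NCBI toolkit).
# Reads GenPept flat-file text on stdin and prints "year<TAB>locus_name"
# for every record whose LOCUS line ends in a DD-MON-YYYY date.
# Assumption: the LOCUS date is the last-modification date, used here
# only as a stand-in for the submission date.
locus_years() {
    awk '$1 == "LOCUS" && $NF ~ /^[0-9][0-9]-[A-Z][A-Z][A-Z]-[0-9][0-9][0-9][0-9]$/ {
        split($NF, d, "-")      # d[1]=day, d[2]=month, d[3]=year
        print d[3] "\t" $2      # year, then the locus name
    }'
}
```

Given a local GenPept file (say a hypothetical `records.gp`), `locus_years < records.gp | cut -f1 | sort | uniq -c` would then give the number of records per year.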

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 10.0 years ago by Neilfws 49k

0

Yeah, this is a reasonable approach. The script below (requires EDirect in your PATH) fetches the sequences added in a given year. For cumulative databases one obviously needs to include the sequences from the earlier years too. Anyway, I'm kind of shocked at how difficult this whole task turned out to be. One would think that the whole point of e.g. GenBank releases was that you could go back to older releases to e.g. verify the results of some study.

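#!/bin/bash
# Fetch the protein sequences for each year into <year>.fasta.
for i in {1990..2014}
do
    # grep . drops empty lines from the efetch output
    esearch -db protein -query "${i}[Publication Date]" | efetch -format fasta | grep . > "$i.fasta"
done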

I have to point out, though, that the EDirect utilities are pretty horrible with large downloads, as is usual for all non-FTP traffic to and from NCBI, so the above script is guaranteed to fail when downloading all the proteins of the later years.
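
For the cumulative databases mentioned above, a minimal sketch of the concatenation step might look like this. It assumes per-year files named `<year>.fasta` already exist locally, as written by the script above. `build_cumulative` and the `upto_<year>.fasta` output names are hypothetical, chosen here for illustration.

```shell
#!/bin/bash
# build_cumulative: hypothetical helper, not part of EDirect.
# Usage: build_cumulative FIRST_YEAR LAST_YEAR [DIR]
# For each year Y in the range, writes DIR/upto_Y.fasta containing the
# sequences of every year from FIRST_YEAR through Y.
build_cumulative() {
    local first=$1 last=$2 dir=${3:-.}
    local year running="$dir/.running.fasta"
    : > "$running"                       # start from an empty running total
    for ((year = first; year <= last; year++)); do
        # A missing year file is skipped rather than treated as an error.
        if [ -f "$dir/$year.fasta" ]; then
            cat "$dir/$year.fasta" >> "$running"
        fi
        cp "$running" "$dir/upto_$year.fasta"
    done
    rm -f "$running"
}
```

Calling `build_cumulative 1990 2014` in the download directory would produce one snapshot per year. The snapshots still contain duplicate sequences, so a deduplication pass is needed before they resemble nr.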

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 10.0 years ago by 5heikki 11k

1

7.4 years ago

natasha.sernova ★ 4.0k

See my answer to this post; it contains a link to NCBI's archive of old versions:

where can I get environmental bacteria genome in fasta format (as many as possible)?

This one is just for bacteria:

ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/

This one is for the others:

ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq

ADD COMMENT • link 7.4 years ago by natasha.sernova ★ 4.0k

0

7.4 years ago

blanca ▴ 10

In case someone else needs the old nr version (with gi numbers), I have found it here:

http://www.matrixscience.com/help/seq_db_setup_nr_gi.html

ADD COMMENT • link 7.4 years ago by blanca ▴ 10

Ram · Accepted Answer · 2014-04-15

As far as I am aware, NCBI do not provide archived versions of the 'nr' database, although they might be available upon request.

However, since most of the sequences in 'nr' come from the protein translations in GenBank, and UniProt provides archived releases of UniProtKB (which includes translations from EMBL-Bank), the UniProt releases would probably cover what you need. See ftp://ftp.uniprot.org/pub/databases/uniprot/previous_releases/.

Alternatively, UniProt's UniParc database is equivalent to NCBI's 'nr' database and provides additional date information, which would allow you to create subsets reflecting the database at a particular date. For the XML version of the UniParc database, which contains the additional information, see ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/uniparc/

Please note: the NCBI 'nr' database and the UniParc database are sets of non-identical sequences (i.e. each database contains one entry for each unique sequence, with metadata providing details of all the source entries containing that sequence). Non-redundant sequence databases such as UniRef, or those generated with CD-HIT, are different: they merge subsequences, such as those from sequencing fragments, into either the longest or a representative sequence. To generate your own 'nr'-like database(s), use the 'nrdb' program (http://blast.advbiocomp.com/pub/nrdb/) on your collection of sequences.
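
To make the "non-identical" idea concrete, here is a small awk sketch that collapses exact duplicate sequences in a FASTA stream and merges their headers. It is not the nrdb program and does not reproduce its options or output format. `collapse_identical` is a hypothetical name, and the " ; " header separator is an arbitrary choice made for this example.

```shell
#!/bin/bash
# collapse_identical: hypothetical helper illustrating the idea; NOT nrdb.
# Reads FASTA on stdin and writes one record per distinct sequence,
# with the headers of all identical source entries joined by " ; ".
# Sequences are compared case-insensitively after removing whitespace.
# Everything is held in memory, so this suits small test sets only.
collapse_identical() {
    awk '
        function store() {
            if (!have) return
            if (!(seq in merged)) { order[++n] = seq; merged[seq] = header }
            else merged[seq] = merged[seq] " ; " header
        }
        /^>/ { store(); header = substr($0, 2); seq = ""; have = 1; next }
        { gsub(/[ \t\r]/, ""); seq = seq toupper($0) }
        END {
            store()
            for (i = 1; i <= n; i++) print ">" merged[order[i]] "\n" order[i]
        }
    '
}
```

Running `collapse_identical < upto_2000.fasta > nr_like_2000.fasta` (file names hypothetical) keeps the first-seen order of sequences. Unlike CD-HIT or UniRef clustering, a fragment that is a subsequence of a longer entry stays as its own record.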