What is the difference between UniProt and 'nr. Non-redundant GenBank...'?
I would like to understand the difference between UniProt and the 'nr. Non-redundant GenBank...' database.

Any comments would be much appreciated. By the way, I am only interested in the 'human' subset of both.

Take a look at this answer here.

But I'm wondering about this too now, given that this page on UniProt seems to suggest (to me) that NR and UniProt (i.e., TrEMBL and Swiss-Prot) are more equivalent to one another now than they were years ago (that answer is over three quarters of a decade old). BTW, these are the kinds of proteins UniProt excludes.

Since @matt is specifically looking at the proteome (LINK) on the UniProt page, it is only referring to curated protein entries (20,380 reviewed + 56,647 un-reviewed). If you simply look at the UniProtKB 2021_20 results for human, there are 20,395 reviewed and 175,716 un-reviewed entries as of today.

The nr database is, as the name says, a non-redundant collection of sequences. It may contain partial sequences. As of today, the number of entries labeled as human (taxID 9606) is:

$ blastdbcmd -db nr -taxids 9606 -outfmt %a | wc -l
2929407

As a test, I ran the AA sequence of 'Immunoglobulin kappa constant', P01834 (https://www.uniprot.org/uniprot/P01834), locally against 'UniProtKB' and 'nr'. With the former I get what one expects (left); with 'nr' I do not (right, the match with the highest score).

I am obviously missing something here, as I would expect 'nr' to store such basic sequences. Perhaps https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_protein.faa.gz is not the entire human subset of 'nr'?

Thanks for all the answers so far anyways.
P01834.2


While the human proteome is reasonably complete, it is still evolving, so there is bound to be some discrepancy between databases. Not sure what it is that you want to do, but the UniProt proteome will likely be the best representation you have at the moment. RefSeq should be close behind, since those are also human-curated datasets.

I would like to be able to run 'blastp' with clients' proprietary sequences, which cannot be processed on a public website.

Until now I did it with UniProtKB, but I wanted to run it on the 'nr' database as well, which some of my team colleagues used initially. Regarding the 'nr' database, I got the following comment from NCBI User Services:

'The protein nr database is NOT organized by taxonomic breakdown. In other word, human sequences could be (and more likely are) present in every volumes.'

That means I would have to download all 47 files (2-3 GB each), which I wanted to avoid. I started to work with 'https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_protein.faa.gz', but that didn't pass the P01834 test.

I am in the process of downloading all 47 nr files, which will take hours to days and eat up 80% of my free hard drive space.

clients' proprietary sequences, which cannot be processed on a public website.

No option but to download nr indexes and do the search locally then. Be sure to download the taxid files as well from where you are getting the nr files.
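For what it's worth, here is a minimal sketch of what such a local, human-only search could look like once the preformatted nr volumes and the taxonomy files sit in the same directory. It assumes a BLAST+ release recent enough to support the `-taxids` option with version 5 databases. The function name `human_blastp`, the file names `client.faa` and `hits.tsv`, and the `DRY_RUN` switch are hypothetical and only there for illustration.

```shell
# Sketch: restrict a local blastp search against nr to human (taxID 9606).
# The body runs in a subshell "( ... )" so the variables stay local.
# Hypothetical names: human_blastp, client.faa, hits.tsv, DRY_RUN.
human_blastp() (
  query="$1"            # FASTA file with the proprietary sequences
  db="${2:-nr}"         # database name or path prefix
  out="${3:-hits.tsv}"  # tabular output file (-outfmt 6)
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Only print the command that would be executed.
    echo "blastp -db $db -taxids 9606 -query $query -outfmt 6 -out $out"
  else
    blastp -db "$db" -taxids 9606 -query "$query" -outfmt 6 -out "$out"
  fi
)

# Example (prints the command instead of running it):
#   DRY_RUN=1 human_blastp client.faa
```

Nothing leaves the machine this way, which is the point when the sequences cannot go to a public server.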

In the end, I realised I don't have that much space on my hard drive: after unpacking, these 47 files would need at least about 750 GB. However, all your comments have convinced me that doing so is of little use given that we have UniProtKB. Thank you all for the comments!

NR is 750GB? My local copy clocks in at 135GB uncompressed, and it's less than a year old. The diamond database is about the same size. Did you get the NR FASTA off of here: https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ ?

Ah, I was downloading the 47 nr files from https://ftp.ncbi.nlm.nih.gov/blast/db

I didn't know there is a FASTA subfolder... Thanks, I might try after all!

I didn't know that for a long time either, it's really cool that NCBI offers all that so accessibly.

Good luck, and let us know how it went!!

Edit: just want to mention, I think it is possible to restrict a search with diamond to a specific taxon just like in blast. Take a look at their help documentation and probably also these GitHub issues.

There is no point in downloading the fasta files since you will need to make the index yourself.

Using diamond or blast locally will require tens of GB of RAM (or it will be very slow if swap comes into play). There are no simple solutions here. If you don't have the necessary hardware available locally, consider using a cloud environment.

I'd add to GenoMax's point that if you're going to run a search against NR on a local machine, it's probably unwise at this point to use blastp. Diamond or MMseqs2 under maximum sensitivity would be way, way faster at some loss of sensitivity (the latest version of Diamond is as sensitive as blast is but is ~80x faster).

Thanks Dunois, very useful links. On this forum I was redirected to https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_protein.faa.gz for the human part of the 'nr. Non-redundant GenBank...' database. It is twice as big as the UniProt one and I wonder why.

RefSeq entries contain isoforms.

>NP_001372292.1 tudor domain-containing protein 1 isoform 1 [Homo sapiens]
>NP_001372293.1 tudor domain-containing protein 1 isoform 3 [Homo sapiens]
>NP_001372294.1 tudor domain-containing protein 1 isoform 4 [Homo sapiens]
>NP_001372295.1 tudor domain-containing protein 1 isoform 5 [Homo sapiens]
>NP_001372296.1 tudor domain-containing protein 1 isoform 6 [Homo sapiens]
>NP_001372297.1 tudor domain-containing protein 1 isoform 7 [Homo sapiens]
>NP_001372298.1 tudor domain-containing protein 1 isoform 7 [Homo sapiens]
>NP_001372299.1 tudor domain-containing protein 1 isoform 8 [Homo sapiens]
>NP_001372300.1 tudor domain-containing protein 1 isoform 9 [Homo sapiens]
>NP_001372301.1 tudor domain-containing protein 1 isoform 10 [Homo sapiens]
>NP_001372302.1 neuroblastoma breakpoint family member 15 isoform 1 [Homo sapiens]
>NP_001372303.1 neuroblastoma breakpoint family member 15 isoform 1 [Homo sapiens]
>NP_001372304.1 neuroblastoma breakpoint family member 15 isoform 1 [Homo sapiens]
>NP_001372305.1 neuroblastoma breakpoint family member 15 isoform 1 [Homo sapiens]
>NP_001372306.1 neuroblastoma breakpoint family member 15 isoform 1 [Homo sapiens]
>NP_001372307.1 neuroblastoma breakpoint family member 15 isoform 1 [Homo sapiens]
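To get a feel for how much the isoforms inflate the file, one could count records per protein name with the isoform label stripped. This is only a rough sketch: it assumes the headers follow the `>ACCESSION name isoform N [Homo sapiens]` pattern shown above, and the function name `count_records_per_protein` is made up for illustration.

```shell
# Sketch: count FASTA records per protein name, ignoring the isoform label.
# Reads FASTA (or just its header lines) on stdin and prints "count<TAB>name",
# highest counts first. Sequence lines are skipped by the grep.
count_records_per_protein() {
  grep '^>' \
    | sed -E 's/^>[^ ]+ //; s/ \[[^]]*\]$//; s/ isoform .*$//' \
    | awk '{ c[$0]++ } END { for (k in c) printf "%d\t%s\n", c[k], k }' \
    | sort -k1,1nr
}

# Example with the RefSeq file mentioned in this thread:
#   gzip -dc GRCh38_latest_protein.faa.gz | count_records_per_protein | head
```

Proteins with a count above 1 are the ones contributing several records, whether as distinct isoforms or as separate accessions carrying the same isoform label.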