2.1 years ago
matt ▴ 20
I would like to understand the difference between:
- 'UniProt', e.g. UP000005640 URL: https://www.ebi.ac.uk/interpro/proteome/uniprot/UP000005640/
- 'nr. Non-redundant GenBank CDS translations + PDB + SwissProt + PIR + PRF, excluding those in env_nr.' as used e.g. for BLAST-ing on NCBI website, URL: https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome
Any comments would be much appreciated. Btw, I am only interested in the 'human' subset of both.
Take a look at this answer here.
But I'm wondering about this too now, given this page on UniProt seems to suggest (to me) that
SwissProt) are more equivalent to one another now than they were years ago (that answer is over three quarters of a decade old). BTW, these are the kinds of proteins
Since @matt is specifically looking at the proteome (LINK) on the UniProt page, it is only referring to curated protein entries (20,380 reviewed + 56,647 un-reviewed). If you simply look at UniProtKB 2021_20 results for Human then there are 20,395 (reviewed) and 175,716 (un-reviewed) entries as of today.
The nr database, as it says, is a non-redundant collection of sequences. It may contain sequences that are partial. As of today entries labeled with Human (taxID 9606) are
As a test I ran 'UniProtKB' and 'nr' locally against an AA sequence of 'Immunoglobulin kappa constant', P01834 (https://www.uniprot.org/uniprot/P01834). While with the former I get what one expects (left), with 'nr' I do not (right - the match with highest score)
I am obviously missing something here, as I would expect 'nr' to store such basic sequences. Perhaps https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_protein.faa.gz is not the entire human subset of 'nr'?
Thanks for all the answers so far anyways.
RefSeq you're looking at, I think. If you use the blast webserver and search against NR with the taxonomic scope set to Homo sapiens, you will get this as your best hit.
OK, I could reproduce it using the blast webserver (https://blast.ncbi.nlm.nih.gov/Blast.cgi). How do I get it locally? The file GRCh38_latest_protein.faa does not seem to have it.
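For what it's worth, one quick way to confirm whether a local FASTA file really lacks a given protein is to scan it for an exact sequence match before reaching for blast. The following is only a minimal sketch: the function names `read_fasta` and `find_sequence` are made up for illustration, and exact substring matching will miss near-identical sequences (that is what blastp is for).

```python
import gzip
from typing import Iterator, List, Tuple


def read_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a plain or gzipped FASTA file."""
    opener = gzip.open if path.endswith(".gz") else open
    header, chunks = None, []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_sequence(path: str, query: str) -> List[str]:
    """Return headers of all records whose sequence contains `query` exactly."""
    query = query.upper()
    return [h for h, seq in read_fasta(path) if query in seq.upper()]
```

Usage would be something like `find_sequence("GRCh38_latest_protein.faa.gz", seq)`, where `seq` is the amino-acid sequence copied from the UniProt entry. An empty list means no record contains that exact sequence.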
I don't know if there's any other way that's easier than just downloading all of NR and restricting the search target to Homo sapiens. But I think that's a bit irrelevant: the record in question comes from UniProt, so the easier way would just be to search through a UniProt database (of H. sapiens sequences in this case).
I don't think this is helping you though. I have a feeling the problem you're trying to solve is something else, and this is just something you encountered along the way.
The UniProt entry says that there is 'Experimental evidence at protein level' with the following caveat
Further explanation is provided in help.
Thanks GenoMax but it's not exactly what I asked. How can I get locally the same results as in the blast-ing you did to find https://www.ncbi.nlm.nih.gov/protein/P01834.2?
As Dunois said below, you should be able to get the same/similar result (you will have to test parameters for local blast) by downloading the nr database and then limiting your searches to human entries with the -taxids 9606 (human) option in your blastp command line. This will include protein entries from UniProt.
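A rough sketch of what such a command line could look like, built from Python so that it can be scripted. It assumes a BLAST+ release recent enough to support `-taxids` and a version 5 `nr` database (with taxonomy information) already on disk. The file names `query.faa` and `hits.tsv`, the thread count, and the helper name `build_blastp_command` are placeholders for illustration, not anything from this thread.

```python
import shlex
from typing import List


def build_blastp_command(query: str, db: str, out: str,
                         taxid: int = 9606, threads: int = 4) -> List[str]:
    """Assemble a blastp call restricted to one taxid (9606 = Homo sapiens)."""
    return [
        "blastp",
        "-query", query,            # FASTA file with the query protein(s)
        "-db", db,                  # database name/path, e.g. a local nr
        "-taxids", str(taxid),      # limit the search to this taxonomy ID
        "-outfmt", "6",             # tabular output
        "-num_threads", str(threads),
        "-out", out,
    ]


cmd = build_blastp_command("query.faa", "nr", "hits.tsv")
print(shlex.join(cmd))  # the equivalent shell command line
```

To actually run it, pass the list to `subprocess.run(cmd, check=True)`. Default scoring parameters on the NCBI website and in local blastp may differ, so the hit list is not guaranteed to be identical.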
That protein is there in the
While the human proteome is reasonably complete, it is still evolving, so there is bound to be some discrepancy between databases. Not sure what it is that you want to do, but the UniProt proteome will likely be the best representation you have at the moment.
RefSeq should be close behind since those are also human curated datasets.
I would like to be able to do 'blastp' with clients' proprietary sequences, which cannot be processed on a public website.
Until now I did it with UniProtKB but wanted to run it on the 'nr' database as well, which some of my team colleagues used initially. Regarding the 'nr' database I got the following comment from NCBI User Services:
'The protein nr database is NOT organized by taxonomic breakdown. In other word, human sequences could be (and more likely are) present in every volumes.'
That means I would have to download all the 47 files (between 2-3GB each) which I wanted to avoid. I started to work with 'https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_protein.faa.gz' but that didn't pass the P01834 test.
I am in the process of downloading all 47 nr files, which will take hours to days and eat up 80% of my free hard drive space.
No option but to download nr indexes and do the search locally then. Be sure to download the taxid files as well from where you are getting the
After all, I realised I don't have that much space on my hard drive: after unpacking these 47 files it would need at least about 750GB. However, all your comments have convinced me that doing it is not of much use given we have UniProtKB. Thank you all for the comments!
NR is 750GB? My local copy clocks in at 135GB uncompressed, and it's less than a year old. The diamond database is about the same size. Did you get the FASTA off of here: https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ ?
Ah, I was downloading the 47 nr files from https://ftp.ncbi.nlm.nih.gov/blast/db
I didn't know there is a FASTA subfolder... Thanks, I might try after all!
I didn't know that for a long time either, it's really cool that NCBI offers all that so accessibly.
Good luck, and let us know how it went!!
Edit: just want to mention, I think it is possible to restrict a search with diamond to a specific taxon just like in blast. Take a look at their help documentation and probably also these
There is no point in downloading the fasta files since you will need to make the index yourself.
Running blast locally will require tens of GB of RAM (or it would be very slow if swap disks come into play). There are no simple solutions here. If you don't have the necessary hardware available locally, consider using a cloud environment.
I'd add to GenoMax's point that if you're going to run a search against NR on a local machine, it's probably unwise at this point to use blast.
MMSeqs2 under maximum sensitivity would be way, way faster at some loss of sensitivity (the latest version of Diamond is as sensitive as blast is but is ~80x faster).
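As a hedged illustration of the diamond route with a taxon restriction, here is a small sketch that assembles the command. It assumes a diamond database that was built with taxonomy information (via `diamond makedb` with the `--taxonmap` and `--taxonnodes` options), otherwise `--taxonlist` will not work. The paths, the thread count, the choice of `--very-sensitive`, and the helper name `build_diamond_command` are illustrative assumptions only.

```python
import shlex
from typing import List


def build_diamond_command(query: str, db: str, out: str,
                          taxid: int = 9606, threads: int = 4) -> List[str]:
    """Assemble a diamond blastp call limited to one taxon (9606 = human)."""
    return [
        "diamond", "blastp",
        "--db", db,                  # diamond database built with taxonomy data
        "--query", query,            # FASTA file with the query protein(s)
        "--out", out,
        "--outfmt", "6",             # blast-style tabular output
        "--taxonlist", str(taxid),   # restrict hits to this taxon
        "--very-sensitive",          # slower, more sensitive mode
        "--threads", str(threads),
    ]


diamond_cmd = build_diamond_command("query.faa", "nr", "hits.tsv")
print(shlex.join(diamond_cmd))  # the equivalent shell command line
```

The list can be handed to `subprocess.run(diamond_cmd, check=True)`. Which sensitivity mode is appropriate depends on how divergent the expected hits are, so it is worth checking a known case (such as P01834) first.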
Thanks Dunois, very useful links. On this forum I was re-directed to https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_protein.faa.gz for the human part of the 'nr. Non-redundant GenBank...' database. It is twice as big as the UniProt one and I wonder why.
RefSeq entries contain isoforms.