3.3 years ago
matt
I would like to understand the difference between
- 'UniProt', e.g. UP000005640 URL: https://www.ebi.ac.uk/interpro/proteome/uniprot/UP000005640/
and
- 'nr. Non-redundant GenBank CDS translations + PDB + SwissProt + PIR + PRF, excluding those in env_nr.' as used e.g. for BLAST-ing on NCBI website, URL: https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome
Any comments would be much appreciated. Btw, I am only interested in the 'human' subset of both.
Take a look at this answer here.
But I'm wondering about this too now, given that this page on UniProt seems to suggest (to me) that NR and UniProt (i.e., TrEMBL and SwissProt) are more equivalent to one another now than they were years ago (that answer is over three quarters of a decade old). BTW, these are the kinds of proteins UniProt excludes.
Since @matt is specifically looking at the proteome (LINK) on the UniProt page, it refers only to curated protein entries (20380 reviewed + 56,647 un-reviewed). If you simply look at UniProtKB 2021_20 results for Human, there are 20395 (reviewed) and 175,716 (un-reviewed) entries as of today.
The nr database is, as it says, a non-redundant collection of sequences. It may contain sequences that are partial. As of today entries labeled with Human (taxID 9606) are
As a test, I locally searched 'UniProtKB' and 'nr' with the AA sequence of 'Immunoglobulin kappa constant', P01834 (https://www.uniprot.org/uniprot/P01834). While in the former I get what one expects (left), in 'nr' I do not (right - the match with the highest score)
I am obviously missing something here, as I would expect 'nr' to store such basic sequences. Perhaps https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_protein.faa.gz is not the entire human subset of 'nr'?
Thanks for all the answers so far anyways.
That's RefSeq you're looking at, I think. If you use the blast webserver and search against NR with the taxonomic scope set to Homo sapiens, you will get this as your best hit.
OK, I could reproduce it using the blast webserver (https://blast.ncbi.nlm.nih.gov/Blast.cgi). How do I get it locally? The file GRCh38_latest_protein.faa does not seem to have it.
I don't know if there's any other way that's easier than just downloading all of NR and restricting the search target to Homo sapiens. But I think that's a bit irrelevant: the record in question comes from UniProt, so the easier way would just be to search through a UniProt database (of H. sapiens sequences in this case).
I don't think this is helping you though. I have a feeling the problem you're trying to solve is something else, and this is just something you encountered along the way.
The UniProt entry says that there is Experimental evidence at protein level, with the following caveat
Further explanation is provided in help.
Thanks GenoMax, but it's not exactly what I asked. How can I get the same results locally as in the BLAST search you did to find https://www.ncbi.nlm.nih.gov/protein/P01834.2?
As Dunois said below, you should be able to get the same/similar result (you will have to test parameters for local blast) by downloading the nr database and then limiting your searches to human entries with the -taxids 9606 (human) option in your blastp command line. This will include protein entries from UniProt.
That protein is there in the nr db.
While the human proteome is reasonably complete, it is still evolving, so there is bound to be some discrepancy between databases. Not sure what it is that you want to do, but the UniProt proteome will likely be the best representation you have at the moment. RefSeq should be close behind since those are also human-curated datasets.
I would like to be able to run 'blastp' with clients' proprietary sequences, which cannot be processed on a public website.
Until now I did it with UniProtKB but wanted to run it on the 'nr' database as well, which some of my team colleagues used initially. Regarding the 'nr' database I got the following comment from NCBI User Services:
'The protein nr database is NOT organized by taxonomic breakdown. In other word, human sequences could be (and more likely are) present in every volumes.'
That means I would have to download all 47 files (2-3GB each), which I wanted to avoid. I started to work with 'https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_protein.faa.gz' but that didn't pass the P01834 test.
I am in the process of downloading all 47 nr files, which will take hours to days and eat up 80% of my free hard drive space.
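For reference, the taxonomy-restricted local search suggested above could be sketched roughly as follows. This only assembles the command line; it assumes a BLAST+ install recent enough to support -taxids, a taxonomy-aware copy of nr on the local machine, and hypothetical file names (client_sequences.faa, hits_human.tsv). The e-value and thread count are placeholders to tune, not recommended settings.

```python
from typing import List


def build_blastp_command(query_fasta: str, db: str, out_path: str,
                         taxid: int = 9606, evalue: float = 1e-5,
                         threads: int = 4) -> List[str]:
    """Assemble a blastp call restricted to a single NCBI taxonomy ID."""
    return [
        "blastp",
        "-query", query_fasta,         # proprietary sequences stay on this machine
        "-db", db,                     # name/path of the local nr database
        "-taxids", str(taxid),         # 9606 = Homo sapiens
        "-evalue", str(evalue),        # placeholder threshold
        "-outfmt", "6",                # tabular output
        "-num_threads", str(threads),
        "-out", out_path,
    ]


if __name__ == "__main__":
    # File names below are hypothetical.
    cmd = build_blastp_command("client_sequences.faa", "nr", "hits_human.tsv")
    print(" ".join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Keeping the arguments as a list (rather than one shell string) avoids quoting problems if paths contain spaces.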
No option but to download the nr indexes and do the search locally then. Be sure to download the taxid files as well from where you are getting the nr files.
After all, I realised I don't have that much space on my hard drive; after unpacking, these 47 files would need at least 750GB. However, all your comments have convinced me that doing it is not of much use given we have UniProtKB. Thank you all for the comments!
NR is 750GB? My local copy clocks in at 135GB uncompressed, and it's less than a year old. The diamond database is about the same size. Did you get the NR FASTA off of here: https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ ?
Ah, I was downloading the 47 nr files from https://ftp.ncbi.nlm.nih.gov/blast/db
I didn't know there is a FASTA subfolder... Thanks, I might try after all!
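If disk space is the main worry, one option with the FASTA download is to stream the gzipped file and keep only human records, without ever unpacking the whole thing. A minimal sketch, assuming the deflines carry an organism tag such as "[Homo sapiens]" and that identical sequences share one header with deflines joined by a Ctrl-A character; the file names are hypothetical. A plain string match like this is cruder than a taxid-based restriction, so treat the result as approximate.

```python
import gzip
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def keep_organism(records: Iterable[Record],
                  tag: str = "[Homo sapiens]") -> Iterator[Record]:
    """Keep records whose (possibly merged) header mentions the organism tag."""
    for header, seq in records:
        if tag in header:
            yield header, seq


def filter_gzipped_fasta(src: str, dst: str, tag: str = "[Homo sapiens]") -> int:
    """Stream src (gzipped FASTA) and write matching records to dst."""
    kept = 0
    with gzip.open(src, "rt") as fin, open(dst, "w") as fout:
        for header, seq in keep_organism(read_fasta(fin), tag):
            fout.write(f">{header}\n{seq}\n")
            kept += 1
    return kept

# Hypothetical usage: filter_gzipped_fasta("nr.gz", "nr_human.faa")
```

The filtered file would still need to be indexed by whichever search tool is used before it can be searched.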
I didn't know that for a long time either, it's really cool that NCBI offers all that so accessibly.
Good luck, and let us know how it went!!
Edit: just want to mention, I think it is possible to restrict a search with diamond to a specific taxon just like in blast. Take a look at their help documentation and probably also these GitHub issues.
There is no point in downloading the fasta files since you will need to make the index yourself.
Using diamond or blast locally will require tens of GB of RAM (or it would be very slow if swap disks come into play). There are no simple solutions here. If you don't have the necessary hardware available locally, consider using a cloud environment.
I'd add to GenoMax's point that if you're going to run a search against NR on a local machine, it's probably unwise at this point to use blastp. Diamond or MMseqs2 under maximum sensitivity would be way, way faster at some loss of sensitivity (the latest version of Diamond is as sensitive as blast is but is ~80x faster).
Thanks Dunois, very useful links. On this forum I was re-directed to https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_protein.faa.gz for the human part of the 'nr. Non-redundant GenBank...' database. It is twice as big as the UniProt one and I wonder why.
RefSeq entries contain isoforms.
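A quick way to get a feel for how much of the size difference comes from isoforms is to tally the FASTA headers. This is only a rough sketch: it assumes isoform records mention the word "isoform" in their description line, which may not hold for every entry, and the headers in the test data are made up for illustration.

```python
import re
from collections import Counter
from typing import Iterable

ISOFORM = re.compile(r"\bisoform\b", re.IGNORECASE)


def count_isoform_headers(lines: Iterable[str]) -> Counter:
    """Tally FASTA headers by whether their description mentions 'isoform'."""
    counts: Counter = Counter()
    for line in lines:
        if line.startswith(">"):
            key = "isoform" if ISOFORM.search(line) else "other"
            counts[key] += 1
    return counts

# Hypothetical usage on the downloaded file:
#   import gzip
#   with gzip.open("GRCh38_latest_protein.faa.gz", "rt") as fh:
#       print(count_isoform_headers(fh))
```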