Question: Understanding NCBI identifiers
3.4 years ago by
tlorin260 wrote:

This is a general question about NCBI accession numbers.

Suppose I have this sequence


I want to BLAST it (using blastp against nr), restricting the search to salmon (Salmo salar). I get three roughly equivalent hits corresponding to three different IDs:

NP_001133180.1, CBL79147.1 and NP_001133181.1

I bet that there are not three different genes. So which sequence(s) should I consider the 'good' one(s)? The most recent? The 'NP' ones? I could not find any information on how NCBI assigns sequence identifiers (but see this). Many thanks for your advice!


In general you should use the RefSeq or Swiss-Prot databases for protein searches at NCBI, since they are likely to contain better-curated representatives.
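As a rough illustration of what the ID alone can tell you, here is a minimal sketch that sorts protein accessions by prefix. It assumes the usual conventions: RefSeq protein accessions carry a two-letter prefix plus underscore (NP_, XP_, YP_, WP_, AP_), while INSDC (GenBank/ENA/DDBJ) protein IDs are three letters followed by five or seven digits. The function name and the returned labels are made up for this example, and the prefix says nothing about whether two records describe the same gene.

```python
import re

# Prefix conventions are assumptions based on NCBI's accession format
# documentation; the function name and labels are illustrative only.
REFSEQ_PROTEIN = re.compile(r"(AP|NP|XP|YP|WP)_\d+")
INSDC_PROTEIN = re.compile(r"[A-Z]{3}(\d{5}|\d{7})")


def classify_accession(accession: str) -> str:
    """Return a coarse source label for a protein accession.version string."""
    base = accession.strip().split(".")[0]  # drop the ".1" version suffix
    if REFSEQ_PROTEIN.fullmatch(base):
        return "RefSeq protein"
    if INSDC_PROTEIN.fullmatch(base):
        return "INSDC protein (GenBank/ENA/DDBJ submission)"
    return "unknown"


if __name__ == "__main__":
    for acc in ["NP_001133180.1", "CBL79147.1", "NP_001133181.1"]:
        print(acc, "->", classify_accession(acc))
```

With the three hits from the question, the two NP_ records fall into the RefSeq group and CBL79147.1 into the submitter-deposited group, which is why it disappears when searching RefSeq only.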

written 3.4 years ago by genomax78k
3.4 years ago by
Cliff Beall450 wrote:

Those are almost the same protein sequence from the same organism, but not exactly. The differences might come from repeated (duplicated) genes, alternative splicing, variation between individuals, or even sequencing errors.

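nr is meant to be non-redundant, but only in the sense that identical sequences are merged: it will still have a separate entry for every distinct protein sequence that someone put into the databases, even if two of them differ by a single residue. You would need to follow up on the publications listed in the entries to track down exactly what is going on.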
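To make the point about non-redundancy concrete, here is a small sketch of collapsing by exact sequence identity. It is not NCBI's actual pipeline, only an illustration of the idea; the IDs and peptide strings below are invented toy data.

```python
from collections import defaultdict


def collapse_identical(records):
    """Group accession IDs whose sequences are exactly identical.

    records: iterable of (accession, sequence) pairs.
    Returns a dict mapping each distinct sequence to its list of accessions.
    """
    groups = defaultdict(list)
    for accession, sequence in records:
        groups[sequence.upper()].append(accession)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical toy records, not real database entries.
    toy = [
        ("hypA", "MKTLLVAGS"),
        ("hypB", "MKTLLVAGS"),  # identical to hypA, so merged with it
        ("hypC", "MKTLIVAGS"),  # one residue differs, so kept separate
    ]
    for seq, ids in collapse_identical(toy).items():
        print(seq, ids)
```

Near-identical sequences such as hypC survive this kind of collapsing, which is consistent with getting several almost-equal hits from a "non-redundant" database.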


I know that nr is supposed to be non-redundant (that's why I use it), but then why are there only 2 hits left (NP_001133180.1 and NP_001133181.1) when we BLAST the sequence against the RefSeq database? Which one should I trust 'in general'? It seems to me that everyone has a 'feeling' about this, but I cannot find any way of being sure based on the ID alone ;-) I agree that we can manually check the author statements, the contig IDs, etc. It's just that for many, many genes this is not feasible, and curation based on the ID to avoid redundant sequences should be possible :)

written 3.4 years ago by tlorin260

Those two RefSeq IDs have not been subjected to final NCBI review, so they may be collapsed into one entry after that point.
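If you want to check this for many records at once, the review status can usually be read from the COMMENT block of the GenPept flat file. The sketch below assumes the comment begins with a phrase like "PROVISIONAL REFSEQ:" or "REVIEWED REFSEQ:" and that the status words match the RefSeq status codes; the function name is hypothetical and the sample text is a hand-written illustration, not a downloaded record.

```python
import re

# Status keywords assumed from the RefSeq status code list.
STATUS_PATTERN = re.compile(
    r"\b(REVIEWED|VALIDATED|PROVISIONAL|PREDICTED|INFERRED|MODEL|WGS)\s+REFSEQ\b"
)


def refseq_status(flatfile_text):
    """Return the RefSeq status word found in a flat-file text, or None."""
    # Flat files wrap long lines, so normalise whitespace before matching.
    flattened = " ".join(flatfile_text.split())
    match = STATUS_PATTERN.search(flattened)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Illustrative COMMENT text written by hand for this example.
    sample = (
        "COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to\n"
        "            final NCBI review.\n"
    )
    print(refseq_status(sample))
```

A record reporting REVIEWED has been through curation, whereas PROVISIONAL records may still be merged or revised.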

written 3.4 years ago by genomax78k