Question: Understanding NCBI identifiers
20 months ago
tlorin wrote:

This is a fairly general question about NCBI accession numbers.

Suppose I have this sequence


I want to BLAST it (blastp against nr), restricting the search to salmon (Salmo salar). I get three roughly equivalent hits corresponding to three different IDs:

NP_001133180.1, CBL79147.1 and NP_001133181.1

I doubt that these are three different genes. So which sequence(s) should I consider the 'good' one(s)? The most recent? The 'NP' ones? I could not find any detailed information on how NCBI assigns sequence identifiers (but see this). Many thanks for your advice!


In general, you should use the RefSeq or Swiss-Prot databases for protein searches at NCBI, since they are more likely to contain well-curated representative sequences.

genomax
20 months ago
Cliff Beall wrote:

Those are almost the same protein sequence from the same organism but not exactly. It might be repeated genes, alternative splicing, variation between individuals, or even sequencing errors.

nr is meant to be non-redundant so it will have an entry for every different protein that someone put into the databases. You would need to follow up on the publications listed in the entries to track down exactly what is going on.
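To make the "non-redundant" point concrete: nr merges only sequences that are exactly identical, so proteins that differ by even one residue remain separate entries. Here is a minimal Python sketch of that idea. It is not NCBI's actual pipeline; the function name `collapse_identical` and the toy sequences are made up for illustration.

```python
from collections import defaultdict


def collapse_identical(records):
    """Group (accession, sequence) pairs by exact sequence identity.

    Hypothetical helper: it mimics the idea of merging identical
    sequences into one entry, nothing more.
    """
    groups = defaultdict(list)
    for accession, sequence in records:
        groups[sequence.upper()].append(accession)
    return dict(groups)


# Toy data, invented for illustration (not the real salmon proteins).
records = [
    ("id_A", "MKTLLVAG"),
    ("id_B", "MKTLLVAG"),  # identical to id_A: merged into one entry
    ("id_C", "MKTLLVSG"),  # one residue differs: stays a separate entry
]

entries = collapse_identical(records)
for sequence, accessions in entries.items():
    print(sequence, accessions)
```

Under this scheme, three near-identical hits for one query are expected whenever the deposited sequences are not letter-for-letter the same.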

Cliff Beall

I know that nr is supposed to be non-redundant (that's why I use it), but then why are there only 2 hits left (NP_001133180.1 and NP_001133181.1) when I BLAST the sequence against the RefSeq database? Which one should I trust 'in general'? It seems to me that everyone has a 'feeling' about this, but I cannot find any way of being sure based on the ID alone ;-) I agree that we can manually check the author statements, the contig IDs, etc. It's just that this is not feasible for many, many genes, and it should be possible to filter out redundant sequences based on the ID :)
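As a rough illustration of what the ID alone can tell you, here is a minimal Python sketch that sorts the three accessions by their format. The function name `describe_accession` and the short descriptions are my own. The regular expressions are a simplification of the published accession formats, and the prefix table covers only a few RefSeq protein prefixes.

```python
import re

# A few RefSeq protein prefixes (descriptions paraphrased, not exhaustive).
REFSEQ_PROTEIN_PREFIXES = {
    "NP": "RefSeq protein, associated with an NM_ or NC_ record",
    "XP": "RefSeq protein, predicted model",
    "YP": "RefSeq protein, annotated on a genomic record",
    "WP": "RefSeq non-redundant protein",
}


def describe_accession(accession):
    """Return (source, description) guessed from the accession format only."""
    match = re.match(r"^([A-Z]{2})_\d+\.\d+$", accession)
    if match and match.group(1) in REFSEQ_PROTEIN_PREFIXES:
        return ("RefSeq", REFSEQ_PROTEIN_PREFIXES[match.group(1)])
    # Submitter protein IDs: 3 letters + 5 (or 7) digits + version.
    if re.match(r"^[A-Z]{3}\d{5}(\d{2})?\.\d+$", accession):
        return ("INSDC", "protein ID from a GenBank/ENA/DDBJ submission")
    return ("unknown", "format not recognised by this sketch")


for acc in ["NP_001133180.1", "CBL79147.1", "NP_001133181.1"]:
    print(acc, describe_accession(acc))
```

This separates the two RefSeq records from the submitter record, which matches the two hits that remain in the RefSeq search. The prefix does not say how far a RefSeq record has been reviewed, though; that status is given in the record itself, not in the ID.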

tlorin

Those two RefSeq IDs have not yet been subjected to final NCBI review, so it is possible that they will be collapsed into one entry once that happens.

genomax
