Question: Understanding NCBI identifiers
14 months ago, tlorin (Switzerland) wrote:

This is a fairly general question about NCBI accession numbers.

Suppose I have this sequence:

>myseq
MGQ-----NSPNLLR------LSQ
--TLVGSSLLSSPSSPTTLKVKMPHAFPFLTPDQ-KKELSDIAHKIVAKGKGILAADES-
--TGSVAKRFQSINTENTEENRRLYRQLLFTA-DERAGPCIGGVIFFHETLYQKTDAGKT
FPEHVKSRGWVVGIKVDKGVVPLAGTN-GETTTQ---GLDGL--------YERCAQYKKD
GCDFAKWRCVLKITSTTPSRLAIMENCNVLARYASICQM--HGIVPIVEPEILPDGDHDL
KRTQYVTEKV-LAAMYKALSDHHVYLEGTLLKPNMVTAGHSCSHKYTHQDIAMATITALR
RTVPPAVPG--ITFLSGGQSEEEASINLNVMNQCPLHRPWAITFSYGRALQASALKAWGG
KPGNGKAAQEEFIKRAL------ANSLACQGKYVSSGN-S-A-AAGDSLFVANHAY

I want to BLAST it (blastp against nr), restricting the search to salmon (Salmo salar). I get three roughly equivalent hits with three different IDs:

NP_001133180.1, CBL79147.1 and NP_001133181.1

I doubt that these are three different genes. So which sequence(s) should I treat as the 'good' one(s)? The most recent? The 'NP' ones? I could not find any detailed information on how NCBI assigns sequence identifiers (but see this). Many thanks for your advice!

In general you should use the RefSeq or Swiss-Prot databases for protein searches at NCBI, since they are more likely to contain well-curated representative sequences.

written 14 months ago by genomax
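
To make the RefSeq versus non-RefSeq distinction concrete, here is a minimal Python sketch that sorts the three hits from the question by accession format. It assumes only the documented RefSeq protein prefixes (NP_, XP_, YP_, WP_) and the three-letter INSDC protein accession pattern; the function name `classify_accession` and the short descriptions are illustrative, not part of any NCBI tool.

```python
import re

# RefSeq protein prefixes; the descriptions are abbreviated paraphrases.
REFSEQ_PROTEIN_PREFIXES = {
    "NP_": "RefSeq protein tied to a curated transcript or genome record",
    "XP_": "RefSeq protein predicted by automated annotation (model)",
    "YP_": "RefSeq protein annotated directly on a genomic molecule",
    "WP_": "RefSeq non-redundant protein (mainly prokaryotes)",
}

# INSDC (GenBank/ENA/DDBJ) protein accessions: 3 letters + 5 or 7 digits.
INSDC_PROTEIN = re.compile(r"^[A-Z]{3}\d{5}(\d{2})?$")


def classify_accession(acc_version):
    """Return (source, description) for an accession.version string."""
    accession, _, _version = acc_version.partition(".")
    for prefix, description in REFSEQ_PROTEIN_PREFIXES.items():
        if accession.startswith(prefix):
            return "RefSeq", description
    if INSDC_PROTEIN.match(accession):
        return "INSDC", "protein from a GenBank/ENA/DDBJ submission"
    return "unknown", "format not covered by this sketch"


for hit in ["NP_001133180.1", "CBL79147.1", "NP_001133181.1"]:
    source, description = classify_accession(hit)
    print(hit, source, description, sep="\t")
```

The prefix only tells you which collection a record belongs to, not whether two records describe the same gene; that still requires comparing the sequences or the linked gene records.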
14 months ago, Cliff Beall (Ohio) wrote:

Those are nearly, but not exactly, identical protein sequences from the same organism. The differences could reflect duplicated genes, alternative splicing, variation between individuals, or even sequencing errors.

nr is non-redundant only in the sense that identical sequences are merged into one entry, so it has an entry for every distinct protein sequence that someone has put into the databases, even if two of them differ by a single residue. You would need to follow up on the publications listed in the entries to track down exactly what is going on.
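
As a rough illustration of what "non-redundant" means here, the following Python sketch groups records by exact sequence. The identifiers and sequences are hypothetical toy data, not real database entries, and `collapse_identical` is an illustrative helper rather than how NCBI actually builds nr.

```python
from collections import defaultdict


def collapse_identical(records):
    """Group (identifier, sequence) pairs that share the exact same sequence."""
    groups = defaultdict(list)
    for identifier, sequence in records:
        groups[sequence.upper()].append(identifier)
    return dict(groups)


# Hypothetical toy records, for illustration only.
toy_records = [
    ("id_A", "MKTLLVAG"),
    ("id_B", "MKTLLVAG"),  # identical to id_A, so it is merged with it
    ("id_C", "MKTLLVSG"),  # one residue differs, so it stays separate
]

collapsed = collapse_identical(toy_records)
for sequence, identifiers in collapsed.items():
    print(sequence, identifiers)
```

Only exact duplicates are merged; near-identical sequences such as the three salmon hits remain separate entries.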

I know that nr is supposed to be non-redundant (that's why I use it), but then why are only two hits left (NP_001133180.1 and NP_001133181.1) when I BLAST the sequence against the RefSeq database? Which one should I trust 'in general'? It seems to me that everyone has a 'feeling' about this, but I cannot find any way of being sure based on the ID alone ;-) I agree that we can manually check the author statements, the contig ID, etc. It's just that this is not feasible for a large number of genes, and it should be possible to filter out redundant sequences based on the ID :)

written 14 months ago by tlorin

Those two RefSeq IDs have not yet been subjected to final NCBI review, so it is possible that they will be collapsed into one entry once that review happens.

written 14 months ago by genomax
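
The review status mentioned above is stated in the COMMENT block of a RefSeq flat-file record (for example "PROVISIONAL REFSEQ:" or "REVIEWED REFSEQ:"). The Python sketch below pulls that keyword out of record text you have already downloaded. The function name `refseq_status` and the record fragment are hypothetical and made up for illustration; this is not the actual record for either hit.

```python
import re

# Status keywords that can precede "REFSEQ:" in the COMMENT block.
REFSEQ_STATUS = re.compile(
    r"^COMMENT\s+(REVIEWED|VALIDATED|PROVISIONAL|PREDICTED|MODEL|INFERRED)"
    r"\s+REFSEQ:",
    re.MULTILINE,
)


def refseq_status(flatfile_text):
    """Return the RefSeq status keyword from flat-file text, or None."""
    match = REFSEQ_STATUS.search(flatfile_text)
    return match.group(1) if match else None


# Hypothetical record fragment, for illustration only.
example_record = (
    "LOCUS       XX_000000     100 aa\n"
    "COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to\n"
    "            final NCBI review.\n"
)

print(refseq_status(example_record))
```

Checking this field for each hit gives an ID-independent hint about how much curation a record has received, which the accession prefix alone does not show.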