Question

Understanding NCBI identifiers

0

7.6 years ago

tlorin ▴ 360

This is a fairly general question about NCBI accession numbers.

Suppose I have this sequence

>myseq
MGQ-----NSPNLLR------LSQ
--TLVGSSLLSSPSSPTTLKVKMPHAFPFLTPDQ-KKELSDIAHKIVAKGKGILAADES-
--TGSVAKRFQSINTENTEENRRLYRQLLFTA-DERAGPCIGGVIFFHETLYQKTDAGKT
FPEHVKSRGWVVGIKVDKGVVPLAGTN-GETTTQ---GLDGL--------YERCAQYKKD
GCDFAKWRCVLKITSTTPSRLAIMENCNVLARYASICQM--HGIVPIVEPEILPDGDHDL
KRTQYVTEKV-LAAMYKALSDHHVYLEGTLLKPNMVTAGHSCSHKYTHQDIAMATITALR
RTVPPAVPG--ITFLSGGQSEEEASINLNVMNQCPLHRPWAITFSYGRALQASALKAWGG
KPGNGKAAQEEFIKRAL------ANSLACQGKYVSSGN-S-A-AAGDSLFVANHAY

I want to BLAST it (blastp against nr) restricted to salmon (Salmo salar). I get three roughly equivalent hits with three different IDs:

NP_001133180.1, CBL79147.1 and NP_001133181.1

I bet these are not three different genes. So which sequence(s) should I consider the 'good' one(s)? The most recent? The 'NP' ones? I could not find any detailed information on how NCBI assigns sequence identifiers (but see this). Many thanks for your advice!

ncbi id • 1.8k views

2

In general you should use the RefSeq or Swiss-Prot databases for protein searches at NCBI, since they are more likely to contain well-curated representative sequences.
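As a rough illustration of telling RefSeq hits apart from other nr entries by accession alone, here is a minimal Python sketch. It assumes that RefSeq accessions have a two-letter prefix followed by an underscore (such as NP_), and that INSDC (GenBank/ENA/DDBJ) protein IDs are three letters followed by five or seven digits. The function name `classify_accession` and the regular expressions are hypothetical, written for this example, and are not part of any NCBI tool.

```python
import re

# Assumed RefSeq pattern: two-letter prefix, underscore, digits, optional version.
REFSEQ_RE = re.compile(r"^(?P<prefix>[A-Z]{2})_\d+(\.\d+)?$")
# Assumed INSDC protein ID pattern: three letters, 5 or 7 digits, optional version.
INSDC_PROTEIN_RE = re.compile(r"^[A-Z]{3}\d{5}(\d{2})?(\.\d+)?$")


def classify_accession(acc):
    """Return a coarse source label for a protein accession (hypothetical helper)."""
    match = REFSEQ_RE.match(acc)
    if match:
        return "RefSeq " + match.group("prefix")
    if INSDC_PROTEIN_RE.match(acc):
        return "INSDC"
    return "unknown"


# The three hits from the question.
hits = ["NP_001133180.1", "CBL79147.1", "NP_001133181.1"]

# Keep only the hits whose accession looks like a RefSeq record.
refseq_hits = [h for h in hits if classify_accession(h).startswith("RefSeq")]

for h in hits:
    print(h, "->", classify_accession(h))
```

This only sorts hits by where the record comes from. It says nothing about whether two RefSeq records represent the same gene.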

ADD REPLY • link 7.6 years ago by GenoMax 141k

score 0 · Answer 1 · 2016-09-13

0

7.6 years ago

Cliff Beall ▴ 470

Those are nearly, but not exactly, identical protein sequences from the same organism. The differences might come from duplicated genes, alternative splicing, variation between individuals, or even sequencing errors.

nr is non-redundant only in the sense that identical sequences are merged, so it keeps a separate entry for every distinct protein sequence that someone has put into the databases. You would need to follow up on the publications listed in the entries to track down exactly what is going on.
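To make the point concrete, here is a small Python sketch of collapsing by exact identity, which is the kind of non-redundancy described above. The function `collapse_identical` and the toy sequences are made up for illustration; they are not the real salmon proteins and not NCBI's actual pipeline.

```python
from collections import defaultdict


def collapse_identical(records):
    """Group (accession, sequence) pairs whose sequences are exactly identical."""
    groups = defaultdict(list)
    for acc, seq in records:
        groups[seq.upper()].append(acc)
    # One list of accessions per distinct sequence, in first-seen order.
    return list(groups.values())


# Toy records, invented for this example.
records = [
    ("toy_A", "MPHAFPFLTPDQKK"),
    ("toy_B", "MPHAFPFLTPDQKK"),  # identical to toy_A, so merged with it
    ("toy_C", "MPHAFPFLSPDQKK"),  # one substitution, so kept as its own entry
]

clusters = collapse_identical(records)
print(clusters)
```

A single amino-acid difference is enough to keep two entries apart, which is why near-identical hits can all show up in the same search.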

0

I know that nr is supposed to be non-redundant (that's why I use it), but then why are only 2 hits left (NP_001133180.1 and NP_001133181.1) when I BLAST the sequence against the RefSeq database? Which one should I trust 'in general'? It seems to me that everyone has a 'feeling' about this, but I cannot find any way of being sure based on the ID alone ;-) I agree that we can manually check author statements, contig IDs, etc. It's just that for many, many genes this is not feasible, and ID-based curation to avoid redundant sequences should be possible :)

ADD REPLY • link 7.6 years ago by tlorin ▴ 360

0

Those two RefSeq IDs have not yet been through final NCBI review, so they may be collapsed into one entry once that happens.

ADD REPLY • link 7.6 years ago by GenoMax 141k