Question

GenBank, NCBI RefSeq, or UniProt Protein Sequences

7

12.6 years ago

Woa ★ 2.9k

I have to construct a protein database for a sequenced organism, to be used in a proteomics search. Which of GenBank, NCBI RefSeq, and UniProtKB is the best source of protein sequences for this purpose?

Thanks

WoA

proteomics protein refseq uniprot • 16k views

ADD COMMENT • link updated 12.6 years ago by Craig ▴ 30 • written 12.6 years ago by Woa ★ 2.9k

score 7 · Answer 1 · 2011-09-12

7

12.6 years ago

Martijn Van Iersel ▴ 570

UniProtKB contains the richest, most accurate, highest-quality data. GenBank contains raw data; it can be very redundant, and you might have to do a lot of filtering yourself. RefSeq is not as richly annotated, but at least it contains only non-redundant sequences.
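As a rough illustration of the kind of filtering a redundant source might need, here is a minimal sketch that collapses entries with identical sequences. It assumes plain FASTA input; the entry names and sequences are made up for the example, and a real pipeline would probably also want to handle near-identical sequences and fragments.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def collapse_identical(records):
    """Keep only the first record seen for each distinct sequence (case-insensitive)."""
    unique = {}
    for header, seq in records:
        unique.setdefault(seq.upper(), (header, seq))
    return list(unique.values())


# Hypothetical toy input: entry_A and entry_B carry the same sequence.
example = """>entry_A hypothetical protein
MKTAYIAKQR
>entry_B same protein, second submission
MKTAYIAKQR
>entry_C different protein
MSDNELLQ
""".splitlines()

for header, seq in collapse_identical(read_fasta(example)):
    print(header, seq)
```

Only exact duplicates are removed here; which header to keep for a duplicate group is a design choice, and this sketch simply keeps the first one encountered.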

So my first choice would be UniProtKB, second RefSeq, and third GenBank. But it also depends on whether the organism you're interested in has sufficient data in each resource.

Would you care to share which organism you're interested in?

ADD COMMENT • link 12.6 years ago by Martijn Van Iersel ▴ 570

0

Many thanks!!! Can somebody tell me what the difference is between the UniProt "Complete Proteome set" and the combined reviewed (UniProtKB/Swiss-Prot) and unreviewed (UniProtKB/TrEMBL) entries?

For some organisms the difference is negligible, but for others I have so far seen the two differ by around 100 entries.
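One way to see exactly which entries account for such a difference is to compare the accessions in the two downloads. The sketch below assumes both files are UniProt-style FASTA, where the header has the form `>db|ACCESSION|ENTRY_NAME ...`; the accessions and sequences shown are invented placeholders.

```python
def uniprot_accessions(fasta_lines):
    """Collect accessions from UniProt-style headers: >db|ACCESSION|ENTRY_NAME ..."""
    accessions = set()
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split("|")
            if len(parts) >= 3:
                accessions.add(parts[1])
    return accessions


# Hypothetical stand-ins for the two downloaded files.
proteome_set = [
    ">sp|P00001|AAA_EXAMP Protein A",
    "MKT",
    ">tr|Q00002|Q00002_EXAMP Protein B",
    "MSD",
]
combined = proteome_set + [
    ">tr|Q00003|Q00003_EXAMP Protein C",
    "MAA",
]

only_in_combined = uniprot_accessions(combined) - uniprot_accessions(proteome_set)
print(sorted(only_in_combined))  # ['Q00003']
```

With real files you would pass open file handles instead of the lists, and then look up the accessions that appear in only one set.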

ADD REPLY • link 12.6 years ago by Woa ★ 2.9k

score 4 · Answer 2 · 2011-09-12

4

12.6 years ago

Chris Evelo 10k

You can find the answer to your second question ("what is the difference between the UniProt "Complete Proteome set" and the combined reviewed (UniProtKB/Swiss-Prot) and unreviewed (UniProtKB/TrEMBL) entries?") on the UniProt homepage:

Swiss-Prot, which is manually annotated and reviewed.
TrEMBL, which is automatically annotated and is not reviewed.

UniProt is really a combination of two resources: Swiss-Prot and TrEMBL.

Swiss-Prot is a high-quality database of real proteins because it is heavily curated. In fact, it is one of the oldest databases we have, and it is maintained by real protein experts.

TrEMBL, on the other hand, is not a database of real proteins at all. It is a database of translated nucleotide sequences from EMBL (hence TrEMBL). These may well not exist in real biology, or may simply be wrongly translated (missing an exon, for example). The two were combined for practical reasons, but it is good to be aware of the difference.
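To check how much of a downloaded set comes from each of the two parts, you can count headers by their database prefix. This is a small sketch assuming UniProt-style FASTA headers, where `sp` marks Swiss-Prot entries and `tr` marks TrEMBL entries; the entries themselves are made up.

```python
from collections import Counter


def count_by_section(fasta_lines):
    """Count FASTA entries by the database prefix before the first '|'."""
    counts = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            prefix = line[1:].split("|", 1)[0]
            counts[prefix] += 1
    return counts


# Hypothetical headers; sequences omitted because only headers are inspected.
headers = [
    ">sp|P00001|AAA_EXAMP Reviewed protein A",
    ">tr|Q00002|Q00002_EXAMP Unreviewed protein B",
    ">tr|Q00003|Q00003_EXAMP Unreviewed protein C",
]

counts = count_by_section(headers)
print(counts["sp"], counts["tr"])  # 1 2
```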

ADD COMMENT • link 12.6 years ago by Chris Evelo 10k

0

Thanks for your answer. I think I should fetch the "Complete Proteome set" whenever it is available for the organism. However, the "complete proteome" contains only the canonical sequences and not all splice variants. Is there any way to get all the splice variants?

ADD REPLY • link 12.6 years ago by Woa ★ 2.9k

0

When you go to download the FASTA (assuming that is what you are using), e.g. http://www.uniprot.org/uniprot/?query=organism%3a9606+keyword%3a181&format=*, you get a choice to download the canonical sequence data, or canonical and isoform sequence data. The latter presumably includes splice variants as separate protein entries.
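To confirm what a "canonical and isoform" download actually contains, you can group entries by their base accession. The sketch below relies on the assumption that isoform entries in UniProt FASTA carry a numeric suffix on the accession (for example `P11111-2`), while canonical entries have none; the accessions here are invented.

```python
import re

# Assumed isoform accession pattern: base accession, a hyphen, then the isoform number.
ISOFORM_RE = re.compile(r"^(?P<base>[A-Z0-9]+)-(?P<num>\d+)$")


def group_isoforms(fasta_lines):
    """Map each base accession to the list of accessions (canonical and isoform) seen for it."""
    groups = {}
    for line in fasta_lines:
        if not line.startswith(">"):
            continue
        parts = line[1:].split("|")
        if len(parts) < 3:
            continue
        accession = parts[1]
        match = ISOFORM_RE.match(accession)
        base = match.group("base") if match else accession
        groups.setdefault(base, []).append(accession)
    return groups


# Hypothetical headers from a canonical-plus-isoform file.
headers = [
    ">sp|P11111|AAA_EXAMP Protein A",
    ">sp|P11111-2|AAA_EXAMP Isoform 2 of Protein A",
    ">sp|P11111-3|AAA_EXAMP Isoform 3 of Protein A",
    ">sp|Q22222|BBB_EXAMP Protein B",
]

for base, members in group_isoforms(headers).items():
    print(base, len(members), members)
```

Proteins with more than one member in their group are the ones contributing extra isoform sequences to the search database.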

ADD REPLY • link updated 4.6 years ago by Ram 43k • written 12.6 years ago by Craig ▴ 30

score 2 · Answer 3 · 2011-09-12

2

12.6 years ago

Larry_Parnell 16k

What I would like to see is data that can be linked to mRNA isoforms. RefSeq allows this. GenBank would be noisy, as Martijn says. The mRNA isoforms can be important because they are expressed at different levels depending on cell type, temporal patterns (circadian, developmental), and responses to stimuli. These points could be quite critical to the design of the experiment whose data you'll now analyze, or to the hypotheses addressed.

ADD COMMENT • link 12.6 years ago by Larry_Parnell 16k

0

Thanks!! I'll look into it.

ADD REPLY • link 12.6 years ago by Woa ★ 2.9k

score 0 · Answer 4 · 2011-09-28

0

12.6 years ago

Craig ▴ 30

For mass spectrometry–based proteomics, the International Protein Index (IPI, http://www.ebi.ac.uk/IPI/IPIhelp.html) has been a popular choice for common organisms. For some reason it doesn't cover yeast, but the Saccharomyces Genome Database (SGD, http://www.yeastgenome.org/) fills in nicely there. However, IPI is closing soon, and its maintainers recommend UniProt complete proteome sets (http://www.uniprot.org/faq/15) as a replacement. Overall, UniProt seems to provide good information for pretty much any organism, even one without a complete proteome set yet, and it is definitely the most extensive, so I would recommend just going there for everything.

ADD COMMENT • link 12.6 years ago by Craig ▴ 30

0

Thanks Craig, can you tell me where NCBI NR stands compared to, say, UniProt? Is it less annotated and more redundant (even though they call it NR)?

ADD REPLY • link 12.6 years ago by Woa ★ 2.9k

0

Unfortunately I have never used NR so I can't answer this question.

ADD REPLY • link 12.6 years ago by Craig ▴ 30

0

The NCBI nr protein database is explained here: http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=ProgSelectionGuide

ADD REPLY • link 12.1 years ago by Lhl ▴ 760