Question: Finding Organism Protein Databases
2.3 years ago by
I need to assemble some model sequences of my protein of interest so that I have a control to compare against other sequences. I would like a broad swath of Eukaryotes and Bacteria (at least 12 or 13 eukaryotic species and 5 or 6 bacterial species), with multiple versions of the protein from each species (e.g. Human Protein_A 1.1, Human Protein_A 1.2, Human Protein_A 2.3, Zebrafish Protein_A 1.1, etc.). I know this is easy to do with NCBI or UniProt, but my mentor wants me to get the proteins from organism-specific databases. I didn't think this would be a problem until I got about 4 organisms deep.

Human, mouse, and zebrafish were pretty easy to find, but beyond those I'm having a lot of trouble finding sites that give me amino acid sequences for my protein (a very common protein). Even the yeast genome website, which I was told would be a goldmine, has been useless to me.

Basically, is there an easy way to find organism databases, or at least to find sequences that link back to an organism database? UniProt kind of does this, but after 20 hours of searching over the weekend I gave up on it.

modified 20 months ago by Jean-Karim Heriche (18k) • written 2.3 years ago by kgbenn12310

Have you looked at homologene or protein clusters from NCBI?

written 2.3 years ago by genomax (67k)

I have never used these (and I will give them both a shot tonight), but I assume I will still be going through every sequence to see if it is cited back to an organism-specific database? Or do these have some feature that allows you to narrow your search that way?

EDIT: Homologene was incredibly helpful! I still have the problem of it not coming from an organism-specific website, but these seqs seem to be of pretty high quality and I can at least review the submitters. Thanks!

modified 2.3 years ago • written 2.3 years ago by kgbenn12310

Please use ADD REPLY/ADD COMMENT when responding to existing posts to keep threads logically organized.

Since we don't know what organisms you are interested in, it is hard to tell whether organism-specific databases are available. BTW: that is an odd requirement from your mentor and may not be satisfiable in all instances. If you can find the protein at NCBI (e.g. RefSeq) or at UniProt (Swiss-Prot), then that should be good enough evidence that the sequence is real, since both of these are manually curated databases.
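If the link back to an organism database still matters, one option is to read the cross-reference (DR) lines of a UniProt flat-file entry, which name external databases such as SGD, MGI or ZFIN. The following is only a rough sketch of that idea: the sample entry, its identifiers, the function names and the list of database abbreviations are all illustrative assumptions, and the download URL pattern is assumed to be `https://rest.uniprot.org/uniprotkb/<accession>.txt`.

```python
import urllib.request

# Assumed subset of DR-line database names for model organism resources.
MODEL_ORGANISM_DBS = {
    "SGD", "MGI", "ZFIN", "FlyBase", "WormBase",
    "TAIR", "RGD", "Xenbase", "PomBase", "dictyBase",
}

# Made-up entry fragment with placeholder identifiers (not real data).
SAMPLE_ENTRY = """\
ID   EXAMPLE_YEAST           Reviewed;         100 AA.
DR   RefSeq; NP_000000.1; NM_000000.1.
DR   SGD; S000000000; GENE1.
"""


def fetch_flat_file(accession):
    """Download one entry in flat-file format (requires network access)."""
    url = f"https://rest.uniprot.org/uniprotkb/{accession}.txt"
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def organism_db_xrefs(flat_text, wanted=MODEL_ORGANISM_DBS):
    """Return (database, identifier) pairs from DR lines for the wanted databases."""
    xrefs = []
    for line in flat_text.splitlines():
        if not line.startswith("DR   "):
            continue
        # A DR line looks like "DR   <database>; <identifier>; <more>."
        fields = [f.strip() for f in line[5:].rstrip(".").split(";")]
        if len(fields) >= 2 and fields[0] in wanted:
            xrefs.append((fields[0], fields[1]))
    return xrefs


if __name__ == "__main__":
    print(organism_db_xrefs(SAMPLE_ENTRY))
```

An entry with no hit in the wanted set simply returns an empty list, which gives a quick way to see which species have no organism-specific database recorded for the protein.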

modified 2.3 years ago • written 2.3 years ago by genomax (67k)

I'm not even interested in any "specific" organisms, just a good cross-section of Domains. I totally agree though about RefSeq and UniProt. I think the problem is that they may not be the MOST up-to-date on physiology or...I have no idea. I'm just trying to put together diverse, quality seqs that I can use as a control for statistical analysis of unknown seqs.

Anyways, thanks for the help. I'll probably ask my mentor to clarify again or give me a hint of where to find these elusive organism-specific protein databases.

written 2.3 years ago by kgbenn12310
20 months ago by
EMBL Heidelberg, Germany
A bit late maybe, but I wonder why nobody mentioned Ensembl and Ensembl Genomes. The added advantage over other resources is that, regardless of organism, one can use the same API, so there is no need to write a separate script for each organism.
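To illustrate the "same API for every organism" point, here is a minimal sketch against the Ensembl REST service. It assumes the `lookup/symbol` and `sequence/id` endpoints at `https://rest.ensembl.org`, and that the species of interest are served from that host. The function names, species names and gene symbol are placeholders, and the gene symbol will often differ between species, so treat this as a starting point rather than a finished script.

```python
import json
import urllib.parse
import urllib.request

ENSEMBL_REST = "https://rest.ensembl.org"  # assumed base URL


def lookup_url(species, symbol):
    """URL that resolves a gene symbol to an Ensembl gene record."""
    return (f"{ENSEMBL_REST}/lookup/symbol/"
            f"{urllib.parse.quote(species)}/{urllib.parse.quote(symbol)}")


def protein_fasta_url(gene_id):
    """URL for all protein products of a gene (one sequence per translation)."""
    query = urllib.parse.urlencode({"type": "protein", "multiple_sequences": 1})
    return f"{ENSEMBL_REST}/sequence/id/{urllib.parse.quote(gene_id)}?{query}"


def http_get(url, content_type):
    """GET a URL, asking the server for the given content type."""
    request = urllib.request.Request(url, headers={"Content-Type": content_type})
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def proteins_for(species, symbol):
    """Return FASTA text with every protein isoform for one gene symbol."""
    gene = json.loads(http_get(lookup_url(species, symbol), "application/json"))
    return http_get(protein_fasta_url(gene["id"]), "text/x-fasta")


if __name__ == "__main__":
    # Placeholder species and symbol; requires network access.
    for species in ("homo_sapiens", "danio_rerio"):
        print(proteins_for(species, "TP53"))
```

The same two functions are called for every species, so adding an organism only means adding a name to the loop. Requesting multiple sequences per gene is what returns the several protein versions per species mentioned in the question.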

written 20 months ago by Jean-Karim Heriche (18k)
