Downloading Human And Other Completely Sequenced Proteomes To Search Homologs
10.1 years ago
Pappu ★ 2.1k

I want to download the human and other completely sequenced proteomes in order to search for homologs. A UniProt search returns ~136,500 sequences for human:

http://www.uniprot.org/uniprot/?query=taxonomy%3A9606&sort=score

Searching a query protein against these sequences returns an implausibly large number of human homologs. Filtering with CD-HIT at 90% sequence identity does not reduce the number of hits much. The ~20,000 reviewed human entries do not include all human proteins. I am wondering whether Ensembl would be a better choice.

uniprot • 2.2k views
10.1 years ago

See also this FAQ: What is the human complete proteome? http://www.uniprot.org/faq/48

10.1 years ago
hpmcwill ★ 1.2k

See the UniProt complete and reference proteome sets for a more appropriate set for this kind of search. While UniProtKB contains 136,536 entries describing human proteins, the corresponding reference proteome set contains 68,756 entries (see http://www.uniprot.org/taxonomy/9606).
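
A minimal sketch of fetching such a proteome set programmatically. It assumes the current UniProt REST "stream" endpoint and its `proteome:` and `reviewed:` query fields, which are not part of the original answer (the URLs in this thread use the older website layout). `UP000005640` is the UniProt proteome identifier for human; the function names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlretrieve

# Assumption: current UniProt REST streaming endpoint (not from the original post).
UNIPROT_STREAM = "https://rest.uniprot.org/uniprotkb/stream"


def proteome_fasta_url(proteome_id, reviewed_only=False):
    """Build a FASTA download URL for all UniProtKB entries in one proteome."""
    query = f"proteome:{proteome_id}"
    if reviewed_only:
        # Restrict to reviewed (Swiss-Prot) entries only.
        query += " AND reviewed:true"
    return UNIPROT_STREAM + "?" + urlencode({"query": query, "format": "fasta"})


def download_proteome(proteome_id, dest, reviewed_only=False):
    """Download the proteome FASTA to `dest` (requires network access)."""
    urlretrieve(proteome_fasta_url(proteome_id, reviewed_only), dest)
    return dest
```

Calling `download_proteome("UP000005640", "human_proteome.fasta")` should then save the human set locally; other organisms only need a different proteome identifier.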


I am aware of that. As far as I know, human has <30k protein sequences excluding alternative splicing. Ensembl seems to have ~100k human CDS.

10.1 years ago
Biojl ★ 1.7k

You can download that data from Ensembl. Take the transcript_biotype or gene_biotype tag into account. For human, if you select only gene_biotype=protein_coding, you end up with 22,836 transcripts in version 75 (BioMart).
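
As a rough illustration of using those tags outside BioMart, the sketch below filters an Ensembl peptide FASTA by `gene_biotype` and keeps one protein per gene. It assumes the header layout of Ensembl's `pep` FASTA files, where each header carries `gene:` and `gene_biotype:` fields. Keeping the longest isoform per gene is a common heuristic chosen here for illustration, not something the answer above prescribes, and the function names are hypothetical.

```python
import re


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_protein_per_gene(lines, biotype="protein_coding"):
    """Map gene ID -> (header, sequence), keeping the longest protein per gene.

    Records whose gene_biotype differs from `biotype`, or whose header lacks
    the expected fields, are skipped.
    """
    best = {}
    for header, seq in parse_fasta(lines):
        gene = re.search(r"\bgene:(\S+)", header)
        gene_biotype = re.search(r"\bgene_biotype:(\S+)", header)
        if not gene or not gene_biotype or gene_biotype.group(1) != biotype:
            continue
        gene_id = gene.group(1)
        if gene_id not in best or len(seq) > len(best[gene_id][1]):
            best[gene_id] = (header, seq)
    return best
```

For a real file, pass an open file handle, for example `longest_protein_per_gene(open("Homo_sapiens.pep.all.fa"))`, where the file name is only a placeholder for whichever Ensembl release you downloaded.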
