What Is The Species Distribution Of UniProt
3
0
9.3 years ago
mtyler.jason ▴ 120

I want to know about the species distribution of UniProt. How many human proteins does UniProt have? How many are from other species? Is there any way to get this information for the whole UniProt protein database?

uniprot protein • 2.9k views
2
9.3 years ago
$ curl -s "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz" |\
gunzip -c | grep -E '^OS ' | cut -c6- | sort | uniq -c | sort -n

(...)
4127 Dictyostelium discoideum (Slime mold).
4185 Bacillus subtilis (strain 168).
4431 Escherichia coli (strain K12).
5097 Schizosaccharomyces pombe (strain 972 / ATCC 24843) (Fission yeast).
5983 Bos taurus (Bovine).
6621 Saccharomyces cerevisiae (strain ATCC 204508 / S288c) (Baker's yeast).
7875 Rattus norvegicus (Rat).
12545 Arabidopsis thaliana (Mouse-ear cress).
16642 Mus musculus (Mouse).
20273 Homo sapiens (Human).
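
A possible variation, offered only as a sketch: in the UniProtKB flat-file format, long organism names can continue over more than one `OS` line, so counting `OS` lines may split or over-count some organisms. Counting the `OX` line (the NCBI taxonomy identifier) avoids that, assuming each entry carries a single `OX   NCBI_TaxID=...` line. The function name `count_by_taxid` is made up for this example, and the output is keyed by taxid rather than organism name.

```shell
#!/usr/bin/env bash
# Sketch: count UniProtKB flat-file entries per NCBI taxonomy identifier.
# Reads flat-file text on stdin.
# Assumption: every entry has one line of the form "OX   NCBI_TaxID=<digits>...".
count_by_taxid() {
  grep -E '^OX   NCBI_TaxID=' |
    sed -E 's/^OX   NCBI_TaxID=([0-9]+).*/\1/' |  # keep only the numeric taxid
    sort | uniq -c | sort -n                      # count per taxid, smallest first
}

# Example use with the same download as the command above (needs network access):
# curl -s "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz" |
#   gunzip -c | count_by_taxid
```

Each output line is a count followed by a taxid (9606 is Homo sapiens), which can then be looked up in the taxonomy.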

1
9.3 years ago
Hamish ★ 3.2k

A place to start is the UniProt statistics pages:

These include details of the taxonomic distribution of the current UniProtKB entries.

0
9.2 years ago
Jerven ▴ 650

UniProt's browse-by-taxonomy view is a way to explore the taxonomic distribution of all of UniProtKB. However, because UniProt uses the NCBI taxonomy, there are things in there that can surprise the unaware biologist. For example, Homo sapiens has two subspecies, neanderthalensis and ssp. Denisova (don't ask me why, it just is...). Another is that, up to now, there has been basically a one-to-one mapping between taxid and genome project for bacterial species/strains/subspecies, which is going to change soon.

1

UniProt uses a modified version of the NCBI Taxonomy (see UniProt Taxonomy) which:

• Uses an alternative authority for some taxa. Different scientific and common names are therefore used for those taxa in UniProt, and the names used in NCBI Taxonomy (and thus in INSDC) are handled as synonyms.
• Adds taxa. Since UniProt receives submissions of protein sequences, it sometimes has to provide a taxonomy node before the organism is available in the NCBI Taxonomy.

The taxonomy identifiers (e.g. 9606 for Homo sapiens) should be consistent between the two taxonomies, so mapping between them should be simple.

The handling of archaeological taxa is always a matter of conjecture, since any classification is based on limited information and is subject to change as more examples are discovered and examined. In the case of early humans, it is unclear what the evolutionary relationships are, since few examples are known (see Homo (genus)). For the moment, NCBI Taxonomy has placed Denisova and Neanderthal as subspecies, presumably because this makes certain types of searches and analysis easier (e.g. using Homo sapiens to provide a reference genome). As the sequence data for these species improves, this positioning will likely change to incorporate the new information.