I want to know about the species distribution of UniProt. How many human proteins does UniProt have? How many from other species? Is there any way to get this information for the whole UniProt protein database?
$ curl -s "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz" |\
gunzip -c | grep -E '^OS ' | cut -c6- | sort | uniq -c | sort -n
(...)
4127 Dictyostelium discoideum (Slime mold).
4185 Bacillus subtilis (strain 168).
4431 Escherichia coli (strain K12).
5097 Schizosaccharomyces pombe (strain 972 / ATCC 24843) (Fission yeast).
5983 Bos taurus (Bovine).
6621 Saccharomyces cerevisiae (strain ATCC 204508 / S288c) (Baker's yeast).
7875 Rattus norvegicus (Rat).
12545 Arabidopsis thaliana (Mouse-ear cress).
16642 Mus musculus (Mouse).
20273 Homo sapiens (Human).
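A possible variant of the pipeline above, as a rough sketch only: counting by NCBI taxonomy identifier instead of by organism name, which avoids splitting counts when a long organism name wraps over several OS lines. It assumes the flat file carries one line of the form `OX   NCBI_TaxID=<number>...` per entry; the function name `count_by_taxid` and the local file name in the usage comment are made up for illustration.

```shell
# count_by_taxid: read UniProtKB flat-file text on stdin and print
# "count taxid" pairs, one per line, least frequent taxid first.
# Assumption: each entry has exactly one "OX   NCBI_TaxID=<digits>" line.
count_by_taxid() {
  grep -E '^OX   NCBI_TaxID=' |
    # keep only the numeric identifier, dropping ";" and anything after it
    sed -E 's/^OX   NCBI_TaxID=([0-9]+).*/\1/' |
    sort | uniq -c | sort -n |
    # normalise the padded "uniq -c" output to "count taxid"
    awk '{print $1, $2}'
}

# Usage, assuming a previously downloaded copy (hypothetical local path):
#   gunzip -c uniprot_sprot.dat.gz | count_by_taxid | tail
```

The output is keyed by taxid rather than by name, so human entries would show up under 9606 and mouse under 10090.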
A good place to start is the UniProt statistics pages, which include details of the taxonomic distribution of the current UniProtKB entries.
UniProt's browse-by-taxonomy view is another way to explore the taxonomic distribution for all of UniProtKB. However, because UniProt uses the NCBI taxonomy, there are things in there that can surprise the unaware biologist. For example, Homo sapiens has two subspecies, neanderthalensis and ssp. Denisova (don't ask me why, it just is...). Another is that, up to now, there was basically a one-to-one mapping between taxid and genome project for bacterial species/strains/subspecies, which is going to change soon.