Question: Number of protein-coding genes per organism given the TaxID
2.8 years ago by
bordin89 wrote:


I am currently trying to find the number of protein-coding genes for each of a list of 16K proteomes retrieved from UniProtKB. What I would like to achieve is a table like this:

UniProt TaxID   Organism                      Protein numbers   Protein-coding genes
83333           Escherichia coli strain K12   -------           ---------

I am able to fetch this kind of information for some of the proteomes using EFetch, but it only works if the TaxID is the same in UniProt and GenBank (like 9606 for human). E. coli, for example, is problematic: in the NCBI Taxonomy, TaxID 83333 groups a collection of E. coli K12 strains, and the EFetch output has no genes associated with that TaxID itself. Parsing the EFetch output by organism name is a pain, because UniProt and GenBank also differ slightly in organism names (E. coli strain K12 in UniProt, E. coli str. K-12 in GenBank), and almost every proteome name has a slight (but different every time) variation.
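To illustrate the name mismatch described above, here is a minimal Python sketch of a heuristic normalizer. The function name `normalize_organism_name` and the list of rank words it strips are assumptions made for illustration, not part of any UniProt or NCBI tool, and a heuristic like this will still miss many of the variations seen in real data.

```python
import re

# Rank words that differ between sources ("strain" vs "str.").
# This list is an assumption and is certainly incomplete.
_RANK_WORDS = re.compile(r"\b(strain|str|substrain|substr)\b\.?", re.IGNORECASE)


def normalize_organism_name(name: str) -> str:
    """Return a crude comparison key for an organism name."""
    # Drop rank words such as "strain" or "str."
    name = _RANK_WORDS.sub(" ", name)
    # Lowercase and remove punctuation, so "K-12" and "K12" compare equal
    name = re.sub(r"[^0-9a-z ]+", "", name.lower())
    # Collapse runs of whitespace
    return " ".join(name.split())


uniprot_name = "Escherichia coli strain K12"
genbank_name = "Escherichia coli str. K-12"
same = normalize_organism_name(uniprot_name) == normalize_organism_name(genbank_name)
```

With the two names from this post, both sides reduce to the same key, so `same` is `True`. Matching on TaxIDs is likely to be more robust than matching on names whenever a usable TaxID mapping exists.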

Do you have any suggestions on how to achieve this?

Thank you.


How about this one:

Reply written 2.8 years ago by a.zielezinski (8.7k)
2.8 years ago by
EagleEye (6.4k) wrote:

Example to get all genes for Drosophila melanogaster:

'dme' is the organism code. You can get the organism codes from the following link (second column):

T00007  eco   Escherichia coli K-12 MG1655  Prokaryotes;Bacteria;Gammaproteobacteria - Enterobacteria;Escherichia
T00068  ecj   Escherichia coli K-12 W3110   Prokaryotes;Bacteria;Gammaproteobacteria - Enterobacteria;Escherichia
T00666  ecd   Escherichia coli K-12 DH10B   Prokaryotes;Bacteria;Gammaproteobacteria - Enterobacteria;Escherichia
T00913  ebw   Escherichia coli BW2952       Prokaryotes;Bacteria;Gammaproteobacteria - Enterobacteria;Escherichia
T02541  ecok  Escherichia coli K-12 MDS42

Note: The above gene list contains only the genes that have KEGG functions.

2.8 years ago by
bordin89 wrote:

Thanks for the reply, but that doesn't suit what I was looking for, since many organisms are not present in KEGG. I guess the main issue is that UniProt usually groups an entire subgroup of organisms under one TaxID (like 83333 for E. coli K12) with one non-redundant proteome, whereas if you dump the NCBI Gene DB using a query like

"all[Filter] AND ("source_genomic"[properties] AND (gene_nucleotide_pos[filter] AND "genetype protein coding"[Properties]) AND alive[prop])"

in an EFetch script, it will recover all the E. coli genes associated with their strain- or version-level TaxIDs, unfortunately not with 83333.

Can I modify the query somehow? Or is there a better way around this?
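One possible modification, sketched in Python under stated assumptions: Entrez queries accept a `txid<N>[Organism:exp]` term, which to my knowledge expands a TaxID to its descendant taxa, so a parent node such as 83333 would also match genes filed under its strain-level TaxIDs. The helper names (`build_gene_query`, `build_count_url`, `fetch_gene_count`) are hypothetical, and the property filters are copied from the query above. Treat this as a starting point, not a verified solution.

```python
import json
import urllib.parse
import urllib.request

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_gene_query(taxid: int, include_descendants: bool = True) -> str:
    """Build an NCBI Gene search term for protein-coding genes of a taxon."""
    # ":exp" asks Entrez to include child taxa; ":noexp" restricts to the node itself
    field = "Organism:exp" if include_descendants else "Organism:noexp"
    return (
        f"txid{taxid}[{field}] "
        'AND "source_genomic"[properties] '
        "AND gene_nucleotide_pos[filter] "
        'AND "genetype protein coding"[Properties] '
        "AND alive[prop]"
    )


def build_count_url(taxid: int, include_descendants: bool = True) -> str:
    """Return an ESearch URL that asks only for the number of matching records."""
    params = {
        "db": "gene",
        "term": build_gene_query(taxid, include_descendants),
        "rettype": "count",
        "retmode": "json",
    }
    return EUTILS_ESEARCH + "?" + urllib.parse.urlencode(params)


def fetch_gene_count(taxid: int, include_descendants: bool = True) -> int:
    """Query NCBI (network access required) and return the record count."""
    with urllib.request.urlopen(build_count_url(taxid, include_descendants), timeout=30) as resp:
        payload = json.load(resp)
    return int(payload["esearchresult"]["count"])
```

Calling `fetch_gene_count(83333)` would then return a single number for the whole K12 subtree. Note that this sums genes over every descendant strain, so the count may be considerably larger than the gene count of the one reference proteome UniProt keeps for that TaxID. For 16K proteomes, NCBI's request rate limits also apply.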

