I have a list of about 16K proteomes retrieved from UniProtKB, and I am trying to find the number of protein-coding genes for each of them. What I would like to end up with is a table like this:
UniProt TaxID    Organism                       Protein numbers    Protein-coding genes
83333            Escherichia coli strain K12    -------            ---------
I can fetch this information for some of the proteomes using EFetch, but that only works when the TaxID is the same in UniProt and GenBank (like 9606 for human). E. coli, for example, is problematic: in the NCBI Taxonomy, TaxID 83333 is a grouping node that collects multiple E. coli strains, and the EFetch output has no genes associated with that TaxID. Parsing the EFetch output by organism name instead is a pain, because UniProt and GenBank also differ slightly in the organism name (E. coli strain K12 in UniProt, E. coli str. K-12 in GenBank), and almost every proteome name has a slight variation that is different every time.
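To make the name mismatch concrete, here is a minimal sketch of the kind of normalization I would have to maintain. The helper `normalize_organism` is hypothetical (it is not part of any library), and its rules are an assumption that only covers the variants mentioned above ("strain" vs "str.", "K12" vs "K-12"), so it will certainly miss other cases.

```python
import re


def normalize_organism(name: str) -> str:
    """Reduce an organism name to a rough comparison key (hypothetical helper)."""
    key = name.lower()
    # Expand the GenBank-style abbreviations so they match the UniProt spelling.
    key = re.sub(r"\bstr\.", "strain", key)
    key = re.sub(r"\bsubstr\.", "substrain", key)
    # Drop punctuation such as hyphens, so "k-12" and "k12" compare equal.
    key = re.sub(r"[^a-z0-9 ]", "", key)
    # Collapse repeated whitespace.
    return " ".join(key.split())


uniprot_name = "Escherichia coli strain K12"
genbank_name = "Escherichia coli str. K-12"
same = normalize_organism(uniprot_name) == normalize_organism(genbank_name)
```

This handles the E. coli example, but every new naming variation would need another rule, which is why I would prefer not to match on names at all.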
Do you have any suggestions on how to achieve this?