I am trying to retrieve from NCBI all proteins that have something to do with polyketides. It seems easy enough: enter the search term "polyketide" into the Entrez form for protein searches and you get back a nice list that can be downloaded in different formats.
However, on closer inspection the list contains entries that have absolutely nothing to do with polyketides but were selected for other reasons. One example was selected because the title of one of its associated papers is "Phenolic lipids synthesized by type III polyketide synthase confer penicillin resistance on Streptomyces griseus". Looking at the descriptions of the annotated features of the sequence, one quickly finds that the described sequence is not what I am interested in; in fact, the term "polyketide" appears nowhere in them:
FEATURES             Location/Qualifiers
     source          1..500
                     /organism="Streptomyces griseus subsp. griseus NBRC 13350"
                     /strain="NBRC 13350"
                     /sub_species="griseus"
                     /db_xref="taxon:455632"
                     /synonym="Streptomyces griseus subsp. griseus IFO 13350"
     Protein         1..500
                     /product="methylmalonic acid semialdehyde dehydrogenase IolA"
                     /calculated_mol_wt=51906
     Region          5..482
                     /region_name="MMSDH"
                     [...]
ORIGIN
My simple idea to prevent this would be to search only in the annotated features of the entries. I tried quite a lot of combinations in the NCBI search builder but could not find one that gets me what I need.
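To make the idea concrete, here is a minimal sketch of the kind of field-restricted query I have in mind, written in Python. It only builds the search term and an E-utilities `esearch` URL; it does not contact NCBI. The assumption, which I have not verified against my result list, is that the `[Protein Name]` and `[Title]` field tags of the protein database cover what is shown in the feature annotation. The function names are my own.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_protein_query(keyword, fields=("Protein Name", "Title")):
    """Return an Entrez term that restricts `keyword` to the given fields.

    The default field tags are an assumption about which fields matter;
    they are not a verified equivalent of "search only in the features".
    """
    return " OR ".join(f"{keyword}[{field}]" for field in fields)


def build_esearch_url(term, retmax=20):
    """Return an ESearch URL for the protein database (no request is sent)."""
    params = {"db": "protein", "term": term, "retmax": retmax}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


term = build_protein_query("polyketide")
url = build_esearch_url(term)
print(term)
print(url)
```

The same term can be pasted directly into the Entrez search box, which makes it easy to compare the hit count with the unrestricted "polyketide" search.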
Have I overlooked something?
If it is not possible to get what I need from the NCBI search, I thought of downloading the results of the initial "polyketide" search as GenBank-formatted entries and then doing my own filtering. Are there any easy-to-use functions in BioPerl or EMBOSS that would do this without much programming?
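For illustration, the filtering I have in mind could be sketched as below. This is plain Python text handling, not BioPerl or EMBOSS, and it rests on simple assumptions: records end with a `//` line, the feature table sits between the `FEATURES` and `ORIGIN` lines, and the keyword is a single word that is not broken across wrapped lines. The two records in `SAMPLE` are made-up toy entries (`TOY_1`, `TOY_2`), and all function names are hypothetical.

```python
import re


def split_records(text):
    """Split GenBank/GenPept flat-file text into records ending with '//'."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "//":
            if current:
                records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    # Keep a trailing record that lacks the '//' terminator.
    if any(line.strip() for line in current):
        records.append("\n".join(current))
    return records


def feature_table(record):
    """Return the text between the FEATURES line and the ORIGIN line."""
    lines = record.splitlines()
    start = next((i for i, l in enumerate(lines) if l.startswith("FEATURES")), None)
    if start is None:
        return ""
    end = next((i for i in range(start + 1, len(lines))
                if lines[i].startswith("ORIGIN")), len(lines))
    return "\n".join(lines[start + 1:end])


def filter_records(text, keyword):
    """Keep only records whose feature table mentions `keyword` (case-insensitive)."""
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    return [r for r in split_records(text) if pattern.search(feature_table(r))]


# Toy input: TOY_1 mentions the keyword only in a reference title.
SAMPLE = """\
LOCUS       TOY_1                    500 aa
REFERENCE   1
  TITLE     Phenolic lipids synthesized by type III polyketide synthase
FEATURES             Location/Qualifiers
     Protein         1..500
                     /product="methylmalonic acid semialdehyde dehydrogenase IolA"
ORIGIN
//
LOCUS       TOY_2                    350 aa
FEATURES             Location/Qualifiers
     Protein         1..350
                     /product="type III polyketide synthase"
ORIGIN
//
"""

kept = filter_records(SAMPLE, "polyketide")
for record in kept:
    print(record.splitlines()[0])  # print the LOCUS line of each kept record
```

On the toy input only `TOY_2` is kept, because `TOY_1` mentions "polyketide" only in a reference title. A proper parser would be more robust than this line-based approach, which is why I am asking about existing tools.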