Question

Searching or filtering NCBI Entrez results for a keyword only in annotated features

11.3 years ago

Bach ▴ 550

Dear all,

I am trying to get from the NCBI all proteins which have something to do with polyketides. Easy enough it seems, simply enter the search term "polyketide" into the Entrez form for protein searches and one gets back a nice list which can be downloaded in different formats:

http://www.ncbi.nlm.nih.gov/protein?term=polyketide

However, upon closer inspection one will find entries which have absolutely nothing to do with polyketides, but were selected for other reasons. E.g.:

http://www.ncbi.nlm.nih.gov/protein/YP_001826350.1

which was selected because the title of one of the associated papers is "Phenolic lipids synthesized by type III polyketide synthase confer penicillin resistance on Streptomyces griseus". But when one looks at the descriptions of the annotated features of the sequence, one quickly finds that the described sequence is not of interest: the term "polyketide" appears nowhere in them:

FEATURES             Location/Qualifiers
 source          1..500
                 /organism="Streptomyces griseus subsp. griseus NBRC 13350"
                 /strain="NBRC 13350"
                 /sub_species="griseus"
                 /db_xref="taxon:455632"
                 /synonym="Streptomyces griseus subsp. griseus IFO 13350"
 Protein         1..500
                 /product="methylmalonic acid semialdehyde dehydrogenase IolA"
                 /calculated_mol_wt=51906
 Region          5..482
                 /region_name="MMSDH"
[...]
ORIGIN

My simple idea to prevent this would be to search only in the annotated features of the entries. I tried quite a lot of combinations with the NCBI search builder but wasn't able to find one that gets me what I need.

Have I overlooked something?

In case it is not possible to get what I need from the NCBI search, I thought of downloading the sequences for the initial "polyketide" search as GenBank-formatted entries and then doing my own filtering. Are there any easy-to-use functions in BioPerl or EMBOSS which would do that without much programming?

Best, B.

ncbi search filtering genbank bioperl • 3.7k views

updated 11.3 years ago by Neilfws 49k • written 11.3 years ago by Bach ▴ 550

Did you try to narrow your search using an Entrez field? E.g.: http://www.ncbi.nlm.nih.gov/nuccore?db=nuccore&cmd=search&term=polyketide[TITL]

11.3 years ago by Pierre Lindenbaum 161k

Yes, see my comment to Neil's answer.

11.3 years ago by Bach ▴ 550

score 2 · Answer 1 · 2013-01-22

There are a couple of ways to approach this.

First, try:

polyketide[Title]

as your search. The qualifier "Title" searches for words in the DEFINITION line of the GenPept record. For example:

DEFINITION  Polyketide synthetase MbtC (polyketide synthase) [Mycobacterium
            canettii CIPT 140070017].
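If you want to run the same fielded search from a script rather than the web form, the query can be sent to the NCBI E-utilities `esearch` endpoint. The following is only a minimal sketch of how the URL could be built: the helper name `build_esearch_url` and the `retmax` default are my own choices for illustration, and no request is actually sent here.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities esearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(keyword, field="Title", db="protein", retmax=20):
    """Return an esearch URL restricting `keyword` to one Entrez field.

    Hypothetical helper for illustration; it only builds the URL.
    """
    # Entrez field restriction uses the form keyword[Field].
    term = f"{keyword}[{field}]"
    params = {"db": db, "term": term, "retmax": retmax}
    # urlencode percent-encodes the square brackets for us.
    return f"{ESEARCH}?{urlencode(params)}"


url = build_esearch_url("polyketide")
print(url)
```

The resulting URL can be fetched with any HTTP client; please check the NCBI usage guidelines on request rates before running it in a loop.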

Second: yes, you could quite easily filter records using one of the Bio* project libraries, such as BioPerl, depending on what you call "not much programming." You could search for "polyketide" in the /product tag of either the GenPept or the corresponding GenBank entry, for example:

CDS             1185534..1187168
                /gene="pks"
                /locus_tag="BN45_30045"
                /EC_number="2.3.1.86"
                /inference="ab initio prediction:AMIGene:2.0"
                /note="Evidence 3 : Function proposed based on presence of
                 conserved amino acid motif, structural feature or limited
                 homology; PubMedId : 11929527, 15525680; Product type pe :
                 putative enzyme"
                /codon_start=1
                /transl_table=11
                /product="Putative polyketide synthase Pks16"
                /protein_id="CCK63145.1"

See the BioPerl feature annotation HOWTO.
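To give an idea of how little code the /product filtering needs, here is a rough sketch in plain Python rather than BioPerl. It is not a real GenBank parser: it assumes records are separated by `//` lines and simply pulls out every `/product="..."` value with a regular expression, so a Bio* library is the safer choice for production use. The function names and the two heavily trimmed sample records (built from the excerpts shown in this thread) are for illustration only.

```python
import re

# Matches /product="..." qualifiers; the value may wrap over several lines.
# Assumption: values contain no embedded double quotes.
PRODUCT_RE = re.compile(r'/product="([^"]*)"')


def split_records(text):
    """Split GenBank/GenPept flat-file text into records at '//' lines."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "//":
            if current:
                records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if any(line.strip() for line in current):
        records.append("\n".join(current))
    return records


def products(record):
    """Return all /product values, with wrapped lines joined by one space."""
    return [" ".join(value.split()) for value in PRODUCT_RE.findall(record)]


def filter_by_product(text, keyword):
    """Keep only records whose /product qualifier mentions the keyword."""
    kw = keyword.lower()
    return [rec for rec in split_records(text)
            if any(kw in p.lower() for p in products(rec))]


# Trimmed, illustrative records: the first mentions "polyketide" only in a
# paper title, the second in its /product qualifier.
SAMPLE = '''LOCUS       YP_001826350
REFERENCE   1
  TITLE     Phenolic lipids synthesized by type III polyketide synthase
            confer penicillin resistance on Streptomyces griseus
FEATURES             Location/Qualifiers
     Protein         1..500
                     /product="methylmalonic acid semialdehyde dehydrogenase IolA"
//
LOCUS       CCK63145
FEATURES             Location/Qualifiers
     CDS             1185534..1187168
                     /product="Putative polyketide synthase Pks16"
//
'''

kept = filter_by_product(SAMPLE, "polyketide")
for rec in kept:
    print(rec.splitlines()[0])
```

Because only the /product values are inspected, the first record is dropped even though "polyketide" appears in its reference title, which is the behaviour asked for in the question.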