Question: Searching or filtering NCBI Entrez results for a keyword only in annotated features
6.5 years ago by
Bach550
Bach550 wrote:

Dear all,

I am trying to get from the NCBI all proteins which have something to do with polyketides. Easy enough it seems, simply enter the search term "polyketide" into the Entrez form for protein searches and one gets back a nice list which can be downloaded in different formats:

http://www.ncbi.nlm.nih.gov/protein?term=polyketide

However, upon closer inspection one will find entries which have absolutely nothing to do with polyketides, but were selected for other reasons. E.g.:

http://www.ncbi.nlm.nih.gov/protein/YP_001826350.1

which was selected because the title of one of the associated papers says "Phenolic lipids synthesized by type III polyketide synthase confer penicillin resistance on Streptomyces griseus". But when one looks at the descriptions of the annotated features of the sequence, one quickly finds that the described sequence is nothing one is interested in; in fact, the term "polyketide" appears nowhere:

FEATURES             Location/Qualifiers
 source          1..500
                 /organism="Streptomyces griseus subsp. griseus NBRC 13350"
                 /strain="NBRC 13350"
                 /sub_species="griseus"
                 /db_xref="taxon:455632"
                 /synonym="Streptomyces griseus subsp. griseus IFO 13350"
 Protein         1..500
                 /product="methylmalonic acid semialdehyde dehydrogenase IolA"
                 /calculated_mol_wt=51906
 Region          5..482
                 /region_name="MMSDH"
[...]
ORIGIN

My simple idea to prevent this would be to search only in annotated features of the entries. I tried quite a lot of combinations with the NCBI search builder but wasn't able to find something which would get me what I need.

Have I overlooked something?

In case it is not possible to get what I need from the NCBI search, I thought of downloading the sequences for the initial "polyketide" search as GenBank-formatted entries and then doing my own filtering. Is there any easy-to-use functionality in BioPerl or EMBOSS which would do that without much programming?

Best, B.


Did you try to narrow your search using an Entrez field, e.g.: http://www.ncbi.nlm.nih.gov/nuccore?db=nuccore&cmd=search&term=polyketide[TITL]

written 6.5 years ago by Pierre Lindenbaum121k

Yes, see my comment to Neil's answer.

written 6.5 years ago by Bach550
6.5 years ago by
Neilfws48k
Sydney, Australia
Neilfws48k wrote:

There are a couple of ways to approach this.

First, try:

polyketide[Title]

as your search. The qualifier "Title" searches for words in the DEFINITION line of the GenPept record. For example:

DEFINITION  Polyketide synthetase MbtC (polyketide synthase) [Mycobacterium
            canettii CIPT 140070017].
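
If you want to run the same field-restricted search from a script rather than the web form, the query can be passed to the NCBI E-utilities ESearch endpoint. The following is only a minimal sketch that builds the URL and makes no network request; the function name build_title_query and the retmax default are my own assumptions for illustration.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_title_query(keyword, db="protein", retmax=20):
    """Return an ESearch URL restricting `keyword` to the Title field.

    Hypothetical helper: the name and the defaults are illustrative only.
    """
    params = {
        "db": db,
        "term": f"{keyword}[Title]",  # same syntax as in the Entrez search box
        "retmax": retmax,             # maximum number of IDs to return
    }
    # urlencode percent-encodes the square brackets of the field tag.
    return f"{ESEARCH}?{urlencode(params)}"


url = build_title_query("polyketide")
print(url)
```

Fetching that URL returns a list of matching record IDs, which would then still have to be downloaded in a second step.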

Second: yes, you could quite easily filter records using one of the Bio* project libraries, such as BioPerl, depending on what you call "not much programming." For example, you could search for "polyketide" in the /product tag of either the GenPept or the corresponding GenBank entry:

CDS             1185534..1187168
                /gene="pks"
                /locus_tag="BN45_30045"
                /EC_number="2.3.1.86"
                /inference="ab initio prediction:AMIGene:2.0"
                /note="Evidence 3 : Function proposed based on presence of
                 conserved amino acid motif, structural feature or limited
                 homology; PubMedId : 11929527, 15525680; Product type pe :
                 putative enzyme"
                /codon_start=1
                /transl_table=11
                /product="Putative polyketide synthase Pks16"
                /protein_id="CCK63145.1"

See the Bioperl feature annotation HOWTO.
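
To make the idea concrete without committing to a particular Bio* library, here is a rough sketch in plain Python (standard library only, so not the BioPerl route described in the HOWTO). It splits downloaded flat-file text into records and keeps only those where the keyword occurs in selected qualifiers of the FEATURES table. The function names, the default qualifier list and the abbreviated sample records are assumptions for illustration, and the parser only handles quoted multi-line values; a real script would better rely on a proper GenBank parser.

```python
import re

# Matches a qualifier line such as /product="..." or a bare flag like /pseudo.
QUALIFIER = re.compile(r"/(\w+)(?:=(.*))?$")


def split_records(text):
    """Split concatenated flat-file text on the '//' record terminator."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "//":
            if current:
                records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if any(l.strip() for l in current):  # last record without terminator
        records.append("\n".join(current))
    return records


def feature_qualifiers(record):
    """Yield (name, value) pairs found in the FEATURES table of one record."""
    in_features = False
    name, parts = None, []
    for line in record.splitlines():
        if not in_features:
            in_features = line.startswith("FEATURES")
            continue
        if line and not line[0].isspace():
            break  # next top-level section, e.g. ORIGIN
        stripped = line.strip()
        # An odd number of quotes so far means the value continues here.
        if name is not None and "".join(parts).count('"') % 2 == 1:
            parts.append(stripped)
            continue
        if name is not None:
            yield name, " ".join(parts).strip('"')
            name, parts = None, []
        match = QUALIFIER.match(stripped)
        if match:
            name, parts = match.group(1), [match.group(2) or ""]
    if name is not None:
        yield name, " ".join(parts).strip('"')


def filter_records(text, keyword, qualifiers=("product", "note", "region_name")):
    """Keep records whose chosen qualifiers mention `keyword` (case-insensitive)."""
    keyword = keyword.lower()
    return [
        rec for rec in split_records(text)
        if any(n in qualifiers and keyword in v.lower()
               for n, v in feature_qualifiers(rec))
    ]


# Abbreviated stand-ins for the two entries discussed in this thread;
# the first /product value is wrapped to exercise the continuation handling.
SAMPLE = '''LOCUS       YP_001826350
REFERENCE   1
  TITLE     Phenolic lipids synthesized by type III polyketide synthase
            confer penicillin resistance on Streptomyces griseus
FEATURES             Location/Qualifiers
     Protein         1..500
                     /product="methylmalonic acid semialdehyde dehydrogenase
                     IolA"
                     /calculated_mol_wt=51906
ORIGIN
//
LOCUS       CCK63145
FEATURES             Location/Qualifiers
     CDS             1185534..1187168
                     /gene="pks"
                     /product="Putative polyketide synthase Pks16"
                     /protein_id="CCK63145.1"
ORIGIN
//
'''

kept = filter_records(SAMPLE, "polyketide")
for rec in kept:
    print(rec.splitlines()[0])
```

Because only the FEATURES table is scanned, the first sample record is dropped even though "polyketide" appears in the title of its reference.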


Searching via TITLE is too strong a restriction, as it misses entries like EGC38264.1 (Title: "hypothetical protein DICPUDRAFT_149064 [Dictyostelium purpureum]"), which does not have polyketide in the title but has a nicely annotated polyketide synthase in the FEATURES section. I suppose I'll have to write a quick BioPerl script, though I had hoped this was a frequent enough problem that someone had already done this (or that the NCBI had something I overlooked).

written 6.5 years ago by Bach550

If you are looking for members of a particular protein family, you might consider parsing for annotated domains (such as InterPro features), or searching databases for proteins with the relevant domain(s). If you just want, as you stated initially, "something to do with polyketides", that is a more difficult problem and you will have to deal with the ambiguity.

written 6.5 years ago by Neilfws48k