Searching Or Filtering NCBI Entrez Results For A Keyword Only In Annotated Features
11.3 years ago
Bach ▴ 550

Dear all,

I am trying to retrieve from the NCBI all proteins that have something to do with polyketides. It seems easy enough: enter the search term "polyketide" into the Entrez form for protein searches and you get back a nice list that can be downloaded in different formats:

http://www.ncbi.nlm.nih.gov/protein?term=polyketide

However, upon closer inspection one will find entries which have absolutely nothing to do with polyketides, but were selected for other reasons. E.g.:

http://www.ncbi.nlm.nih.gov/protein/YP_001826350.1

which was selected because the title of one of the associated papers is "Phenolic lipids synthesized by type III polyketide synthase confer penicillin resistance on Streptomyces griseus". But when one looks at the descriptions of the annotated features of the sequence, one quickly finds that the sequence is of no interest here; in fact, the term "polyketide" appears nowhere:

FEATURES             Location/Qualifiers
 source          1..500
                 /organism="Streptomyces griseus subsp. griseus NBRC 13350"
                 /strain="NBRC 13350"
                 /sub_species="griseus"
                 /db_xref="taxon:455632"
                 /synonym="Streptomyces griseus subsp. griseus IFO 13350"
 Protein         1..500
                 /product="methylmalonic acid semialdehyde dehydrogenase IolA"
                 /calculated_mol_wt=51906
 Region          5..482
                 /region_name="MMSDH"
[...]
ORIGIN

My simple idea to prevent this would be to search only in annotated features of the entries. I tried quite a lot of combinations with the NCBI search builder but wasn't able to find something which would get me what I need.

Have I overlooked something?

In case the NCBI search cannot give me what I need, I thought of downloading the hits of the initial "polyketide" search as GenBank-formatted entries and then filtering them myself. Are there any easy-to-use functions in BioPerl or EMBOSS that would do this without much programming?

Best, B.

ncbi search filtering genbank bioperl • 3.7k views

Did you try to narrow your search using an Entrez field? E.g.: http://www.ncbi.nlm.nih.gov/nuccore?db=nuccore&cmd=search&term=polyketide[TITL]


Yes, see my comment to Neil's answer.

11.3 years ago
Neilfws 49k

There are a couple of ways to approach this.

First, try:

polyketide[Title]

as your search. The qualifier "Title" restricts the search to words in the DEFINITION line of the GenPept record. For example:

DEFINITION  Polyketide synthetase MbtC (polyketide synthase) [Mycobacterium
            canettii CIPT 140070017].
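The same field-restricted query can also be issued programmatically. The following is a minimal sketch that only builds an NCBI E-utilities ESearch URL for a term such as polyketide[Title]; the helper name esearch_url and its defaults are hypothetical, and it assumes the field tag is accepted by ESearch in the same way as in the web form. It does not send the request.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, field=None, db="protein", retmax=20):
    """Build an ESearch URL; `field` is an Entrez field tag such as "Title".

    Hypothetical helper for illustration: it only assembles the query
    string and performs no network access.
    """
    query = f"{term}[{field}]" if field else term
    params = {"db": db, "term": query, "retmax": retmax}
    # urlencode percent-encodes the square brackets of the field tag.
    return EUTILS_ESEARCH + "?" + urlencode(params)


# Unrestricted search versus a search limited to the title (DEFINITION line).
print(esearch_url("polyketide"))
print(esearch_url("polyketide", field="Title"))
```

The resulting URL can be opened in a browser or fetched with any HTTP client; the response lists matching record IDs, which can then be downloaded in GenPept format.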

Second: yes, you could quite easily filter records using one of the Bio* project libraries, such as BioPerl, depending on what you call "not much programming." You could, for instance, search for "polyketide" in the /product tag of either the GenPept or the corresponding GenBank entry:

CDS             1185534..1187168
                /gene="pks"
                /locus_tag="BN45_30045"
                /EC_number="2.3.1.86"
                /inference="ab initio prediction:AMIGene:2.0"
                /note="Evidence 3 : Function proposed based on presence of
                 conserved amino acid motif, structural feature or limited
                 homology; PubMedId : 11929527, 15525680; Product type pe :
                 putative enzyme"
                /codon_start=1
                /transl_table=11
                /product="Putative polyketide synthase Pks16"
                /protein_id="CCK63145.1"

See the Bioperl feature annotation HOWTO.
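As an illustration of the filtering idea (not the BioPerl approach from the HOWTO), here is a minimal sketch in plain Python that keeps only records whose FEATURES qualifiers mention a keyword. It assumes the standard GenBank/GenPept flat-file layout with records terminated by "//"; the function names, the default list of qualifiers to inspect, and the two trimmed sample records (EXAMPLE_1, EXAMPLE_2) are all hypothetical. A real parser such as BioPerl or Biopython handles many edge cases this sketch ignores.

```python
import re

# A qualifier line looks like /product="..." or a bare flag like /pseudo.
QUALIFIER_RE = re.compile(r'^/(\w+)(?:=(.*))?$')


def split_records(text):
    """Yield each flat-file record as a list of lines (terminator '//' dropped)."""
    record = []
    for line in text.splitlines():
        if line.startswith("//"):
            if record:
                yield record
            record = []
        else:
            record.append(line)
    if any(l.strip() for l in record):
        yield record


def feature_qualifiers(record_lines):
    """Return (feature_key, qualifier, value) tuples from the FEATURES table only."""
    found, in_features, continuing = [], False, False
    key, key_indent, qual_indent = None, 0, 0
    for line in record_lines:
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if not in_features or not line.strip():
            continue
        if not line[0].isspace():
            break  # next top-level section (e.g. ORIGIN) ends the table
        text = line.strip()
        indent = len(line) - len(line.lstrip())
        match = QUALIFIER_RE.match(text)
        if match:  # caveat: a wrapped value line starting with "/" is misread
            found.append([key, match.group(1), match.group(2) or ""])
            qual_indent, continuing = indent, True
        elif continuing and indent >= qual_indent:
            found[-1][2] += " " + text  # wrapped qualifier value
        elif key is not None and indent > key_indent:
            continue  # wrapped location line
        else:
            key, key_indent, continuing = text.split()[0], indent, False
    return [(k, n, v.strip('"')) for k, n, v in found]


def record_mentions(record_lines, keyword,
                    qualifiers=("product", "region_name", "note")):
    """True if the keyword occurs (case-insensitively) in a chosen qualifier."""
    needle = keyword.lower()
    return any(name in qualifiers and needle in value.lower()
               for _key, name, value in feature_qualifiers(record_lines))


def filter_records(text, keyword, qualifiers=("product", "region_name", "note")):
    """Return the records (as strings) whose feature qualifiers mention keyword."""
    return ["\n".join(rec) + "\n//\n" for rec in split_records(text)
            if record_mentions(rec, keyword, qualifiers)]


# Two trimmed, hypothetical records: the first mentions the keyword only in a
# reference title, the second in a /product qualifier wrapped over two lines.
SAMPLE = """\
LOCUS       EXAMPLE_1
  TITLE     Phenolic lipids synthesized by type III polyketide synthase
FEATURES             Location/Qualifiers
     Protein         1..500
                     /product="methylmalonic acid semialdehyde dehydrogenase IolA"
ORIGIN
//
LOCUS       EXAMPLE_2
FEATURES             Location/Qualifiers
     CDS             1185534..1187168
                     /gene="pks"
                     /product="Putative polyketide
                     synthase Pks16"
ORIGIN
//
"""

for rec in filter_records(SAMPLE, "polyketide"):
    print(rec.splitlines()[0])
```

To apply this to a real download, read the GenPept file into a string and pass it to filter_records; narrowing the qualifiers argument to ("product",) restricts the match to the /product tag discussed above.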


Searching by title is too strong a restriction, as it misses entries like EGC38264.1 (title: "hypothetical protein DICPUDRAFT_149064 [Dictyostelium purpureum]"), which does not have "polyketide" in the title but does have a nicely annotated polyketide synthase in the FEATURES section. I suppose I'll have to write a quick BioPerl script, though I had hoped this was a frequent enough problem that someone had already done it (or that the NCBI had something I overlooked).


If you are looking for members of a particular protein family, you might consider parsing for annotated domains (such as InterPro features), or searching databases for proteins with the relevant domain(s). If you just want, as you stated initially, "something to do with polyketides", that is a more difficult problem, and you will have to deal with the ambiguity.
