Question: Identical Queries On NCBI Gene And Protein Databases Return Fewer Results From The Gene Database
Michael Barton (1.8k), Akron, Ohio, United States, wrote 7.4 years ago:

Possibly a stupid question, but I'm not getting the same results for the same query on the NCBI Protein database as on the Gene database: the Protein query returns a larger set than the Gene query. I assumed that each protein result must have a corresponding gene entry in the database. Any idea how to get the nucleotide sequences for the genes using my query? Here is the query string:

gyrB[Gene] OR (DNA gyrase subunit B[Protein]) AND Pseudomonas[Primary Organism] NOT partial

ncbi database search • 2.1k views
Neilfws (48k), Sydney, Australia, wrote 7.4 years ago:

First, you cannot use the same query on both databases, because they use different terms. PROT (Protein) and PORG (Primary Organism) are specific to the Protein database. The equivalent terms for the Gene database might be TITL (Gene/Protein Name) or PFN (Protein Full Name) and ORGN (Organism). See this list of Entrez databases and their terms.
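
As a rough illustration of the point about differing field terms, the sketch below builds one query per database from a small tag table. The tags are the ones named in this answer and in the original question; the helper names (`FIELD_TAGS`, `build_query`, `einfo_url`) are invented for this example, and the added parentheses are an assumption about how the original query was meant to group. The `einfo` URL points at the E-utilities endpoint that reports a database's search fields, which is one way to check the tags yourself.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Field tags per database, as named in the answer above. The Gene entries are
# the "might be" equivalents it suggests, not a verified one-to-one mapping.
FIELD_TAGS = {
    "protein": {"name": "Protein", "organism": "Primary Organism"},
    "gene": {"name": "Protein Full Name", "organism": "Organism"},
}


def build_query(db, gene="gyrB", product="DNA gyrase subunit B", genus="Pseudomonas"):
    """Return the search string for one database (hypothetical helper)."""
    tags = FIELD_TAGS[db]
    # [Gene] is kept as written in the original query for both databases.
    return (
        f"({gene}[Gene] OR {product}[{tags['name']}]) "
        f"AND {genus}[{tags['organism']}] NOT partial"
    )


def einfo_url(db):
    """URL of the EInfo report listing the search fields of `db`."""
    return f"{EUTILS}/einfo.fcgi?{urlencode({'db': db})}"


if __name__ == "__main__":
    for db in FIELD_TAGS:
        print(db, "->", build_query(db))
        print("   fields:", einfo_url(db))
```

Keeping the tags in one table makes it obvious which parts of the query are database-specific and which are shared.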

Run the Protein query (145 results), then look on the right side of the page for "Find related data", choose "Gene" and click "Find items": 33 results are returned. This is almost the same number as for your Gene query (34), so there is some kind of mapping between the two.
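
The "Find related data" step has a programmatic counterpart in the E-utilities ELink service. The sketch below is only an illustration: `dbfrom`, `db` and `id` are ELink parameters, but the function names are invented here, the parser assumes the usual `LinkSet` / `IdList` / `LinkSetDb` layout of the XML reply, and any UIDs you pass in are your own. Counts returned today will probably differ from the 33 and 34 quoted above.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def elink_protein_to_gene_url(protein_uids):
    """Build an ELink URL mapping Protein UIDs to Gene UIDs."""
    params = [("dbfrom", "protein"), ("db", "gene")]
    # One id= parameter per UID, so each protein gets its own LinkSet in the reply.
    params += [("id", str(uid)) for uid in protein_uids]
    return f"{EUTILS}/elink.fcgi?{urlencode(params)}"


def genes_by_protein(elink_xml):
    """Map each Protein UID in an ELink reply to its linked Gene UIDs.

    A protein with no Gene record maps to an empty list.
    """
    mapping = {}
    for linkset in ET.fromstring(elink_xml).iter("LinkSet"):
        proteins = [e.text for e in linkset.findall("./IdList/Id")]
        genes = [e.text for e in linkset.findall("./LinkSetDb/Link/Id")]
        for uid in proteins:
            mapping[uid] = genes
    return mapping


def fetch_gene_links(protein_uids):
    """Call ELink over the network and return the parsed mapping."""
    with urlopen(elink_protein_to_gene_url(protein_uids)) as resp:
        return genes_by_protein(resp.read().decode("utf-8"))
```

Proteins that come back with an empty list are the ones that have no linked Gene record, which is one way to see where the two result sets diverge.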

I would not necessarily expect each protein to have a corresponding gene, or vice-versa; it all depends on how each database is curated and maintained. Probably best to read up on the documentation for each of the databases, to see if there's any mention of potential causes for discrepancy.


Thanks Neil. I resorted to writing a script that parses the gene location out of each protein's GenBank file and then fetches that sequence from the database.

Reply written 7.4 years ago by Michael Barton (1.8k)
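
The script mentioned in the reply above was not posted, so the following is only a guess at the general approach, not the author's code. It assumes the protein record is in GenPept flat-file format with a simple `/coded_by="ACCESSION:start..stop"` qualifier (optionally wrapped in `complement(...)`), and it builds an EFetch URL for that region using the `seq_start`, `seq_stop` and `strand` parameters. The names `parse_coded_by` and `efetch_cds_url` are hypothetical, and `join(...)` locations or qualifiers wrapped over several lines are deliberately not handled.

```python
import re
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Matches /coded_by="NC_000000.1:100..2520" and the complement(...) variant.
CODED_BY = re.compile(
    r'/coded_by="(?P<comp>complement\()?'
    r'(?P<acc>[\w.]+):<?(?P<start>\d+)\.\.>?(?P<stop>\d+)\)?"'
)


def parse_coded_by(genpept_text):
    """Extract the coding location from a GenPept record, or None if absent."""
    match = CODED_BY.search(genpept_text)
    if match is None:
        return None
    return {
        "accession": match.group("acc"),
        "start": int(match.group("start")),
        "stop": int(match.group("stop")),
        # EFetch uses strand=1 for the plus strand and strand=2 for the minus strand.
        "strand": 2 if match.group("comp") else 1,
    }


def efetch_cds_url(location):
    """Build an EFetch URL returning the coding region as nucleotide FASTA."""
    params = {
        "db": "nuccore",
        "id": location["accession"],
        "rettype": "fasta",
        "retmode": "text",
        "seq_start": location["start"],
        "seq_stop": location["stop"],
        "strand": location["strand"],
    }
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"
```

Because this starts from the protein records, it also covers proteins that have no entry in the Gene database.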

Nice response! Where'd you get that (excellent) list? Is it updated regularly? Who maintains it? Where does the underlying data come from?

Reply written 7.3 years ago by Chris Maloney (330)
Will (4.5k), United States, wrote 7.4 years ago:

Probably because there are multiple protein entries for each gene, i.e. alternative splicing variants.


Many species returned in the protein set are not present in the gene set.

Reply written 7.4 years ago by Michael Barton (1.8k)