Question: is gene2go incomplete?
8 months ago by
sinifdosyalari12h20 wrote:

I am using the gene2go file to obtain GO terms and the Entrez Gene IDs associated with those GO terms.

To check that gene2go is complete, I had a program count the GO terms with tax ID 9606, and the result was around 17k. When I ran an SQL query on the GO database to see how many GO terms are associated with human, the result was around 19k.
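For reference, the counting step can be sketched roughly as follows. This is a minimal sketch, not the exact program I ran: it assumes the standard tab-separated gene2go layout (tax ID, gene ID and GO ID in the first three columns, header line starting with `#`), and the function name `go_terms_for_taxon` is just made up for illustration.

```python
import csv
import io


def go_terms_for_taxon(lines, tax_id="9606"):
    """Collect the distinct GO IDs listed for one taxon in gene2go-style rows."""
    terms = set()
    # QUOTE_NONE so that quote characters in later text columns are left alone.
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 3 or row[0].startswith("#"):
            continue  # skip the header line and malformed rows
        if row[0] == tax_id:
            terms.add(row[2])
    return terms


# Tiny in-memory sample: the human rows are taken from the excerpt in this post,
# the last row is a placeholder for another taxon, not real data.
sample = io.StringIO(
    "#tax_id\tGeneID\tGO_ID\n"
    "9606\t79084\tGO:1903508\n"
    "9606\t79085\tGO:0005509\n"
    "9606\t79085\tGO:0005739\n"
    "9606\t79086\tGO:0016021\n"
    "10090\t0\tGO:0005739\n"
)
print(len(go_terms_for_taxon(sample)))  # 4
```

For the real file, something like `gzip.open("gene2go.gz", "rt")` could be passed in place of `sample` (the path is an assumption about where the download was saved).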

I then compared these two datasets. They share the majority of their terms, but not all.
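The comparison itself is just set arithmetic. A minimal sketch, assuming both term lists have already been loaded into Python sets (the helper name `compare_term_sets` and the toy sets below are only for illustration, not the real 17k/19k lists):

```python
def compare_term_sets(gene2go_terms, db_terms):
    """Split two collections of GO IDs into shared and source-specific terms."""
    gene2go_terms, db_terms = set(gene2go_terms), set(db_terms)
    return {
        "shared": gene2go_terms & db_terms,
        "only_in_gene2go": gene2go_terms - db_terms,
        "only_in_db": db_terms - gene2go_terms,
    }


# Toy input for illustration only.
gene2go_terms = {"GO:0005509", "GO:0005739"}
db_terms = {"GO:0005509", "GO:0005739", "GO:0051503"}

result = compare_term_sets(gene2go_terms, db_terms)
for name, terms in result.items():
    print(name, sorted(terms))
```

The `only_in_db` set is the one that matters for this question, since it holds the terms the database returns but gene2go does not.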

At first I thought that some of the terms the database gave me didn't have any genes annotated to them, and that this was why those terms weren't in the gene2go file.

So I took some terms that are in the database but not in gene2go to see if that was the case.

One of those terms was GO:0051503. I ran an SQL query on the GO database to see if there were any genes annotated to this term. The SQL query is given below:

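SELECT
 gene_product.symbol AS gp_symbol,
 gene_product.symbol AS gp_full_name
FROM term
 -- join keys as in the standard GO MySQL schema
 INNER JOIN association ON (term.id = association.term_id)
 INNER JOIN gene_product ON (association.gene_product_id = gene_product.id)
 INNER JOIN species ON (gene_product.species_id = species.id)
 INNER JOIN dbxref ON (gene_product.dbxref_id = dbxref.id)
WHERE
 term.acc = 'GO:0051503'
 AND species.ncbi_taxa_id = '9606';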

This query returns the gene with symbol SLC25A23, which has Entrez Gene ID 79085.

But when I look at the latest gene2go file, there is no row with tax ID 9606, GO ID GO:0051503 and gene ID 79085. The part of the file where this entry should be is shown below:

taxID entrez goID
9606 79084 GO:1903508
9606 79085 GO:0002082
9606 79085 GO:0005347
9606 79085 GO:0005509
9606 79085 GO:0005515
9606 79085 GO:0005739
9606 79085 GO:0006851
9606 79085 GO:0015866
9606 79085 GO:0015867
9606 79085 GO:0036444
9606 79085 GO:0043457
9606 79085 GO:0051282
9606 79085 GO:0071277
9606 79085 GO:0097274
9606 79086 GO:0016021

As you can see, GO:0051503 is not there.
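The same check can be done programmatically instead of by eye. A minimal sketch that parses the excerpt above exactly as pasted (whitespace-separated, three columns); the helper name `go_ids_for_gene` is made up for illustration:

```python
# The gene2go excerpt from this post, three whitespace-separated columns.
EXCERPT = """\
9606 79084 GO:1903508
9606 79085 GO:0002082
9606 79085 GO:0005347
9606 79085 GO:0005509
9606 79085 GO:0005515
9606 79085 GO:0005739
9606 79085 GO:0006851
9606 79085 GO:0015866
9606 79085 GO:0015867
9606 79085 GO:0036444
9606 79085 GO:0043457
9606 79085 GO:0051282
9606 79085 GO:0071277
9606 79085 GO:0097274
9606 79086 GO:0016021
"""


def go_ids_for_gene(text, tax_id, gene_id):
    """Return the GO IDs listed for one (tax ID, gene ID) pair."""
    ids = set()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == tax_id and fields[1] == gene_id:
            ids.add(fields[2])
    return ids


ids = go_ids_for_gene(EXCERPT, "9606", "79085")
print(len(ids))             # 13
print("GO:0051503" in ids)  # False
```

Because only the first three fields are read, the same function should also work on lines from the full tab-separated file.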

Can anyone explain why that is? I tried using a local GO database which I set up 2 days ago, and I also tried the GOOSE tool of AmiGO to run my SQL queries. Therefore it can't be explained by the database being out of date. The only explanation seems to be that gene2go is incomplete, but that doesn't make sense.

modified 7 months ago by Biostar ♦♦ 20 • written 8 months ago by sinifdosyalari12h20

No one can claim that databases are going to be complete. There may be processing glitches (it is not humanly possible to check each row), errors (humans are involved at some steps, or data retrieval programs may be the cause) or simply missing information in one of the databases. I don't think your users can blame you as long as you keep a record of where the data is coming from and what you are doing to process/present it.

You can submit a ticket to the NCBI Help Desk with this diagnostic information and see what they say. It may take them 2-4 business days to respond (they must get a ton of requests from all around the world), but they do respond.

modified 8 months ago • written 8 months ago by genomax67k

But I don't think that this problem is limited to this GO ID. I have found at least another 60 GO terms that return a gene in the SQL query but do not return a line from the gene2go file. Maybe 60 is still an acceptable margin of error; I'm not sure.

written 8 months ago by sinifdosyalari12h20

This isn't a margin of error. There is simply no data for those terms in gene2go. Those must still represent a very small fraction of the total number you have, correct?

I took a look at the example you provided.

9606 is the human taxID. 79085 is the entry for the SLC25A23 gene in the NCBI Gene database. That page says the GO terms come from GOA (which is provided by EMBL-EBI, which also hosts Ensembl). The Ensembl page for the gene does list GO:0051503. (Click on GO: Biological Process in the left navigation pane.) So all the information is there, just not in gene2go as yet.

modified 8 months ago • written 8 months ago by genomax67k

So there is no problem with me using gene2go, and it sometimes takes a few days for gene2go to catch up with the latest information?

written 8 months ago by sinifdosyalari12h20

It should be OK to use gene2go. You may want to send the 60 IDs in to NCBI to see if they just happen to be missing by chance/error. At least one of them checks out, as noted above.

modified 8 months ago • written 8 months ago by genomax67k