R's GO.db and Python's go_db gives different ancestors, which is correct?
1
2
Entering edit mode
9.1 years ago

This is a long post; I am trying to find out why R's GO.db gives different results than Python's go_db https://github.com/endrebak/go_db.

go_db seems to give different results than the R package GO.db.

go_db computes ancestors by looking upwards in the tree; so that C is an ancestor of A if A is a C, A is part of C or C has part A or any such relation exists transitively.

Since go_db is a dinky script I threw together quickly and GO.db is the seemingly canonical Gene Ontology package I am guessing I am in the wrong, but if anyone can explain why, I would appreciate it. It is probably my usage of GO.db that is incorrect, I'd wager.

To run the scripts below you need to install wide_diapeR and go_db with

pip install -U go_db wide_diaper

and have the R package GO.db installed

from __future__ import print_function
from widediaper import R
from go_db import get_offspring

r = R()
r.load_library("GO.db")
r("cc_offspring = as.data.frame(GOCCOFFSPRING)")
r_cc_offspring = r.get("cc_offspring")
r_cc_offspring.columns = ["Offspring", "Parent"]

py_cc_offspring = get_offspring("CC")

py_go_0001533 = py_cc_offspring.loc[py_cc_offspring.Offspring == "GO:0001533"]
r_go_0001533 = r_cc_offspring.loc[r_cc_offspring.Offspring == "GO:0001533"]

# Parents in GO.db, but not go_db
set(r_go_0001533.Parent) - set(py_go_0001533.Parent)
# {'GO:0005622',
#  'GO:0005856',
#  'GO:0043226',
#  'GO:0043228',
#  'GO:0043229',
#  'GO:0043232',
#  'GO:0044424'}

# Parents in go_db, but not GO.db
set(py_go_0001533.Parent) - set(r_go_0001533.Parent)
# {'GO:0005886',
#  'GO:0012505',
#  'GO:0016020',
#  'GO:0030313',
#  'GO:0031975',
#  'GO:0071944',
#  'GO:0097653'}

As you can see above, there are great differences in what nodes the two packages consider to be ancestors of "GO:0001533".

Let us look at the go.obo file and see if we can make sense of this. The script below prints out all the ancestors of a node (including the node itself):

# download obo file from http://purl.obolibrary.org/obo/go.obo
obo_file = '/Users/labsenter/.go_db/go.obo' # Use the path to whereever you are storing your .obo file
obo_records = "".join(open(obo_file).readlines()).split("\n\n[Term]")[1:-1]

py_ancestors = py_cc_offspring.loc[py_cc_offspring.Offspring == "GO:0001533"].Parent
py_ancestors.append("GO:0001533")
py_ancestor_search_terms = ["id: " + ancestor for ancestor in py_ancestors]

r_ancestors = list(py_cc_offspring.loc[py_cc_offspring.Offspring == "GO:0001533"].Parent)
r_ancestors.append("GO:0001533")
r_ancestor_search_terms = ["id: " + ancestor for ancestor in r_ancestors]

def _print_records(search_terms):
    for record in obo_records:
        for search_term in search_terms:
            if search_term in record:
                for line in record.split("\n"):
                    for term in ["id:", "is_a:", "relationship:", "intersection_of:"]:
                        if term in line:
                            print(line)
                print()

print("### Py ancestors\n")
_print_records(py_ancestor_search_terms)

print("### R ancestors\n")
_print_records(r_ancestor_search_terms)

The script above gives the following results:

### Py ancestors

id: GO:0001533
is_a: GO:0005886 ! plasma membrane

id: GO:0005575
alt_id: GO:0008372

id: GO:0005623
is_a: GO:0005575 ! cellular_component

id: GO:0005886
alt_id: GO:0005904
is_a: GO:0016020 ! membrane
is_a: GO:0044464 ! cell part
relationship: part_of GO:0071944 ! cell periphery

id: GO:0012505
is_a: GO:0044464 ! cell part
relationship: has_part GO:0005773 ! vacuole
relationship: has_part GO:0005886 ! plasma membrane

id: GO:0016020
is_a: GO:0005575 ! cellular_component

id: GO:0030313
is_a: GO:0031975 ! envelope
relationship: has_part GO:0005886 ! plasma membrane

id: GO:0031975
is_a: GO:0044464 ! cell part

id: GO:0044464
is_a: GO:0005575 ! cellular_component
intersection_of: GO:0005575 ! cellular_component
intersection_of: part_of GO:0005623 ! cell
relationship: part_of GO:0005623 ! cell

id: GO:0071944
is_a: GO:0044464 ! cell part

id: GO:0097653
is_a: GO:0044464 ! cell part
relationship: has_part GO:0005622 ! intracellular
relationship: has_part GO:0005886 ! plasma membrane

### R ancestors

id: GO:0001533
is_a: GO:0005886 ! plasma membrane

id: GO:0005575
alt_id: GO:0008372

id: GO:0005623
is_a: GO:0005575 ! cellular_component

id: GO:0005886
alt_id: GO:0005904
is_a: GO:0016020 ! membrane
is_a: GO:0044464 ! cell part
relationship: part_of GO:0071944 ! cell periphery

id: GO:0012505
is_a: GO:0044464 ! cell part
relationship: has_part GO:0005773 ! vacuole
relationship: has_part GO:0005886 ! plasma membrane

id: GO:0016020
is_a: GO:0005575 ! cellular_component

id: GO:0030313
is_a: GO:0031975 ! envelope
relationship: has_part GO:0005886 ! plasma membrane

id: GO:0031975
is_a: GO:0044464 ! cell part

id: GO:0044464
is_a: GO:0005575 ! cellular_component
intersection_of: GO:0005575 ! cellular_component
intersection_of: part_of GO:0005623 ! cell
relationship: part_of GO:0005623 ! cell

id: GO:0071944
is_a: GO:0044464 ! cell part

id: GO:0097653
is_a: GO:0044464 ! cell part
relationship: has_part GO:0005622 ! intracellular
relationship: has_part GO:0005886 ! plasma membrane

Since you can see that GO:0001533 is a GO:0005886 which again is a GO:0016020, GO:0016020 should be considered an ancestor of GO:0001533. Unfortunately, that is not the case in GO.db. If anyone knows why, please explain!

Gene-Ontology • 3.2k views
ADD COMMENT
2
Entering edit mode
9.1 years ago

It seems like GO.db only looks at the relationships "is_a" and "part_of" to compute ancestors.

> library(GO.db)
> cc_offspring = as.data.frame(GOCCOFFSPRING)
> colnames(cc_offspring) = c("Offspring", "Parent")
> cc_offspring[cc_offspring["Offspring"] == "GO:0001533",]
       Offspring     Parent
1     GO:0001533 GO:0005886
3404  GO:0001533 GO:0005575
7659  GO:0001533 GO:0016020
9800  GO:0001533 GO:0005623
45693 GO:0001533 GO:0044464
49577 GO:0001533 GO:0071944

If you only use these in go_db, you get the same result:

> from go_db import get_offspring
> cco = get_offspring("CC", ["is_a", "part_of"])
> cco.loc[cco.Offspring == "GO:0001533"]
        Offspring      Parent
722    GO:0001533  GO:0005886
5913   GO:0001533  GO:0016020
6019   GO:0001533  GO:0044464
6136   GO:0001533  GO:0071944
29758  GO:0001533  GO:0005575
30078  GO:0001533  GO:0005623

Will update the docs.

ADD COMMENT

Login before adding your answer.

Traffic: 1026 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6