How to determine the terminal GO terms (terminal nodes within GO DAG) within GO terms of interest?
3
4
Entering edit mode
9.9 years ago
oussumenten ▴ 40

I have a set of genes with GO terms assigned to the genes. I want only the terminal GO terms for each gene (terminal nodes within GO DAG). For example the gene "PFC0155c" has the following GO terms assigned to it:

GO:0003677    F    DNA binding
GO:0003899    F    DNA-directed RNA polymerase activity
GO:0005198    F    structural molecule activity
GO:0005622    C    intracellular
GO:0005623    C    cell
GO:0005634    C    nucleus
GO:0005665    C    DNA-directed RNA polymerase II, core complex
GO:0005730    C    nucleolus
GO:0006139    P    nucleobase-containing compound metabolic process
GO:0006351    P    transcription, DNA-templated
GO:0006366    P    transcription from RNA polymerase II promoter
GO:0006725    P    cellular aromatic compound metabolic process
GO:0006807    P    nitrogen compound metabolic process
GO:0008152    P    metabolic process
GO:0009058    P    biosynthetic process
GO:0009059    P    macromolecule biosynthetic process
GO:0009987    P    cellular process
GO:0010467    P    gene expression
GO:0016070    P    RNA metabolic process
GO:0018130    P    heterocycle biosynthetic process
GO:0019438    P    aromatic compound biosynthetic process
GO:0031974    C    membrane-enclosed lumen
GO:0031981    C    nuclear lumen
GO:0032774    P    RNA biosynthetic process
GO:0032991    C    macromolecular complex
GO:0034641    P    cellular nitrogen compound metabolic process
GO:0034645    P    cellular macromolecule biosynthetic process
GO:0034654    P    nucleobase-containing compound biosynthetic process
GO:0043170    P    macromolecule metabolic process
GO:0043226    C    organelle
GO:0043227    C    membrane-bounded organelle
GO:0043229    C    intracellular organelle
GO:0043231    C    intracellular membrane-bounded organelle
GO:0043233    C    organelle lumen
GO:0043234    C    protein complex
GO:0044237    P    cellular metabolic process
GO:0044238    P    primary metabolic process
GO:0044249    P    cellular biosynthetic process
GO:0044260    P    cellular macromolecule metabolic process
GO:0044271    P    cellular nitrogen compound biosynthetic process
GO:0044422    C    organelle part
GO:0044424    C    intracellular part
GO:0044428    C    nuclear part
GO:0044446    C    intracellular organelle part
GO:0044464    C    cell part
GO:0046483    P    heterocycle metabolic process
GO:0070013    C    intracellular organelle lumen
GO:0071704    P    organic substance metabolic process
GO:0090304    P    nucleic acid metabolic process
GO:1901360    P    organic cyclic compound metabolic process
GO:1901362    P    organic cyclic compound biosynthetic process
GO:1901576    P    organic substance biosynthetic process
GO:1902494    C    catalytic complex
GO:1990234    C    transferase complex

When the above assigned GO terms are visualize in GO DAG (see picture below, colored boxes represent GO terms assigned to gene "PFC0155c"), there are 6 terminal GO terms (Reddish Ovals in picture) for the Biological Processes, Cellular Components, and Molecular Functions ontologies:

PFC0155c

GO:0003677    F    DNA binding
GO:0005198    F    structural molecule activity
GO:0003899    F    DNA-directed RNA polymerase activity
GO:0006366    P    transcription from RNA polymerase II promoter
GO:0005730    C    nucleolus
GO:0005665    C    DNA-directed RNA polymerase II, core complex

The 6 terminal GO terms are the ones I want from the initial assigned list. I have 1600+ genes for which I want terminal GO terms from assigned GO terms.

Is there a way to automate this? Any ideas are welcomed. I would prefer to keep the association.

Thanks.

GO-DAG Terminal-node GO-terms • 5.4k views
ADD COMMENT
7
Entering edit mode
9.9 years ago
Martin Morgan ★ 1.6k

Bioconductor has 'annotation' packages, including GO.db to represent the GO DAG. Here we install it (once)

source("http://bioconductor.org")
biocLite("GO.db")

then load the package for use

library(GO.db)

If we have a vector of GO ids from a particular ontology, e.g., from the Cellular Compartment (CC) ontology

terms = c("GO:0005622", "GO:0005623", "GO:0005634", "GO:0005665", "GO:0005730", 
  "GO:0031974", "GO:0031981", "GO:0032991", "GO:0043226", "GO:0043227", 
  "GO:0043229", "GO:0043231", "GO:0043233", "GO:0043234", "GO:0044422", 
  "GO:0044424", "GO:0044428", "GO:0044446", "GO:0044464", "GO:0070013", 
  "GO:1902494", "GO:1990234")

Then the following function finds the terminal nodes

terminal =
    function(terms, ontology=c("C", "P", "F"))
{
    FUN <- switch(match.arg(ontology),  C=GOCCPARENTS,
        P=GOBPPARENTS, F=GOMFPARENTS)
    terminal <- terms
    seen <- c(terms, "all")
    while (length(terms)) {
        seen <- c(terms, seen)
        terms <- mappedRkeys(FUN[terms])
        terminal <- terminal[!terminal %in% terms]
        terms <- terms[!terms %in% seen]
    }
    terminal
}

For instance,

> terminal(terms, "CC")
[1] "GO:0005665" "GO:0005730"

The function works by looking up the parents of each term (mappedRkeys(GOCCPARENTS[terms])) and then excluding from any of the original terms that appears as a parent. The switch() statement allows handling of different ontologies. The GO.db could be queried for other information, e.g., the description of each term

for (term in terminal(terms, "CC")) print(GOTERM[[term]])

To process the entire data, I arranged to create a data.frame (using read.delim(); it's a little tricky because of the combination of white space and commas in the term description column) with three columns corresponding to the above data

> head(df)
          V1 V2                                   V3
1 GO:0003677  F                          DNA binding
2 GO:0003899  F DNA-directed RNA polymerase activity
3 GO:0005198  F         structural molecule activity
4 GO:0005622  C                        intracellular
5 GO:0005623  C                                 cell
6 GO:0005634  C                              nucleus

I then split the GO terms by ontology, and applied the terminal() function to each ontology

goByOnto <- split(df$V1, df$V2)
goids <- Map(terminal, goByOnto, names(goByOnto))

and created a table summarizing the results

result <- select(GO.db, goids, c("ONTOLOGY", "TERM"))

These are

> result
        GOID ONTOLOGY                                          TERM
1 GO:0005665       CC  DNA-directed RNA polymerase II, core complex
2 GO:0005730       CC                                     nucleolus
3 GO:0003677       MF                                   DNA binding
4 GO:0003899       MF          DNA-directed RNA polymerase activity
5 GO:0005198       MF                  structural molecule activity
6 GO:0006366       BP transcription from RNA polymerase II promoter

and can be saved to a file with write.table(result, "my.txt").

ADD COMMENT
0
Entering edit mode

Hi Martin,

This is helpful, it should work. So for CC=GOCCPARENTS, BP=GOBPPARENTS, MF=GOMFPARENTS, is this for each GO terms and their ancestral/parent GO terms? If so can I get this information in a file?

Thanks.

ADD REPLY
0
Entering edit mode

How did you created the object "db"?

ADD REPLY
3
Entering edit mode
9.9 years ago

I wrote a XSLT stylesheet that count the number of children ("is_a") for each go term. It is slow but it simple and does the job:

https://github.com/lindenb/xslt-sandbox/blob/master/stylesheets/bio/go/go2countchildren.xsl

curl  "http://archive.geneontology.org/latest-termdb/go_daily-termdb.rdf-xml.gz" |\
    gunzip -c |\
    xsltproc --novalid go2countchildren.xsl go.rdf - > count.tsv

result:

#ACN    NAME    CHILDREN
GO:0000001    mitochondrion inheritance    0
GO:0000002    mitochondrial genome maintenance    1
GO:0000003    reproduction    4
GO:0000005    ribosomal chaperone activity    0
GO:0042254    ribosome biogenesis    0
GO:0044183    protein binding involved in protein folding    0
GO:0051082    unfolded protein binding    0
GO:0000006    high affinity zinc uptake transmembrane transporter activity    0
GO:0000007    low-affinity zinc ion transmembrane transporter activity    0
GO:0000008    thioredoxin    0
GO:0003756    protein disulfide isomerase activity    0
GO:0015036    disulfide oxidoreductase activity    2

the lines ending with '0' are the terminal terms. E.g: http://www.ebi.ac.uk/QuickGO/GTerm?id=GO:0003756#term=children

you can cross this information with your files using linux/join.

P.

ADD COMMENT

Login before adding your answer.

Traffic: 2195 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6