Question

How to determine the terminal GO terms (terminal nodes within GO DAG) within GO terms of interest?

4

Entering edit mode

9.9 years ago

oussumenten ▴ 40

I have a set of genes with GO terms assigned to the genes. I want only the terminal GO terms for each gene (terminal nodes within GO DAG). For example the gene "PFC0155c" has the following GO terms assigned to it:

GO:0003677    F    DNA binding
GO:0003899    F    DNA-directed RNA polymerase activity
GO:0005198    F    structural molecule activity
GO:0005622    C    intracellular
GO:0005623    C    cell
GO:0005634    C    nucleus
GO:0005665    C    DNA-directed RNA polymerase II, core complex
GO:0005730    C    nucleolus
GO:0006139    P    nucleobase-containing compound metabolic process
GO:0006351    P    transcription, DNA-templated
GO:0006366    P    transcription from RNA polymerase II promoter
GO:0006725    P    cellular aromatic compound metabolic process
GO:0006807    P    nitrogen compound metabolic process
GO:0008152    P    metabolic process
GO:0009058    P    biosynthetic process
GO:0009059    P    macromolecule biosynthetic process
GO:0009987    P    cellular process
GO:0010467    P    gene expression
GO:0016070    P    RNA metabolic process
GO:0018130    P    heterocycle biosynthetic process
GO:0019438    P    aromatic compound biosynthetic process
GO:0031974    C    membrane-enclosed lumen
GO:0031981    C    nuclear lumen
GO:0032774    P    RNA biosynthetic process
GO:0032991    C    macromolecular complex
GO:0034641    P    cellular nitrogen compound metabolic process
GO:0034645    P    cellular macromolecule biosynthetic process
GO:0034654    P    nucleobase-containing compound biosynthetic process
GO:0043170    P    macromolecule metabolic process
GO:0043226    C    organelle
GO:0043227    C    membrane-bounded organelle
GO:0043229    C    intracellular organelle
GO:0043231    C    intracellular membrane-bounded organelle
GO:0043233    C    organelle lumen
GO:0043234    C    protein complex
GO:0044237    P    cellular metabolic process
GO:0044238    P    primary metabolic process
GO:0044249    P    cellular biosynthetic process
GO:0044260    P    cellular macromolecule metabolic process
GO:0044271    P    cellular nitrogen compound biosynthetic process
GO:0044422    C    organelle part
GO:0044424    C    intracellular part
GO:0044428    C    nuclear part
GO:0044446    C    intracellular organelle part
GO:0044464    C    cell part
GO:0046483    P    heterocycle metabolic process
GO:0070013    C    intracellular organelle lumen
GO:0071704    P    organic substance metabolic process
GO:0090304    P    nucleic acid metabolic process
GO:1901360    P    organic cyclic compound metabolic process
GO:1901362    P    organic cyclic compound biosynthetic process
GO:1901576    P    organic substance biosynthetic process
GO:1902494    C    catalytic complex
GO:1990234    C    transferase complex

When the above assigned GO terms are visualize in GO DAG (see picture below, colored boxes represent GO terms assigned to gene "PFC0155c"), there are 6 terminal GO terms (Reddish Ovals in picture) for the Biological Processes, Cellular Components, and Molecular Functions ontologies:

GO:0003677    F    DNA binding
GO:0005198    F    structural molecule activity
GO:0003899    F    DNA-directed RNA polymerase activity
GO:0006366    P    transcription from RNA polymerase II promoter
GO:0005730    C    nucleolus
GO:0005665    C    DNA-directed RNA polymerase II, core complex

The 6 terminal GO terms are the ones I want from the initial assigned list. I have 1600+ genes for which I want terminal GO terms from assigned GO terms.

Is there a way to automate this? Any ideas are welcomed. I would prefer to keep the association.

Thanks.

GO-DAG Terminal-node GO-terms • 5.4k views

ADD COMMENT • link updated 2.5 years ago by Ram 43k • written 9.9 years ago by oussumenten ▴ 40

Ram · Answer 1 · 2014-06-06

Bioconductor has 'annotation' packages, including GO.db to represent the GO DAG. Here we install it (once)

source("http://bioconductor.org")
biocLite("GO.db")

then load the package for use

library(GO.db)

If we have a vector of GO ids from a particular ontology, e.g., from the Cellular Compartment (CC) ontology

terms = c("GO:0005622", "GO:0005623", "GO:0005634", "GO:0005665", "GO:0005730", 
  "GO:0031974", "GO:0031981", "GO:0032991", "GO:0043226", "GO:0043227", 
  "GO:0043229", "GO:0043231", "GO:0043233", "GO:0043234", "GO:0044422", 
  "GO:0044424", "GO:0044428", "GO:0044446", "GO:0044464", "GO:0070013", 
  "GO:1902494", "GO:1990234")

Then the following function finds the terminal nodes

terminal =
    function(terms, ontology=c("C", "P", "F"))
{
    FUN <- switch(match.arg(ontology),  C=GOCCPARENTS,
        P=GOBPPARENTS, F=GOMFPARENTS)
    terminal <- terms
    seen <- c(terms, "all")
    while (length(terms)) {
        seen <- c(terms, seen)
        terms <- mappedRkeys(FUN[terms])
        terminal <- terminal[!terminal %in% terms]
        terms <- terms[!terms %in% seen]
    }
    terminal
}

For instance,

> terminal(terms, "CC")
[1] "GO:0005665" "GO:0005730"

The function works by looking up the parents of each term (mappedRkeys(GOCCPARENTS[terms])) and then excluding from any of the original terms that appears as a parent. The switch() statement allows handling of different ontologies. The GO.db could be queried for other information, e.g., the description of each term

for (term in terminal(terms, "CC")) print(GOTERM[[term]])

To process the entire data, I arranged to create a data.frame (using read.delim(); it's a little tricky because of the combination of white space and commas in the term description column) with three columns corresponding to the above data

> head(df)
          V1 V2                                   V3
1 GO:0003677  F                          DNA binding
2 GO:0003899  F DNA-directed RNA polymerase activity
3 GO:0005198  F         structural molecule activity
4 GO:0005622  C                        intracellular
5 GO:0005623  C                                 cell
6 GO:0005634  C                              nucleus

I then split the GO terms by ontology, and applied the terminal() function to each ontology

goByOnto <- split(df$V1, df$V2)
goids <- Map(terminal, goByOnto, names(goByOnto))

and created a table summarizing the results

result <- select(GO.db, goids, c("ONTOLOGY", "TERM"))

These are

> result
        GOID ONTOLOGY                                          TERM
1 GO:0005665       CC  DNA-directed RNA polymerase II, core complex
2 GO:0005730       CC                                     nucleolus
3 GO:0003677       MF                                   DNA binding
4 GO:0003899       MF          DNA-directed RNA polymerase activity
5 GO:0005198       MF                  structural molecule activity
6 GO:0006366       BP transcription from RNA polymerase II promoter

and can be saved to a file with write.table(result, "my.txt").

Ram · Answer 2 · 2014-06-06

I wrote a XSLT stylesheet that count the number of children ("is_a") for each go term. It is slow but it simple and does the job:

https://github.com/lindenb/xslt-sandbox/blob/master/stylesheets/bio/go/go2countchildren.xsl

curl  "http://archive.geneontology.org/latest-termdb/go_daily-termdb.rdf-xml.gz" |\
    gunzip -c |\
    xsltproc --novalid go2countchildren.xsl go.rdf - > count.tsv

result:

#ACN    NAME    CHILDREN
GO:0000001    mitochondrion inheritance    0
GO:0000002    mitochondrial genome maintenance    1
GO:0000003    reproduction    4
GO:0000005    ribosomal chaperone activity    0
GO:0042254    ribosome biogenesis    0
GO:0044183    protein binding involved in protein folding    0
GO:0051082    unfolded protein binding    0
GO:0000006    high affinity zinc uptake transmembrane transporter activity    0
GO:0000007    low-affinity zinc ion transmembrane transporter activity    0
GO:0000008    thioredoxin    0
GO:0003756    protein disulfide isomerase activity    0
GO:0015036    disulfide oxidoreductase activity    2

the lines ending with '0' are the terminal terms. E.g: http://www.ebi.ac.uk/QuickGO/GTerm?id=GO:0003756#term=children

you can cross this information with your files using linux/join.

P.