Go Terms Redundancy Remover
8.9 years ago
Woa ★ 2.8k

Can anybody recommend standalone software for removing redundant GO terms, similar to the REVIGO web server? Is there an R, Perl or Python package available for this kind of work?

Thanks

go ontology • 6.7k views
8.9 years ago

You can use semantic similarity tools to reduce the redundancy of the GO terms associated with your genes of interest. There are two main approaches.

Approach 1: Use a semantic similarity algorithm (see this review if you are new to the concepts, and the section on slimmer tools provided by the GO team), compute the similarities, and retain the terms that give optimal diversity. GO terms are organised as a directed acyclic graph, so a semantic similarity score (ranging from 0 to 1; higher means more similar) can be computed between GO terms, and hence between the GO terms of two genes. Several R packages are available (see GOSemSim or GOSim). You may need to compute all pairwise combinations, which can become expensive depending on the number of annotations available for your gene(s). Alternatively, use a clustering approach as in REVIGO, or any standard data mining algorithm that fits the redundancy criteria you need.
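To make the "compute similarities, then retain a diverse subset" idea concrete, here is a minimal Python sketch of one possible greedy filter. It is a simplification and not the actual REVIGO algorithm. The term names, p-values, similarity scores and the 0.7 cutoff are all made up for illustration; in practice the scores would come from a semantic similarity tool such as the packages mentioned above.

```python
from typing import Dict, FrozenSet, List


def reduce_redundancy(
    pvalues: Dict[str, float],
    similarity: Dict[FrozenSet[str], float],
    cutoff: float = 0.7,
) -> List[str]:
    """Greedily keep terms that are not too similar to any term already kept."""
    kept: List[str] = []
    # Visit the most significant terms first, so they are the ones retained
    # when two terms turn out to be redundant with each other.
    for term in sorted(pvalues, key=pvalues.get):
        is_redundant = any(
            # Pairs missing from the table are treated as unrelated (score 0).
            similarity.get(frozenset((term, other)), 0.0) >= cutoff
            for other in kept
        )
        if not is_redundant:
            kept.append(term)
    return kept


# Hypothetical terms, p-values and pairwise scores (assumptions, not real GO data).
pvalues = {"term_A": 0.001, "term_B": 0.004, "term_C": 0.020, "term_D": 0.030}
similarity = {
    frozenset(("term_A", "term_B")): 0.85,
    frozenset(("term_A", "term_C")): 0.20,
    frozenset(("term_B", "term_C")): 0.30,
    frozenset(("term_C", "term_D")): 0.75,
}

representatives = reduce_redundancy(pvalues, similarity, cutoff=0.7)
print(representatives)
```

With these toy numbers, term_B is dropped in favour of term_A and term_D in favour of term_C. Raising the cutoff keeps more terms; lowering it gives a more aggressive reduction.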

Approach 2: Use a pre-curated "lite" version of the ontology (GO-Lite). This is a reduced representation of GO, semi-automatically curated by the GO team. Using the lite version instead of the full version gives you a smaller set of broader terms to work with.
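As a rough illustration of how annotations get collapsed onto a reduced term set, the Python sketch below walks up a parent map from a term until it reaches a term that belongs to the reduced set. The graph, the term names and the function name `map_to_slim` are hypothetical; a real analysis would load the parent relationships from the ontology file and the reduced set from the curated subset.

```python
from typing import Dict, List, Set


def map_to_slim(term: str, parents: Dict[str, List[str]], slim: Set[str]) -> Set[str]:
    """Return the first slim term met on each upward path starting at `term`."""
    found: Set[str] = set()
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in slim:
            # Stop climbing along this path once a slim term is reached.
            found.add(current)
            continue
        stack.extend(parents.get(current, []))
    return found


# Toy graph with made-up identifiers (an assumption, not real GO structure).
parents = {
    "leaf_1": ["mid_1"],
    "leaf_2": ["mid_1", "mid_2"],
    "mid_1": ["slim_X"],
    "mid_2": ["slim_Y"],
    "slim_X": ["root"],
    "slim_Y": ["root"],
}
slim = {"slim_X", "slim_Y"}

for term in ("leaf_1", "leaf_2"):
    print(term, sorted(map_to_slim(term, parents, slim)))
```

Because the ontology is a graph rather than a tree, a single term can map to more than one reduced term, as leaf_2 does here.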

Sometimes removing redundant terms will leave you with only a global overview of the processes, functions or compartments associated with your gene(s). (IMHO, GO terms are not redundant; they are highly specific terms :)) Choose the appropriate approach after evaluating your needs.

8.9 years ago
Woa ★ 2.8k

Thanks KS. By GO-Lite do you mean the GO-slim categories? I tried to import GO-slim using the biomaRt package in R, with a command like this:

results.slim <- getBM(attributes = c('refseq_mrna', 'goslim_goa_accession', 'goslim_goa_description'), filters = 'refseq_mrna', values = refseq.id, mart = ensembl)


This produces output like the following. I'm interested in Biological Process only; is there any way to pick the broadest or the most specific term from these GO-slim terms?

1    NM_011157           GO:0008150                   biological_process
2    NM_011157           GO:0003674                   molecular_function
3    NM_011157           GO:0005575                   cellular_component
4    NM_011157           GO:0005622                        intracellular
5    NM_011157           GO:0005623                                 cell
6    NM_011157           GO:0005737                            cytoplasm
7    NM_011157           GO:0043226                            organelle
8    NM_011157           GO:0005794                      Golgi apparatus
9    NM_011157           GO:0005773                              vacuole
10   NM_011157           GO:0048856     anatomical structure development
11   NM_011157           GO:0005576                 extracellular region
12   NM_011157           GO:0008219                           cell death
13   NM_011157           GO:0016023 cytoplasmic membrane-bounded vesicle
14   NM_011157           GO:0051604                   protein maturation
15   NM_011157           GO:0005615                  extracellular space
16   NM_011157           GO:0005764                             lysosome