Can anybody recommend standalone software for removing redundant GO terms, similar to the REVIGO web server? Is there an R, Perl, or Python package available for this kind of work?
Thanks
You can use semantic similarity tools to reduce the redundancy of the GO terms associated with your genes of interest. There are two approaches.
Approach 1: Use a semantic similarity algorithm (see this review if you are new to the concepts, and the section on slimmer tools provided by the GO team), compute the similarities, and retain the terms with optimal diversity. GO terms are organized as a directed acyclic graph, so a semantic similarity score (ranging from 0 to 1; higher means more similar) can be computed between the GO terms of two genes. Several R packages are available (see GOSemSim or GOSim). You may need to compute all pairwise combinations, which can be expensive depending on the number of annotations available for your gene(s). Alternatively, use a clustering approach as in REVIGO, or any standard data mining algorithm that fits the redundancy criteria you need.
Approach 2: Use a pre-curated "lite" version of the ontology (GO-Lite). This is a reduced representation of GO terms, semi-automatically curated by the GO team. Using the lite version instead of the full version gives you a smaller set of broader terms.
Sometimes removing redundant terms will leave you with just a global overview of the processes, functions, or compartments associated with your gene(s) (IMHO, GO terms are not redundant; they are highly specific terms :)). Choose the appropriate approach after evaluating your needs.
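To make Approach 1 more concrete, here is a minimal base R sketch of a greedy redundancy filter. It is a simplified illustration, not REVIGO's exact procedure. The function name reduce_terms, the term labels, the similarity values, and the scores are all made up for the example; in practice the similarity matrix would come from a semantic similarity package such as GOSemSim, and the scores could be enrichment p-values.

```r
# Greedy filter: visit terms from best to worst score and keep a term
# only if it is not too similar to any term that was already kept.
# `sim`    : square similarity matrix with term IDs as row/column names
# `scores` : named numeric vector, smaller = better (e.g. p-values)
# `cutoff` : similarity at or above which two terms count as redundant
reduce_terms <- function(sim, scores, cutoff = 0.7) {
  stopifnot(is.matrix(sim), identical(rownames(sim), colnames(sim)))
  stopifnot(all(rownames(sim) %in% names(scores)))
  ordered <- names(sort(scores[rownames(sim)]))
  kept <- character(0)
  for (term in ordered) {
    if (length(kept) == 0 || all(sim[term, kept] < cutoff)) {
      kept <- c(kept, term)
    }
  }
  kept
}

# Toy input (hypothetical labels and values, for illustration only).
terms <- c("term_A", "term_B", "term_C", "term_D")
sim <- matrix(c(1.0, 0.9, 0.2, 0.1,
                0.9, 1.0, 0.3, 0.2,
                0.2, 0.3, 1.0, 0.8,
                0.1, 0.2, 0.8, 1.0),
              nrow = 4, byrow = TRUE,
              dimnames = list(terms, terms))
scores <- c(term_A = 0.001, term_B = 0.01, term_C = 0.03, term_D = 0.002)

kept <- reduce_terms(sim, scores, cutoff = 0.7)
print(kept)
```

With these toy values, term_B is dropped because it is too similar to term_A, and term_C is dropped because it is too similar to term_D. The cutoff is a judgment call: a lower value removes more terms and leaves a coarser summary.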
Thanks KS. By GO-Lite do you mean the GO-slim category? I tried to import GO-slim using the biomaRt package in R, with a command like this:
results.slim <- getBM(attributes = c('refseq_mrna', 'goslim_goa_accession', 'goslim_goa_description'), filters = 'refseq_mrna', values = refseq.id, mart = ensembl)
which produces output like the following. I'm interested in Biological Process only. Is there any way to pick the broadest or most specific term from these GO-slim terms?
1 NM_011157 GO:0008150 biological_process
2 NM_011157 GO:0003674 molecular_function
3 NM_011157 GO:0005575 cellular_component
4 NM_011157 GO:0005622 intracellular
5 NM_011157 GO:0005623 cell
6 NM_011157 GO:0005737 cytoplasm
7 NM_011157 GO:0043226 organelle
8 NM_011157 GO:0005794 Golgi apparatus
9 NM_011157 GO:0005773 vacuole
10 NM_011157 GO:0048856 anatomical structure development
11 NM_011157 GO:0005576 extracellular region
12 NM_011157 GO:0008219 cell death
13 NM_011157 GO:0016023 cytoplasmic membrane-bounded vesicle
14 NM_011157 GO:0051604 protein maturation
15 NM_011157 GO:0005615 extracellular space
16 NM_011157 GO:0005764 lysosome
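One possible way to approach this is sketched below in base R, using a few rows copied from the output above. The namespace lookup and the ancestor list are hand-written assumptions for illustration: the ancestor list is deliberately partial and only records that GO:0008150 (the biological_process root) is an ancestor of every other biological process term. In a real analysis both would come from an annotation resource such as the GO.db package.

```r
# A few rows copied from the output above.
slim <- data.frame(
  refseq_mrna = "NM_011157",
  go_id = c("GO:0008150", "GO:0003674", "GO:0005575", "GO:0005794",
            "GO:0048856", "GO:0008219", "GO:0051604"),
  description = c("biological_process", "molecular_function",
                  "cellular_component", "Golgi apparatus",
                  "anatomical structure development", "cell death",
                  "protein maturation"),
  stringsAsFactors = FALSE
)

# Hand-written namespace lookup (assumption: normally taken from an
# annotation resource rather than typed in).
namespace <- c("GO:0008150" = "biological_process",
               "GO:0003674" = "molecular_function",
               "GO:0005575" = "cellular_component",
               "GO:0005794" = "cellular_component",
               "GO:0048856" = "biological_process",
               "GO:0008219" = "biological_process",
               "GO:0051604" = "biological_process")
slim$namespace <- unname(namespace[slim$go_id])

# Step 1: keep Biological Process terms only.
bp <- slim[slim$namespace == "biological_process", ]

# Partial ancestor list (assumption): only the root relationship is
# recorded here; real ancestor sets contain more terms.
ancestors <- list("GO:0008150" = character(0),
                  "GO:0048856" = "GO:0008150",
                  "GO:0008219" = "GO:0008150",
                  "GO:0051604" = "GO:0008150")

# Step 2: a term is "most specific" within this set if it is not an
# ancestor of any other term in the set.
is_ancestor_of_other <- vapply(bp$go_id, function(id) {
  others <- setdiff(bp$go_id, id)
  any(vapply(ancestors[others], function(a) id %in% a, logical(1)))
}, logical(1))
most_specific <- bp[!is_ancestor_of_other, ]

# Step 3: a term is "broadest" within this set if none of its
# ancestors is also in the set.
has_ancestor_in_set <- vapply(bp$go_id, function(id) {
  any(ancestors[[id]] %in% bp$go_id)
}, logical(1))
broadest <- bp[!has_ancestor_in_set, ]

print(most_specific[, c("go_id", "description")])
print(broadest[, c("go_id", "description")])
```

In this toy set the broadest term is the root itself, which is rarely informative, so dropping GO:0008150 before ranking may be the more useful choice. With complete ancestor sets the same two filters would also separate intermediate terms from leaf terms.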