Question

Comparing lists of GOs (not genes)

0

4.4 years ago

dpearton • 0

Hello,

I have conducted an RNA-Seq experiment with a non-model system with a reference transcriptome that I have annotated with GO terms. In the end, what I have is a number of lists of GO terms (derived from the reference) that are differentially expressed between the various conditions (six conditions).

I am looking for a tool that will compare the relative presence of GO categories (and look for enrichment) in the different conditions compared to the distribution of GO categories in the reference genome.

I had the idea that what I am trying to do should be relatively simple, but I am finding it very difficult to find a tool that will do this directly on GO terms. They all ask for lists of genes (or UniProt IDs, etc.) from which they derive a list of GOs and then do the analysis. I would like to skip the first step and work directly on the GOs (I have already done the selection based on PD, FDR, fold change, etc.).

Does anyone know of a website, R package, etc. that accepts lists of GOs rather than gene lists as input?

Thank you.

GO; RNA-Seq • 1.4k views

ADD COMMENT • link updated 4.4 years ago by Lluís R. ★ 1.2k • written 4.4 years ago by dpearton • 0

0

What you're asking is not clear. If you derived lists of GO terms from comparing gene expression between different conditions, you should have a list for each pair of conditions you've compared, not a list per condition. Could you clarify what you're doing and aiming at? Also have a look at this previous post on the same topic.

ADD REPLY • link 4.4 years ago by Jean-Karim Heriche 27k

score 1 · Answer 1 · 2019-11-26

1

4.4 years ago

Lluís R. ★ 1.2k

You can use GOSemSim to compare lists of GO terms. It compares them based on the semantic similarity of the terms.

ADD COMMENT • link 4.4 years ago by Lluís R. ★ 1.2k

score 0 · Answer 2 · 2019-11-26

0

4.4 years ago

dpearton • 0

Thank you for your reply,

Do not worry about the fact that I have multiple lists of genes. That is a consequence of the fact that I am looking at (a) multiple stressors and (b) multiple levels of some of the stressors. While these can be looked at as a series of pairwise comparisons, an alternative is to look at patterns of gene expression over the different conditions. This gives a list of genes (from which I have derived lists of GOs), and I would like to compare each list of GOs to the reference GO list, similar to what would be done with PANTHER, GOrilla, etc., to look at GO categories that might be over- (or under-) represented in each list.

So, if we look at a simplified normal workflow (e.g. PANTHER) we would:

1. Upload a list of differentially expressed genes (or multiple lists: up- vs down-regulated, etc.) in UniProt (or equivalent) format.
2. The program will derive a list of GO terms from those genes; and
3. Compare the list of GOs (and/or GO categories) to a reference (either pre-compiled or user-supplied) to look for enriched (or depauperate) categories of GOs.

I would simply like to find an analysis suite that allows me to jump in at step 2 - i.e. I supply the list(s) of GOs and the reference list...

I've not been able to find a tool that does this so I'm looking for pointers to tools that could help me do this.

Thank you.

ADD COMMENT • link 4.4 years ago by dpearton • 0

0

Please use the 'add reply' button to reply to a comment. Don't create an answer unless it addresses the question as otherwise the question appears answered when it's not.

The problem with comparing lists of GO terms is how to deal with the graph structure of the ontology. One way is to reduce the ontology to a controlled vocabulary, i.e. a flat list of terms of interest, and compute similarity using, e.g., the Jaccard index. Another approach, which would take the graph structure into account, is to combine the semantic similarities between terms in the two lists.
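As a rough illustration of the flat-vocabulary option, here is a minimal Python sketch of the Jaccard index between two lists of GO IDs. The function name and the example IDs are made up for illustration; duplicates are collapsed, so any frequency information in the lists is ignored.

```python
def jaccard(terms_a, terms_b):
    """Jaccard index of two collections of GO IDs, treated as flat sets."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty lists
    return len(a & b) / len(a | b)


# Illustrative IDs only; not taken from the poster's data.
pattern_1 = ["GO:0006813", "GO:0006986", "GO:0005634"]
pattern_2 = ["GO:0006813", "GO:0005634", "GO:0016020", "GO:0005737"]

# Two shared IDs out of five distinct IDs overall.
print(jaccard(pattern_1, pattern_2))
```

Because the ontology is ignored, a term and its parent count as completely different here, which is exactly the limitation described above.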

ADD REPLY • link 4.4 years ago by Jean-Karim Heriche 27k

0

Odd, I thought I had used the "Add reply" button, but I got kicked to the login screen and when it returned me to the reply it must have changed the category.

Anyway, certainly I would like to take the graph structure into account which is why I am looking at tools such as PANTHER, REVIGO, etc. I.e. I would like to look at GO categories (not just GO terms) that are enriched (as these would be more likely to have biological relevance). This is why I would like to be able to use the already existing tools that do just that, without having to reinvent the wheel...

ADD REPLY • link 4.4 years ago by dpearton • 0

0

How do you define GO categories? It seems to me that you're thinking along the lines of a flat vocabulary. Also, enrichment analysis involves counting occurrences. How would you do this with lists of GO terms? Presumably each term is present only once in a list but could be represented by a parent term in another list. Comparing lists of GO terms while taking the ontology structure into account involves combining the semantic similarities of the terms between the two lists being considered. As Lluís R. suggests, you can use the R Bioconductor package GOSemSim for this.
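To make "combining the semantic similarities" concrete, here is a small Python sketch of one common combination scheme, the best-match average: each term is paired with its most similar term in the other list, and those best scores are averaged. The pairwise similarity table below is entirely hypothetical (placeholder IDs and made-up values); in practice the term-to-term scores would come from a semantic similarity tool such as GOSemSim.

```python
def best_match_average(list_a, list_b, sim):
    """Combine pairwise term similarities into one list-vs-list score."""
    if not list_a or not list_b:
        raise ValueError("both lists must be non-empty")
    # Best partner in the other list for every term, in both directions.
    best_a = [max(sim(a, b) for b in list_b) for a in list_a]
    best_b = [max(sim(a, b) for a in list_a) for b in list_b]
    return (sum(best_a) + sum(best_b)) / (len(best_a) + len(best_b))


# Hypothetical pairwise similarities between placeholder terms.
TOY_SIM = {
    frozenset(("GO:A", "GO:B")): 0.8,
    frozenset(("GO:A", "GO:C")): 0.2,
    frozenset(("GO:B", "GO:C")): 0.5,
}


def toy_sim(x, y):
    """Look up a toy similarity; identical terms score 1, unknown pairs 0."""
    if x == y:
        return 1.0
    return TOY_SIM.get(frozenset((x, y)), 0.0)


score = best_match_average(["GO:A", "GO:B"], ["GO:B", "GO:C"], toy_sim)
print(score)
```

This yields a single similarity score per pair of lists; it says how alike two lists are, not whether any term is over-represented.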

ADD REPLY • link 4.4 years ago by Jean-Karim Heriche 27k

0

I've realised that my terminology might be what is causing the issue here. I've been using GO terms when I should probably have been using the less ambiguous terminology, GO ID.

So: Each list of GO IDs is derived from a list of differentially expressed genes based on a particular expression pattern. So, for example we have stressor A with three levels 1, 2, 3. This will give us 6 simple patterns:

Pattern 1: genes with the same expression (-) in 1A, 1B and 1C

i.e.

1A -

1B -

1C -

Pattern 2: genes with different (up or down) expression (+) in 1A, but the same expression in 1B & 1C i.e.

1A +

1B -

1C -

etc... Each pattern is associated with a list of genes and many of those genes have been associated with one or more GO IDs. Given that many genes have similar function there will be both repeated GO IDs (if the gene is involved in the same pathway or function) and GO IDs that are closely related i.e. parent-child.

I have extracted the GO IDs associated with each of the genes in each of the patterns to give:

(1) lists of GO IDs - these contain all of the GO IDs associated with all of the genes in the specific list. There can be (and are) multiple identical GO IDs in each list. Thus, for example, in a particular pattern the GO ID GO:0006813 is present five times while in the reference list it is present 65 times.

(2) The reference list containing the list of all the GO IDs in the annotated reference database (with repeated GO IDs representing the frequency in which they are found in the original annotations).

It would be relatively easy to look at relative enrichment of just the GO IDs given that their frequencies are there (count the relative number of each ID in each list compared to the relative number in the reference) but, as you say, that will not use any of the parent-child information nor will it cluster related GOs.
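A minimal Python sketch of that per-ID comparison, using a one-sided hypergeometric test built from the standard library. The function names are invented for illustration. The counts 5 and 65 for GO:0006813 come from the example above, but the list totals (200 and 10,000) and the placeholder ID are assumptions. It also treats every annotation as an independent draw, which is a simplification, since the usual tools count genes rather than annotations.

```python
from collections import Counter
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when drawing n items from N, of which K are 'successes'."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def per_id_enrichment(pattern_ids, reference_ids):
    """Map each GO ID to (count in pattern, count in reference, p-value).

    Assumes the pattern list is drawn from the reference list.
    """
    pattern, reference = Counter(pattern_ids), Counter(reference_ids)
    n, N = sum(pattern.values()), sum(reference.values())
    results = {}
    for go_id, k in pattern.items():
        K = reference[go_id]
        results[go_id] = (k, K, hypergeom_upper_tail(k, n, K, N))
    return results


# Assumed totals: 200 annotations in the pattern, 10,000 in the reference.
# "GO:other" is a placeholder standing in for all remaining IDs.
reference_ids = ["GO:0006813"] * 65 + ["GO:other"] * 9935
pattern_ids = ["GO:0006813"] * 5 + ["GO:other"] * 195

results = per_id_enrichment(pattern_ids, reference_ids)
print(results["GO:0006813"])
```

With many GO IDs tested at once, the resulting p-values would still need a multiple-testing correction.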

So, what I have is: lists of GO IDs that contain frequency information in the number of times a particular GO is repeated in the list, and a reference list of GO IDs (also with implicit frequency information). I would like to know what processes are over- (or under-) represented in each. For example, is there an over-representation of the "response to unfolded protein" pathway/term in any particular pattern? I.e. cluster the GO IDs in each list based on shared properties (process, location, etc.) and see if those clusters are enriched or not.

As far as I can tell, a list of GO IDs is the intermediate output of programs like PANTHER, DAVID, clusterProfiler, etc., but I just can't figure out how to input this into any of these (or similar) programs. I will look at GOSemSim. I see that it clusters GO IDs based on their similarity, but will it enable me to form clusters from those similar IDs and look at enrichment of those clusters?
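One way to picture grouping GO IDs by shared properties is to roll each occurrence up to its ancestor terms, so that a parent accumulates the counts of all its descendants. The Python sketch below does this on a toy child-to-parents map with placeholder IDs; the map, the IDs and the function names are all hypothetical, and a real analysis would load the relationships from the ontology itself.

```python
from collections import Counter


def ancestors(term, parents):
    """All ancestors of a term in a DAG given a child -> parents mapping."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def roll_up(go_ids, parents):
    """Count each occurrence toward the term itself and every ancestor."""
    counts = Counter()
    for term in go_ids:
        counts[term] += 1
        for anc in ancestors(term, parents):
            counts[anc] += 1
    return counts


# Hypothetical hierarchy: T3 and T4 are children of T2, which is a child of T1.
TOY_PARENTS = {
    "GO:T3": ["GO:T2"],
    "GO:T4": ["GO:T2"],
    "GO:T2": ["GO:T1"],
}

rolled = roll_up(["GO:T3", "GO:T3", "GO:T4"], TOY_PARENTS)
print(rolled)
```

Applying the same roll-up to a pattern list and to the reference list gives counts for higher-level terms that can then be compared. Note that if a gene is annotated with both a term and its parent, this simple version counts it twice for the parent, so rolling up per gene would be safer where the gene-to-GO mapping is still available.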

I hope this clarifies things.

ADD REPLY • link 4.4 years ago by dpearton • 0

0

How do you define enrichment of clusters of grouped GO IDs? The more similar the IDs are to each other, the earlier your clusters will merge.

ADD REPLY • link 4.4 years ago by Lluís R. ★ 1.2k