Question: Significance in gene ontology terms
The Last Word180 wrote:

When we do gene ontology analysis, we get a list of biological processes, cellular components and molecular functions with different significance values. What do these significance values mean exactly? I did gene ontology analysis with a set of genes, and the CC term cytoplasm came out more significant than the MF term nucleic acid binding. Does that mean that more genes from my list are present in the cytoplasm than participate in nucleic acid binding? Does it mean that there is more research evidence pointing to genes being present in the cytoplasm as opposed to taking part in nucleic acid binding? Or does it mean something else? I used DAVID for my gene ontology analysis.

ontology • 557 views
modified 14 months ago by jared.andrews07 (5.0k) • written 14 months ago by The Last Word (180)

Just approaching this from a logical point of view, it seems to me that your input, a gene list, is being compared to a collection of gene lists, each of which is associated with a GO term. Would that not mean that the scores are essentially a measure of the overlap between the two lists? Also, I think significance values should not be comparable across the three ontologies (BP/CC/MF), but I might be mistaken - there might be a normalization step in the scoring process. However, would you not want the most probable CC, MF and BP terms? Why compare between them if that's the case?

written 14 months ago by RamRS (25k)

I am just getting a list of all the gene ontology terms that DAVID reports above a certain significance threshold. Of course, comparing significance values between terms is not what I want or plan to do, but this question struck me when I saw the list of significant ontology terms.

written 14 months ago by The Last Word (180)

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted.

written 14 months ago by Kevin Blighe (54k)
jared.andrews075.0k wrote:

Pretty much all gene ontology analysis works the same way and answers the same question: for a given GO term, are the genes in your list associated with that term at a significantly higher rate than in the background? Your background could be all genes or a subset of genes, depending on what exactly you're trying to determine. For example, let's take your nucleic acid binding term in humans. There are roughly 4,000 protein-coding genes associated with the term, a background frequency of 20% if we round the number of protein-coding genes to 20,000.

Let's say you have a list of 100 genes that are down-regulated after treatment with a compound, and you want to determine what biological processes may be disrupted. The analysis will go through all the terms and, for each one, compare the frequency of genes associated with that term in your list against the frequency in the background to calculate a p-value (usually reported alongside a multiple-testing-adjusted value such as an FDR or q-value). Let's say 80 of your genes are associated with the nucleic acid binding term. That is 80% in your list versus 20% in the background, an increase that is very unlikely to be due to chance. The GO analysis would report that term as significantly enriched in your test set.
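To make the arithmetic concrete, here is a minimal sketch of that comparison as a one-sided hypergeometric test, using the rounded numbers from the example above. This is an assumption about the general approach, not DAVID's exact method: DAVID reports a modified, more conservative Fisher exact p-value (the EASE score), and the function name below is made up for illustration.

```python
from math import comb

def enrichment_p_value(hits, list_size, term_size, background_size):
    """One-sided hypergeometric tail: the probability of seeing at least
    `hits` term-associated genes when drawing `list_size` genes at random
    (without replacement) from a background of `background_size` genes,
    of which `term_size` carry the term."""
    upper = min(list_size, term_size)
    favourable = sum(
        comb(term_size, i) * comb(background_size - term_size, list_size - i)
        for i in range(hits, upper + 1)
    )
    return favourable / comb(background_size, list_size)

# Numbers from the example: 20,000 background genes, 4,000 annotated with
# "nucleic acid binding", 100 genes in the test list, 80 of them annotated.
p = enrichment_p_value(hits=80, list_size=100, term_size=4000, background_size=20000)
print(f"p-value: {p:.3g}")
```

The p-value is tiny because 80 hits is far above the roughly 20 you would expect by chance. A real tool repeats this for every term and then corrects for multiple testing.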

However, the background set is important. Including all genes in it makes a lot of assumptions, especially considering that not all genes are expressed in all tissues or conditions. For example, if you're working in T cells and using all genes as your background set, almost any test set you put in will be enriched for T cell-related terms. But if you limit your background set to only genes that are expressed (say that limit cuts the background set down to ~12,000 genes), your background frequencies are going to be quite different, especially for cell- or tissue-type-specific terms. Typically, you'll want to at least exclude genes that are not expressed from your background set to capture a better snapshot of the "normal" state of the cell. This is one of the most overlooked concepts in GO analysis and renders a lot of the analyses seen in published papers relatively meaningless.
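A small sketch of how much the background choice can matter. All counts here are hypothetical, chosen only to illustrate the point: a T cell-related term annotated to 600 of 20,000 genes overall, 550 of which fall within the ~12,000 expressed genes, and a test list of 100 genes with 9 hits. The test is the same one-sided hypergeometric tail, not any particular tool's implementation.

```python
from math import comb

def enrichment_p_value(hits, list_size, term_size, background_size):
    """P(at least `hits` annotated genes) under random sampling without
    replacement from the chosen background (hypergeometric tail)."""
    upper = min(list_size, term_size)
    favourable = sum(
        comb(term_size, i) * comb(background_size - term_size, list_size - i)
        for i in range(hits, upper + 1)
    )
    return favourable / comb(background_size, list_size)

# Hypothetical counts for a T cell-related term; same test list in both cases.
hits, list_size = 9, 100

# Background 1: all genes (term frequency 600 / 20,000 = 3%).
p_all_genes = enrichment_p_value(hits, list_size, term_size=600, background_size=20000)

# Background 2: expressed genes only (term frequency 550 / 12,000, about 4.6%).
p_expressed = enrichment_p_value(hits, list_size, term_size=550, background_size=12000)

print(f"all genes as background:       p = {p_all_genes:.3g}")
print(f"expressed genes as background: p = {p_expressed:.3g}")
```

With the expressed-only background the term is already more common before you pick any genes, so the same 9 hits are less surprising and the p-value is larger.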

As for your actual question, it really means neither of those things. The significance value merely reflects how confident the test is that the frequency of a term differs between your gene list and the background, as described above. GO has very little to do directly with research evidence, though the terms assigned to a given gene may be derived from it. Many of the associations are also inferred from things like protein structure/domains. Lastly, GO terms can be hilariously broad and nearly useless at times - I think this may be one such case, as "nucleic acid binding" and "cytoplasm" yield little info as to biological function. Indeed, many broad categories like that are just umbrella terms that have many, many additional child terms under them.
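To see why a more significant term does not simply mean "more of my genes", here is a sketch with two made-up terms: a very broad one annotated to half of a 20,000-gene background, and a narrower one annotated to 4,000 genes. The counts are hypothetical and the test is a plain hypergeometric tail, not DAVID's exact scoring.

```python
from math import comb

def enrichment_p_value(hits, list_size, term_size, background_size):
    """Hypergeometric tail probability of at least `hits` annotated genes."""
    upper = min(list_size, term_size)
    favourable = sum(
        comb(term_size, i) * comb(background_size - term_size, list_size - i)
        for i in range(hits, upper + 1)
    )
    return favourable / comb(background_size, list_size)

background_size, list_size = 20000, 100

# Hypothetical broad term: 10,000 annotated genes, 60 of the 100 list genes hit it.
broad_hits = 60
p_broad = enrichment_p_value(broad_hits, list_size, 10000, background_size)

# Hypothetical narrower term: 4,000 annotated genes, only 35 list genes hit it.
narrow_hits = 35
p_narrow = enrichment_p_value(narrow_hits, list_size, 4000, background_size)

print(f"broad term:  {broad_hits} hits, p = {p_broad:.3g}")
print(f"narrow term: {narrow_hits} hits, p = {p_narrow:.3g}")
```

The narrower term has fewer hits but the smaller p-value, because 35 hits against an expected 20 is a bigger departure from the background than 60 against an expected 50.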

One last thing: DAVID is probably one of the most unwieldy GO analysis tools out there now. There was a time when it was one of very few options, but it's now very outdated, in my opinion. This is definitely subjective, but I find tools like enrichR and clusterProfiler to be much more attractive options.

modified 14 months ago • written 14 months ago by jared.andrews07 (5.0k)