Question: Significance in gene ontology terms
19 months ago, The Last Word wrote:

When we do gene ontology analysis, we get a list of biological processes (BP), cellular components (CC) and molecular functions (MF) with different significance values. What exactly do these significance values mean? I did gene ontology analysis with a set of genes and the CC term cytoplasm came out more significant than the MF term nucleic acid binding. Does that mean that more genes from my list are present in the cytoplasm than participate in nucleic acid binding? Or does it mean that there is more research evidence pointing to genes being present in the cytoplasm than to genes taking part in nucleic acid binding? Or does it mean something else? I used DAVID for my gene ontology analysis.


Just approaching this from a logical point of view, it seems to me that your input, a gene list, is being compared to a collection of gene lists, each of which is associated with a GO term. Would that not mean that the scores are essentially a measure of the overlap between the two lists? Also, I think significance values should not be comparable across the three ontologies (BP/CC/MF), but I might be mistaken: there might be a normalization step in the scoring process. However, would you not want the most probable CC, MF and BP terms? Why compare between them if that's the case?

Reply written 19 months ago by RamRS

I am just getting a list of all the gene ontology terms that DAVID reports above a certain significance threshold. Of course, comparing significance values between terms is not what I want or plan to do, but this question struck me when I saw the list of terms and their significance values.

Reply written 19 months ago by The Last Word

19 months ago, jared.andrews07 (St. Louis, MO) wrote:

Pretty much all gene ontology analysis works the same way and answers the same question: for a given GO term, is your list of genes associated with that term at a significantly higher rate than the background? Your background could be all genes or a subset of genes, depending on what exactly you're trying to determine. So for example, let's take your nucleic acid binding term in humans. There are roughly 4,000 protein-coding genes associated with the term, a frequency of 20% if we round the number of protein-coding genes to 20,000.

Let's say you have a list of 100 genes that are down-regulated after treatment with a compound, and you want to determine what biological processes may be disrupted. The analysis will go through all the terms and, for each one, compare the frequency of genes associated with that term in your list against the frequency in the background to calculate a p-value (or q-value or FDR or whatever). Let's say 80 of your genes are associated with the nucleic acid binding term. That is a frequency of 80% versus 20% in the background, an increase that's very unlikely to be due to chance. The GO analysis would spit out that term as significantly enriched in your test set.
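To make the arithmetic above concrete, here is a minimal sketch that treats the comparison as a one-sided hypergeometric test, which is a common choice for this kind of over-representation analysis. That choice is an assumption for illustration: the exact statistic and any multiple-testing correction reported by DAVID or other tools may differ. The function name `hypergeom_upper_tail` is made up for this example, and the numbers are the rounded ones from the text.

```python
from math import comb

def hypergeom_upper_tail(N, K, n, k):
    """P(X >= k) when drawing n genes without replacement from a
    background of N genes, K of which carry the GO term."""
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1))
    return tail / total

# Rounded numbers from the example in the text.
background_size = 20000     # protein-coding genes used as background
term_in_background = 4000   # background genes annotated with the term (20%)
list_size = 100             # down-regulated genes in the test list
term_in_list = 80           # test-list genes annotated with the term (80%)

p_value = hypergeom_upper_tail(background_size, term_in_background,
                               list_size, term_in_list)
fold = (term_in_list / list_size) / (term_in_background / background_size)

print(f"fold enrichment: {fold:.1f}")
print(f"one-sided p-value: {p_value:.3g}")
```

In a real analysis this test is repeated for every term, so the raw p-values are usually adjusted for multiple testing before any term is called significant.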

However, the background set is important. Including all genes in it makes a lot of assumptions, especially considering that not all genes are expressed in all tissues or conditions. For example, if you're working in T cells and using all genes as your background set, almost any test set you put in will be enriched for T cell-related terms. But if you limit your background set to only genes that are expressed (say that limit cuts the background set down to ~12,000 genes), your background frequencies are going to be quite different, especially for cell/tissue-type-specific terms. Typically, you'll want to at least exclude genes that are not expressed from your background set to capture a better snapshot of the "normal" state of the cell. This is one of the most overlooked concepts in GO analysis and renders a lot of the analyses seen in published papers relatively meaningless.
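The effect of the background can be sketched with made-up data. Everything below is hypothetical: the gene IDs are synthetic, the term sizes are invented, and `enrichment_p` is an illustrative helper rather than any tool's API. The same test list and the same term are scored twice with a one-sided hypergeometric test, once against all 20,000 genes and once against a 12,000-gene "expressed" background.

```python
from math import comb

def enrichment_p(test_genes, term_genes, background):
    """One-sided hypergeometric p-value for a single term, after
    restricting both the test list and the term to the background."""
    bg = set(background)
    test = set(test_genes) & bg
    term = set(term_genes) & bg
    N, K, n = len(bg), len(term), len(test)
    k = len(test & term)  # observed overlap
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1))
    return k, tail / total

# Synthetic data (all hypothetical).
all_genes = [f"g{i}" for i in range(20000)]
expressed = all_genes[:12000]                 # genes detected in this cell type

# A term with 600 annotated genes, 550 of them expressed.
term = all_genes[:550] + all_genes[12000:12050]

# A 100-gene test list: 10 term genes plus 90 other expressed genes.
test_list = all_genes[:10] + all_genes[1000:1090]

k_all, p_all = enrichment_p(test_list, term, all_genes)
k_expr, p_expr = enrichment_p(test_list, term, expressed)

print(f"all genes as background:       overlap={k_all}, p={p_all:.3g}")
print(f"expressed genes as background: overlap={k_expr}, p={p_expr:.3g}")
```

With these invented numbers the overlap is identical in both runs, but the term covers about 3% of the full background and about 4.6% of the expressed one, so the same list looks less surprising against the expressed background.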

As for your actual question, it really means neither of those things. It merely reflects the confidence in the difference in frequencies between your gene list and the background, as described above. GO has very little to do directly with research evidence, though the terms assigned to a given gene may be derived from it. Many of the associations are also inferred from things like protein structure/domains. Lastly, GO terms can be hilariously broad and nearly useless at times. I think this may be one such case, as "nucleic acid binding" and "cytoplasm" yield little info as to biological function. Indeed, many broad categories like that are just umbrella terms that have many, many child terms under them.

One last thing: DAVID is probably one of the most unwieldy GO analysis tools out there now. There was a time when it was one of very few options, but it's now very outdated, in my opinion. This is definitely subjective, but I find tools like enrichR and clusterProfiler to be much more attractive options.


Excellent answer! I want to put special emphasis on "However, the background set is important." I think this is the most important aspect of any enrichment analysis, as the background notably changes the outcome. It matters most when people run enrichment analysis on assays where the number of analysed genes or proteins is notably smaller than the union of all annotated genes, or smaller than the number of genes commonly detected in standard bulk RNA-seq (typically roughly 10,000 to 15,000 genes, where "detected" means they have sufficiently large counts to be retained in differential analysis). Examples of assays with lower gene numbers are 10X-based scRNA-seq, where you commonly detect far fewer than 10k genes across an entire dataset, and high-throughput proteomics approaches, where you measure several thousand but fewer than 10k proteins. This is obviously different from using all annotated genes as background, which (depending on the source: RefSeq, Gencode...) can be several tens of thousands of genes. In fact, I abandoned several tools that do not allow a custom background to be defined, for exactly this reason. If you do not even detect most genes in your analysis, then it is improper to have them inflating your background.

Reply written 1 day ago by ATpoint