Question: PANTHER gene ontology - is the bonferroni correction important?
sbrown669 wrote, 10 weeks ago:

This is what it says on PANTHER with regards to the test:

The expression data analysis statistics now include a Bonferroni correction for multiple testing. The Bonferroni correction is important because we are performing many statistical tests (one for each pathway, or each ontology term) at the same time. This correction multiplies the single-test P-value by the number of independent tests to obtain an expected error rate.

For pathways, we now correct the reported P-values by multiplying by the number of associated pathways with two or more genes. Some proteins participate in multiple pathways, so the tests are not completely independent of each other and the Bonferroni correction is conservative. For ontology terms, the simple Bonferroni correction becomes extremely conservative because parent (more general) and child (more specific) terms are not independent at all: any gene or protein associated with a child term is also associated with the parent (and grandparent, etc.) terms as well.

To estimate the number of independent tests for an ontology, we count the number of classes with at least two genes in the reference list that are annotated directly to that class (i.e. not indirectly via an annotation to a more specific subclass).
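The multiplication described in the excerpt above can be sketched in a few lines of Python. This is an illustrative sketch, not PANTHER's actual code; the `n_tests` argument stands in for the estimated number of independent tests (for an ontology, the number of classes with at least two directly annotated genes in the reference list):

```python
# Illustrative sketch of a Bonferroni-style correction -- not PANTHER's
# actual code. `n_tests` is the estimated number of independent tests.
def bonferroni(p_values, n_tests=None):
    if n_tests is None:
        n_tests = len(p_values)
    # Multiply each single-test p-value by the number of tests, capping at 1.
    return [min(1.0, p * n_tests) for p in p_values]

print(bonferroni([0.001, 0.02, 0.2], n_tests=50))
```

Note how quickly the cap at 1.0 is reached: with 50 tests, any raw p-value above 0.02 becomes completely non-significant after correction.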

So I have been submitting gene lists with LFCs (log fold changes) to PANTHER. When I apply the correction, I get no results; when I run PANTHER without it, I get lots of significant results. How important is the correction to the results? Is it too stringent?

Jean-Karim Heriche (EMBL Heidelberg, Germany) wrote, 10 weeks ago:

It is as described in what you posted from the PANTHER website. You need a correction for multiple testing to control for significant results that would happen by chance (which would be false positives). So there is a trade-off: reducing the number of false positives increases the number of false negatives (truly significant results that are missed). You need to decide what matters more to you: having few false positives or having few false negatives. As stated, Bonferroni is very conservative, i.e. it is biased towards no false positives and many false negatives. This is why people often prefer the false discovery rate, as it allows you to more precisely control the proportion of false positives.
To recap: with no correction, you get many significant results, likely including many false positives. With the Bonferroni correction, you get few false positives at the cost of many false negatives. In your case, getting nothing after correction would indicate that a majority of your results could also be explained by chance alone.
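To make the trade-off concrete, here is a small self-contained Python sketch (my own illustration, not part of the original answer, using made-up p-values) comparing Bonferroni with the Benjamini-Hochberg FDR procedure:

```python
def bonferroni_adjust(p_values):
    # Bonferroni: multiply each p-value by the number of tests, cap at 1.
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    # Benjamini-Hochberg: p * m / rank, enforcing monotonicity by
    # walking down from the largest p-value.
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    prev = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        prev = min(prev, p_values[i] * m / rank)
        adjusted[i] = prev
    return adjusted

# Hypothetical raw p-values: Bonferroni keeps 1 term at alpha = 0.05,
# while BH keeps 4 -- fewer false negatives, looser error control.
raw = [0.001, 0.012, 0.02, 0.03, 0.5]
print(sum(q < 0.05 for q in bonferroni_adjust(raw)))   # 1
print(sum(q < 0.05 for q in benjamini_hochberg(raw)))  # 4
```

The same data, two corrections, very different numbers of "significant" terms — which is exactly the situation described in the question.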

mforde84 wrote, 10 weeks ago:

One of the assumptions of Bonferroni (and of FDR methods as well, e.g. BH) is independence of the tests. For gene ontologies that's technically not the case, since terms are part of a directed acyclic graph. So there's almost always an issue of dependence, except at the very lowest or highest semantic levels (which are arguably the least meaningful... like, yay, I'm enriched for biological processes, or yay, I'm enriched for ionic signaling... so what?). So if you want to do robust multiple testing correction for GO analysis, you'd have to do some sort of weighted analysis. I still see people doing Bonferroni out of either habit, a misunderstanding of the statistics involved, or both. Does it accurately correct for error in these instances? Arguably no. More meaningful approaches are weighted analysis, permutation testing, or a rank-based approach: e.g., sort by lowest p-value and establish a strict cutoff of, say, 1.0E-15. Also look at the number of genes being enriched for a term. If you see an enrichment in your list and only one gene is driving it, then it's probably baloney.
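The rank-based heuristic suggested above could be sketched like this (a hypothetical helper with made-up enrichment results; the strict cutoff and the one-gene check are the ones suggested in the answer):

```python
def filter_terms(results, p_cutoff=1.0e-15, min_genes=2):
    # results: list of (term, raw_p_value, n_genes_driving_enrichment).
    # Rank by p-value, then keep only terms that clear a strict cutoff
    # AND are driven by more than one gene from the input list.
    ranked = sorted(results, key=lambda r: r[1])
    return [r for r in ranked if r[1] <= p_cutoff and r[2] >= min_genes]

# Hypothetical enrichment results:
hits = filter_terms([
    ("GO:0006915", 1e-20, 5),   # strong, multi-gene: kept
    ("GO:0008150", 1e-10, 8),   # too weak for the strict cutoff: dropped
    ("GO:0016055", 1e-18, 1),   # single-gene enrichment: dropped
])
print(hits)  # [('GO:0006915', 1e-20, 5)]
```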


You're right that the corrections assume independence of the tests. However, in the case of non-independence the Bonferroni correction is over-conservative (see for example here), so it's OK to use if you're fine with that. The same goes for the Benjamini-Hochberg FDR correction, see this paper. So using these correction approaches when the tests are not independent is perfectly justified.

Does it accurately correct for error in these instances?

What do you mean by this? Bonferroni does strictly control the type I error rate.

The problem with the Bonferroni correction is that for large numbers of tests it becomes far too conservative, to the point where one doesn't find anything significant, which is why in such cases people prefer the FDR approach.

There's no single correct way of dealing with multiple testing. It really depends on the situation and on how costly false positives (type I errors) are versus false negatives (type II errors).

— Jean-Karim Heriche

I think we'll just have to agree to disagree here. I understand what you're saying, but interpreting results from these methods when the test assumptions don't hold is questionable, whatever the sense and degree, conservative or not. The a priori probability that two dependent hypotheses are both false is not 25%; it's unknown until you have priors to assess the relationship. So if you can't make an a priori assumption there, why does it make sense to make an assumption about the family-wise error rate or the false discovery rate?

— mforde84

I agree that, in general, one should make sure assumptions hold when applying statistical tests, but I think you're missing the point of what conservative means. It means that the p-value you obtain is guaranteed to be greater than the real one. When the assumptions are met, the FDR gives you the proportion of false positives; when the tests are not independent, it gives you an upper bound. It could still be useful to know that you have less than 10% false positives.

— Jean-Karim Heriche

Sure, but conservative to what degree? How conservative is even too conservative? That makes something like the OP's issue impossible to address using these types of statistical tests. If a potentially better statistical method is available to interpret the data, then why not use that one instead?

— mforde84

You know to what degree it's conservative: you get an upper bound on the p-value. There is no method better than another for multiple testing adjustment. Whichever you choose will give you a trade-off between false positives and false negatives. People are usually concerned with the removal of false positives, but the cost of being sure of having none can be that you've also removed many true positives (sometimes all of them).

In the case of GO term enrichment, the question is: what is the cost of considering a list of genes to be enriched in a particular term? If we want to characterize the list, or indirectly the process generating the list, we don't want too many mistakes; but on the other hand, we don't like having nothing to report, so false negatives are an issue too, even though few people acknowledge it.

What's the solution then? In my opinion, if one doesn't like the probabilistic treatment of the data, one should design experiments that directly and unambiguously address the question of interest. This usually means focused experiments. When this is not possible or not the goal, the other option is to try to minimize the number of tests. First, consider that statistical significance doesn't mean biological relevance, and in some instances the null hypothesis of the test is not even biologically credible, so reasoning with domain knowledge should be preferable to blind statistical tests. Second, one can be smarter in the choice of tests. For example, instead of testing the whole GO like most tools do, why not test only pertinent terms (e.g. as in this paper, or using domain knowledge or even common sense: why test for organism development terms when one is doing experiments on cells in culture)?

Statistical testing in biology is an ill-posed problem (or maybe simply a misused one). The question most people want to address is whether their hypothesis is true. This is not the question that statistical tests answer. The tests measure how compatible your data are with the null hypothesis, which is usually some sort of model for random data generation. Therefore, rejecting the null hypothesis doesn't mean that the hypothesis the experiment was designed to assess is true. So p-values do not offer support for a particular model or hypothesis; they only measure how improbable the observed data would be if the null hypothesis were true, which is usually irrelevant.

— Jean-Karim Heriche


Powered by Biostar version 2.3.0