Forum: FDR correction can sometimes lie to you!
25 days ago
Chakri ▴ 50

A heads-up for anyone working with high-dimensional omics data -- FDR correction can sometimes lie to you. A summary based on our recent publication: https://doi.org/10.1186/s13059-025-03734-z

The widely used False Discovery Rate (FDR) control method, Benjamini-Hochberg (BH), is a staple in omics research. But when analysing datasets with dependencies between features (gene expression, methylation, metabolites, QTL analyses, and more), it can behave unexpectedly.

Even when a study has no true biological signal (all null hypotheses are true), the BH method can occasionally generate thousands of statistically "significant" findings. This happens because dependencies in the data can cause many features to falsely appear significant together. While the overall FDR is controlled (e.g., <5% of experiments have errors), the experiments that do have errors can have thousands of them.
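The effect is easy to reproduce. Below is a minimal sketch (my own toy simulation, not the paper's code): apply BH to purely null, equicorrelated test statistics, where a single shared factor pushes all features up or down together, and count how many "discoveries" each experiment produces. `bh_reject` is a hand-rolled helper implementing the standard BH step-up rule.

```python
import numpy as np
from scipy.stats import norm

def bh_reject(pvals, q=0.05):
    """Benjamini-Hochberg step-up: boolean mask of rejected hypotheses."""
    m = pvals.size
    order = np.argsort(pvals)
    below = pvals[order] <= q * np.arange(1, m + 1) / m
    k = np.nonzero(below)[0].max() + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True  # step-up: reject everything up to the largest passing rank
    return reject

rng = np.random.default_rng(0)
m, rho, n_experiments = 1000, 0.9, 200

counts = []
for _ in range(n_experiments):
    # equicorrelated null: one shared factor, all null hypotheses true
    shared = rng.standard_normal()
    z = np.sqrt(rho) * shared + np.sqrt(1 - rho) * rng.standard_normal(m)
    p = 2 * norm.sf(np.abs(z))  # two-sided p-values
    counts.append(bh_reject(p).sum())
counts = np.array(counts)

print("fraction of experiments with any false discovery:", (counts > 0).mean())
print("largest single-experiment burst of false discoveries:", counts.max())
```

Most experiments yield zero discoveries, so the FDR is controlled on average, yet the rare experiments that do cross the threshold can reject features en masse.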

A Counter-Intuitive Trap: Using real-world and simulated data (methylation, gene expression, metabolite and eQTL analyses), we found this phenomenon to be persistent. The primary danger is that researchers may be misled by the sheer volume of these false findings. It feels intuitive to believe that if hundreds or thousands of features are flagged as significant, at least some of them must be real. However, we show this intuition can be wrong; it's possible that every single finding is false.

Risk of an increased number of false discoveries: This statistical artefact can lead researchers to incorrectly conclude that an underlying biological mechanism exists, which might even form the main conclusion of their study. Issues like broken test assumptions, study biases, or the researcher's flexibility in analysing the data can make this problem even worse.

So, what can you do? We suggest a few key strategies: Use negative controls/synthetic null data and other diagnostic checks as recommended in the article to identify and minimize caveats. If you continue to use the BH method, know its assumptions and formal guarantees to ensure correct interpretation of the findings. As a safer alternative, consider the Benjamini-Yekutieli (BY) method when you can tolerate a bit more type II error. It doesn't completely eliminate the issue, but it makes these large false-positive events much less frequent and severe. It's a good compromise between the popular BH method and overly conservative FWER corrections.
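To make the BH/BY trade-off concrete, here is a hand-rolled sketch of both step-up adjustments using the standard formulas (libraries like statsmodels and SciPy provide equivalent built-ins). BY is simply BH with the threshold shrunk by the harmonic number, which is what buys the robustness to arbitrary dependence:

```python
import numpy as np

def adjusted_pvalues(pvals, method="bh"):
    """Step-up adjusted p-values for BH or BY (hand-rolled sketch)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    adj = p[order] * m / np.arange(1, m + 1)
    if method == "by":
        adj *= (1.0 / np.arange(1, m + 1)).sum()   # harmonic-number penalty
    adj = np.minimum.accumulate(adj[::-1])[::-1]   # enforce monotonicity
    out = np.empty(m)
    out[order] = np.clip(adj, 0.0, 1.0)
    return out

p = [0.001, 0.008, 0.02, 0.028, 0.5]
print("BH:", adjusted_pvalues(p, "bh"))  # four features pass q = 0.05
print("BY:", adjusted_pvalues(p, "by"))  # only two do: BY trades power for safety
```

On this toy vector, BH calls four of the five features significant at q = 0.05, while BY calls only two: the extra conservatism is exactly the cost of insuring against correlated bursts.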

The bottom line: be aware of dependencies in your data! When false findings occur in highly correlated datasets, they can be numerous. Don't let your intuition fool you. Read the full open-access paper here: https://doi.org/10.1186/s13059-025-03734-z

Thanks to all collaborators & co-authors for useful inputs, brainstorming and perspectives: Maria Mamica, Emilie Willoch Olstad, Ingrid Hobæk Haff, Manuela Zucknick, Jingyi Jessica Li, & Geir Kjetil Sandve.

correction statistics FDR testing multiple
25 days ago

I will read the paper in detail and with interest, so forgive me if the points below are addressed in the paper. The immediate thoughts that spring to mind:

  1. All transcriptomics experiments are going to exhibit large numbers of dependencies (although we rarely have the power to detect them).

  2. I've always advised people not to use "number of significant genes" as a measure of the size of a transcriptomic perturbation.

  3. The reason we don't use Bonferroni is less that it is too conservative (else we would just use Bonferroni, but with a less conservative FWER threshold), but rather that it answers the wrong question. In any given experiment, I am not interested in P(at least one false discovery), but in the total fraction of discoveries that are false - these are qualitatively different, they are not just quantitative variations on the same thing.

  4. I'm surprised you didn't also see inflation of false discoveries using Bonferroni. It seems intuitive that a false discovery under Bonferroni should lead to other false discoveries in the presence of strong positive correlations between features. Is it perhaps that, under Bonferroni, the power of any reasonable transcriptomics experiment is so low that even the power to detect a gene highly correlated with a false discovery is low? And that if power were high enough (say, a theoretical experiment with thousands or tens of thousands of samples), you would see an inflation?

24 days ago
LChart 5.1k

This is a great paper which sheds light on what may be a counter-intuitive topic. I find a very simple thought experiment to be illuminating:

Consider drawing a single null p-value but replicating it 100 times, so you have perfect replication. You have an extreme scenario where 95% of the time all 100 features are accepted, and 5% of the time all 100 features are rejected. In expectation this is still calibrated. No correction: 5/100 scenarios show all 100 features significant; Bonferroni: 5/10,000 scenarios; BY: roughly 5/519 scenarios. In expectation these procedures still control what they are intended to control, but all of them truly fall short of capturing how the scenario should be interpreted.
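That thought experiment is easy to verify numerically. With 100 identical p-values, each step-up procedure rejects either all features or none, so the all-or-nothing frequencies reduce to simple threshold probabilities on the single underlying p-value (a quick sanity-check script, assuming q = 0.05):

```python
import numpy as np

m, q, n_trials = 100, 0.05, 100_000
harmonic = (1.0 / np.arange(1, m + 1)).sum()   # ~5.19 for m = 100
rng = np.random.default_rng(42)
p = rng.uniform(size=n_trials)                 # the single null p-value per trial

freq_raw  = (p <= q).mean()            # no correction: all rejected iff p <= q
freq_bonf = (p <= q / m).mean()        # Bonferroni: iff p <= q/m
freq_bh   = (p <= q).mean()            # BH with m ties: top threshold is q*m/m = q
freq_by   = (p <= q / harmonic).mean() # BY: BH threshold shrunk by the harmonic sum

print(f"all-100-significant frequency: raw/BH {freq_bh:.4f}, "
      f"BY {freq_by:.4f}, Bonferroni {freq_bonf:.5f}")
```

Note that BH coincides with no correction here: with 100 tied p-values the top-rank threshold is q itself, which is exactly why perfect correlation is the worst case for BH.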

Fundamentally, the issue is not that multiple-test correction methods fail to control error rates -- even in extreme cases, in expectation, they are all controlled! The issue is treating p-values or adjusted p-values as independent, failing to propagate the conditional structure forward into secondary (post-hoc) analyses. Bayesian approaches can properly model these dependencies (even the kind of low-dimension, high-ambient-dimension dependency structure of expression) to propagate uncertainties appropriately from individual features (e.g., genes) into meta-features (e.g., gene sets). It is a constant disappointment that more focus isn't devoted to building efficient Bayesian models for standard omics tests. Bayesian error propagation isn't even listed as a potential solution!
