Question

Benjamini–Hochberg correction, FDR in genes' expressions

20 months ago

gen92 • 0

I performed a differential gene expression analysis and found 100 genes differentially expressed between cases and controls. The uncorrected p-values are in the range 0.015 - 0.05 (significant at alpha = 0.05). When I apply the Benjamini–Hochberg correction, only a few (<10) differentially expressed genes are left with corrected P < 0.05 (FDR < 0.05). Theoretically, the FDR is applied to control/minimize the falsely discovered uncorrected Ps (which may arise by chance, given the alpha of 0.05). And after applying a correction, only 5% or so Ps should filter out. In my case, about 90% of the Ps are dropped after applying the correction.

Can anybody please explain the reason behind this? Or should I go with some other, less stringent method that drops fewer Ps?

Thank you.

False-discovery-rate • 1.3k views

updated 20 months ago by Ming Tommy Tang ★ 4.5k • written 20 months ago by gen92 • 0

"And after applying a correction, only 5% or so Ps should filter out."

Eh... I am sorry, it doesn't work like this! It is expected that a substantial proportion of significant p-values, especially those that are not strongly significant (e.g. in the range 0.015 - 0.05), will become non-significant after correcting for multiple testing. I am not a statistician and I don't feel prepared to explain in detail why, but I can refer you to this interesting page for a general overview of the topic.

20 months ago by Fabio Marroni ★ 3.0k

Hi, you may find my blog post helpful: https://divingintogeneticsandgenomics.rbind.io/post/understanding-p-value-multiple-comparisons-fdr-and-q-value/ . Please also plot a histogram of the raw p-values; for healthy p-values you should see a spike near 0. See http://varianceexplained.org/statistics/interpreting-pvalue-histogram/

20 months ago by Ming Tommy Tang ★ 4.5k

score 0 · Answer 1 · 2023-03-17

As mentioned above, the large reduction in differentially expressed genes (DEGs) is not unexpected. High variability in your data can account for the low number of DEGs, or it may be a true biological result. Switching to a less stringent correction just to increase the number of DEGs will only reduce confidence in the result, but something like independent hypothesis weighting, which prioritizes highly expressed genes, may help (in R, you can use the IHW package; see its vignette).

Alternatively, it's worth considering other analyses that don't depend on DEGs (e.g. pathway enrichment, GSEA, etc.).

score 0 · Answer 2 · 2023-03-17

Under a nominal p-value threshold of 0.05, if we test 20,000 genes and few or none of them are truly different, we would expect about 1,000 false positives on average (this is only an average; we can't say how many there will be in a given study). Thus, if you found 1,100 genes differentially expressed, we would expect about 91% (1,000/1,100) of the hits to be false positives.
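
The arithmetic can be written out as a small Python sketch. The 20,000 tests and 1,100 hits are the illustrative figures from this answer, not values from the question, and the calculation assumes that essentially every tested gene is null.

```python
# Expected false positives at a nominal threshold, assuming all genes are null.
n_tests = 20_000   # genes tested (illustrative figure, not from the question)
alpha = 0.05       # nominal p-value threshold
n_hits = 1_100     # hypothetical number of genes with p < alpha

expected_false_positives = n_tests * alpha
expected_false_fraction = expected_false_positives / n_hits

print(f"expected false positives: {expected_false_positives:.0f}")
print(f"expected share of hits that are false: {expected_false_fraction:.0%}")
```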

Under a BH correction we reduce the p-value threshold until we expect only 5% of the positives to be false positives. Your result is telling you that the only way to have fewer than 5% false positives is to have no positives.
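
To make this concrete, here is a minimal Python sketch of the BH adjustment. It is a toy implementation written for this post, not the code of any particular package; the function name `bh_adjust` and the made-up p-values (100 values between 0.015 and 0.049, plus 19,900 values spread evenly above 0.05) are assumptions chosen to resemble the situation in the question.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical data: 20,000 tests, only 100 of them nominally significant.
n_tests = 20_000
n_nominal = 100
low = [0.015 + 0.034 * k / (n_nominal - 1) for k in range(n_nominal)]
n_rest = n_tests - n_nominal
high = [0.05 + 0.95 * (k + 1) / n_rest for k in range(n_rest)]
pvalues = low + high

adjusted = bh_adjust(pvalues)
n_nominal_hits = sum(p < 0.05 for p in pvalues)
n_bh_hits = sum(a < 0.05 for a in adjusted)
print(f"nominal hits: {n_nominal_hits}, hits after BH: {n_bh_hits}")
```

With these made-up numbers the smallest raw p-value (0.015) would have to be below 0.05 × 1/20,000 to survive at rank 1, so nothing passes. The outcome depends heavily on how many tests go into the correction, which the question does not state.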

If you only found 100 DE genes at a nominal p-value threshold of 0.05 (assuming you tested something on the order of 20,000 genes), this is fewer than the number of false positives we would expect to find even if there were no true positives. Thus, at first glance, I would expect them all to be false positives. Given this, I'm not sure any amount of hypothesis weighting etc. will make any difference.

Indeed, having so many fewer genes significant at the nominal threshold than expected would make me wonder whether something went wrong with the analysis.
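
A rough way to see how unusual this is: under the simplifying assumptions that about 20,000 genes were tested (the question does not say), that all of them are null, and that the tests are independent, the number of nominal hits is approximately binomial. The Python sketch below just compares the observed count with that expectation.

```python
import math

n_tests = 20_000      # assumed number of genes tested (not stated in the question)
alpha = 0.05          # nominal p-value threshold
observed_hits = 100   # nominal hits reported in the question

expected_hits = n_tests * alpha
# Binomial standard deviation, assuming independent tests under the null.
sd_hits = math.sqrt(n_tests * alpha * (1 - alpha))
z_score = (observed_hits - expected_hits) / sd_hits

print(f"expected {expected_hits:.0f} +/- {sd_hits:.1f} hits, observed {observed_hits}")
print(f"z-score: {z_score:.1f}")
```

Real gene-level tests are correlated, so the true spread is wider than this, but a deficit of this size is still hard to attribute to chance.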

You might check the p-value histogram. It should either be relatively uniform, with equal numbers of genes in each p-value bin (suggesting there is no difference between your samples), or it should show an enrichment of genes at low p-values (suggesting a real signal that you are not sufficiently powered to assign to particular genes). If there is an enrichment of genes with high p-values, this suggests that the model you are using hasn't fit the data well.
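
If you don't have a plotting setup at hand, a text histogram is enough for a first look. The Python sketch below is only an illustration: `pvalue_histogram` is a helper made up for this post, and the p-values are simulated uniform values standing in for your own raw p-values, so it shows what the "no difference" case looks like.

```python
import random


def pvalue_histogram(pvalues, n_bins=20):
    """Count p-values in equal-width bins covering [0, 1]."""
    counts = [0] * n_bins
    for p in pvalues:
        # p == 1.0 would land in bin n_bins, so clamp it into the last bin.
        bin_index = min(int(p * n_bins), n_bins - 1)
        counts[bin_index] += 1
    return counts


# Simulated null p-values (uniform); replace with the raw p-values from your analysis.
rng = random.Random(0)
null_pvalues = [rng.random() for _ in range(20_000)]

counts = pvalue_histogram(null_pvalues)
width = 1 / len(counts)
for b, count in enumerate(counts):
    start = b * width
    print(f"{start:4.2f}-{start + width:4.2f} {'#' * (count // 25)} {count}")
```

A clearly taller first bar points to some signal, while bars that grow towards 1 are the warning sign described above.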