For my RNA-seq project, my collaborator generally provides a list of DEGs, and the output is ranked by p-value, not fold change. While I understand the significance of the p-value in multiple-testing problems, I have always assumed that once a gene passes the p-value cutoff, there is little difference, biologically, between p-values of 1E-40, 1E-10 and 1E-3. As long as the gene passes the threshold, it's the fold change that can explain the biology. Is this correct? Is there any value in ranking the genes by p-value?
P-values and the like are somewhat complicated beasts, and I am out of my depth here, so take my answer with every grain of salt you have...
As long as the gene passes the threshold, it's the fold change that can explain the biology. Is this correct?
I think this is not true: passing the threshold or not does not guarantee biological significance. A low p-value tells you that your data are unlikely under the null hypothesis, but it does not tell you which of these is more likely:
- null is true but your samples were funky, or not representative of the population you want to draw conclusions about
- null is false
This is especially true for RNAseq data, as most experiments have a small number of biological replicates. Besides, our thresholds are somewhat arbitrary anyway.
Another layer of complexity is that genes are not sampled equally, as some genes are much more expressed than others. This in turn leads to two potential problems:
1) smaller differences will be called significant for highly expressed genes, while larger differences from lowly or moderately expressed genes will be non-significant.
2) the fold change for more highly expressed genes will be more reliably estimated.
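To make the first point concrete, here is a rough sketch, not how any real DEG tool computes its p-values. It assumes one sample per condition, equal library sizes and pure Poisson counting noise (no biological variability), so that under the null the smaller of the two counts follows a Binomial(n, 0.5). The counts and the function name `two_sided_p` are made up for illustration.

```python
import math

def two_sided_p(count_a, count_b):
    """Exact two-sided binomial p-value for two counts.

    Assumption: equal library sizes and Poisson noise only, so under
    the null each read falls in either condition with probability 0.5.
    """
    n = count_a + count_b
    k = min(count_a, count_b)
    # Sum the lower tail P(X <= k) exactly using integer arithmetic.
    lower_tail = sum(math.comb(n, i) for i in range(k + 1))
    # The distribution is symmetric at 0.5, so double the tail and cap at 1.
    return min(1.0, 2 * lower_tail / 2**n)

# Hypothetical counts: (condition A, condition B)
genes = {
    "high_expr": (1000, 1200),  # 1.2-fold difference, many reads
    "low_expr": (3, 9),         # 3-fold difference, few reads
}

results = {}
for name, (a, b) in genes.items():
    results[name] = (b / a, two_sided_p(a, b))
    print(f"{name}: fold change = {b / a:.2f}, p = {results[name][1]:.3g}")
```

Under these assumptions the highly expressed gene gets a far smaller p-value despite its much smaller fold change, which is the imbalance described above.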
Also, interpreting RNAseq log(fold changes) is not straightforward, as they are moderated: a small number is added to every count to avoid infinite log(fold changes). This means that log(fold changes) for genes with zero or low counts will be farther from the "truth" than those for genes with large counts, and this may affect their relative ranking.
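A small sketch of that effect, assuming the added number is 1 (my choice for illustration; different tools moderate fold changes in different ways). The function name `moderated_log2fc` and the counts are hypothetical.

```python
import math

PSEUDOCOUNT = 1  # assumption: the "small number" added to every count

def moderated_log2fc(control, treated, pseudocount=PSEUDOCOUNT):
    """log2 fold change after adding a pseudocount to both counts."""
    return math.log2((treated + pseudocount) / (control + pseudocount))

# Hypothetical (control, treated) counts. The first two pairs share the
# same raw 4-fold ratio (log2 = 2); the last has an infinite raw ratio.
pairs = [(2, 8), (2000, 8000), (0, 8)]

for control, treated in pairs:
    lfc = moderated_log2fc(control, treated)
    print(f"control={control}, treated={treated}: moderated log2FC = {lfc:.3f}")
```

With these numbers the low-count gene is pulled well below its raw log2 ratio of 2, while the high-count gene barely moves, so two genes with the same raw ratio can end up far apart in a fold-change ranking.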
Finally, I take issue with your "explain the biology" expression, as I think it is rather fuzzy and ill-defined.