Question

Do I need to adjust the pvalue if I am only going to test one gene from omics data containing many genes?

14 months ago

cwwong13 ▴ 40

There are numerous online resources that summarise omics data and offer a quick check of the association between the omics measurements and a phenotype (such as TCGA survival data and the gene expression found here).

My question is: if I would like to use some of these data to supplement my current research question on one particular gene (and I am not 100% sure whether the provided p-values have been adjusted for multiple comparisons), do I need to adjust these p-values? Or should I choose a more conservative cutoff (e.g. 2.5e-6 for RNAseq data)?

I am testing only one gene (with a hypothesis for that one gene only), so is it valid to use the crude p-value? It would be nice if you could also point me to some reference articles supporting the claims. I have had a hard time searching for such an article, as the results are flooded with explanations of controlling the type 1 error in "exploratory" analyses that test many genes at once.

pvalue Omics RNAseq • 756 views

score 2 · Answer 1 · 2023-02-12

You will find a range of different advice on this. In general, if you only ever do one test, then you don't need to correct for multiple testing (because there is no multiple testing). To get a bit philosophical: if you compute 10,000 p-values but only ever look at one, you've only really done one hypothesis test, even if a computer has done 10,000 calculations (but not shown you the results).

So if you generate a DGE data set, run it through limma, extract only the result for your gene and throw the rest away, then absolutely, use the uncorrected value.
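To make the distinction concrete, here is a minimal simulation sketch in Python (standard library only). It assumes every gene is truly null and that each test statistic is an independent standard-normal z-score, which is a simplification and not how limma computes its statistics. The names `two_sided_p` and `simulate` and the figure of 100 genes are made up for illustration.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used to turn z-scores into p-values


def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def simulate(n_sims=2000, n_genes=100, alpha=0.05, seed=1):
    """Compare two ways of reading the same table of null p-values.

    Assumption: all genes are null and their statistics are independent.
    Returns (false-positive rate for one gene chosen in advance,
             false-positive rate when reporting the smallest p-value).
    """
    rng = random.Random(seed)
    hits_prespecified = 0
    hits_best_of_all = 0
    for _ in range(n_sims):
        pvals = [two_sided_p(rng.gauss(0.0, 1.0)) for _ in range(n_genes)]
        # Gene 0 plays the role of the gene you decided on before looking.
        if pvals[0] < alpha:
            hits_prespecified += 1
        # Scanning every gene and reporting the best one is multiple testing.
        if min(pvals) < alpha:
            hits_best_of_all += 1
    return hits_prespecified / n_sims, hits_best_of_all / n_sims


rate_one, rate_min = simulate()
print(f"pre-specified gene: {rate_one:.3f}")
print(f"smallest of 100:    {rate_min:.3f}")
```

Under these assumptions the pre-specified gene should come out falsely significant in roughly 5% of simulated experiments, no matter how many other genes were computed, whereas picking the smallest p-value after the fact is a false positive almost every time.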

Things get a bit trickier with big public data sets, though. You might only look at one gene, but I might look at a different one, and Bob might look at a third. Eventually enough people will look at enough genes that it is effectively as if one person had looked at them all, and there is a multiple testing problem.
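How fast this builds up can be sketched with a short calculation. Assuming (unrealistically) that the tests are independent and that every null hypothesis is true, the chance of at least one p-value below alpha among m tests is 1 - (1 - alpha)^m. The function names are hypothetical, and the figure of 20,000 genes is an assumption chosen only because 0.05 / 20,000 gives the 2.5e-6 cutoff mentioned in the question.

```python
def familywise_error_rate(m, alpha=0.05):
    """Chance of at least one false positive among m independent null tests."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_threshold(m, alpha=0.05):
    """Per-test cutoff that keeps the family-wise error rate at or below alpha."""
    return alpha / m


# m = number of genes that have been looked at, by one person or by many.
for m in (1, 10, 100, 20000):
    fwer = familywise_error_rate(m)
    cutoff = bonferroni_threshold(m)
    print(f"m={m:>6}  FWER={fwer:.3f}  Bonferroni cutoff={cutoff:.2e}")
```

With m = 1 the family-wise error rate is just alpha, which is the single-test case; it climbs quickly as more genes are examined, whoever does the examining.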

Some people will say you should correct for every test ever done in the history of science. These tend to be the sorts of people who think the whole hypothesis testing framework doesn't make sense in and of itself.

score 1 · Answer 2 · 2023-02-12

Generally speaking, I found the motivating example that introduces Ji and Liu's paper "Analyzing 'omics data using hierarchical models" very helpful for grasping the basic theory and problem of calling and ranking differentially expressed genes.

Regarding your question: p-values are a transformation of the test statistic T. When the null hypothesis H0 is true, p-values are realizations of an (approximately) uniform distribution p ∼ Unif(0, 1). Fig.1 in Storey and Tibshirani's paper, a histogram plot of p-values, nicely illustrates this. That figure also shows how a good T will tend to be larger under H1, so p will be smaller. Therefore, the smaller a p-value, the stronger the evidence against the null hypothesis H0 provided by the data.
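A histogram of this kind can be reproduced with a small simulation. The sketch below is only an illustration under stated assumptions: the test statistic is a z-score with unit variance, H0 corresponds to a mean of 0, and H1 to an arbitrarily chosen mean shift of 3. The names `two_sided_p` and `p_value_histogram` are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def p_value_histogram(mean_shift, n=10000, bins=10, seed=0):
    """Return the fraction of simulated p-values that falls in each bin.

    mean_shift = 0 simulates H0; a non-zero shift simulates H1
    (the size of the shift is an assumption made for illustration).
    """
    rng = random.Random(seed)
    counts = [0] * bins
    for _ in range(n):
        p = two_sided_p(rng.gauss(mean_shift, 1.0))
        # Clamp so that p == 1.0 lands in the last bin.
        counts[min(int(p * bins), bins - 1)] += 1
    return [c / n for c in counts]


null_hist = p_value_histogram(mean_shift=0.0)
alt_hist = p_value_histogram(mean_shift=3.0, seed=1)

# Each row lists the share of p-values in [0, 0.1), [0.1, 0.2), ..., [0.9, 1.0].
for label, hist in (("H0", null_hist), ("H1", alt_hist)):
    print(label, " ".join(f"{frac:.2f}" for frac in hist))
```

Under H0 every bin should hold close to 10% of the p-values, while under H1 most of the mass piles up in the first bin.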

Yet p-values are random variables, so the exact value of p for your gene of interest has no meaning whatsoever on its own. If H0 is true (your gene is not differentially expressed), p is about as likely to fall near 0.05 as near 0.9 (p ∼ Unif(0, 1)). If your gene is differentially expressed, the probability of p being small is much higher. But you will still need to find a biological mechanism and experimental proof to corroborate the association of your gene with your research question.