Question

Subset DEG results and error control

3.4 years ago

thyleal ▴ 160

Hi,

I was asked to create a dashboard hosting some genome-wide microarray and RNA-seq results that were computed using limma (an empirical Bayes method). Basically, the dashboard would let users look up one or more genes and see whether each was differentially expressed (DE), mainly to generate new hypotheses.

In the past, when I wanted to test a specific subset of genes (e.g., pathway X, ~50 genes) from a genome-wide experiment, I'd proceed as usual, but in the last step I'd apply the multiple-testing adjustment only to the subset of genes of interest and not even look at the remaining genes (as they were not of interest).

With this dashboard in mind, please consider this scenario:

A user submits 100 genes to screen (representing one or two well-defined biological pathways). The original DE table is then filtered, and Benjamini-Hochberg or Benjamini-Yekutieli FDR adjustment is applied only to those 100 genes before the results are returned. The user sees the results and then decides to include more genes from the same or other pathways, for a total of 150 genes. What should the dashboard do?

A) Adjust P-values for all 150 genes together, not just the 50 new ones;
B) Adjust P-values separately for each set (first the 100, then the 50);
C) Avoid genome-wide methods (e.g., empirical Bayes), perform DE with standard NHST (e.g., t-test, ANOVA), and adjust for multiple testing across all 150 genes;
D) Avoid genome-wide methods (e.g., empirical Bayes), perform DE with standard NHST (e.g., t-test, ANOVA), and adjust for multiple testing separately for sets 1 and 2;
E) Some other, better approach?
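To make the difference between re-adjusting and not re-adjusting concrete, here is a minimal sketch of how Benjamini-Hochberg adjusted P-values shift when genes are added to a query. The P-values are made up for illustration, and `bh_adjust` is a hand-written helper, not a function from limma or any other package.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values: a first query of 4 genes, then 2 genes added later.
first_query = [0.001, 0.012, 0.030, 0.400]
added_genes = [0.700, 0.900]

adj_first = bh_adjust(first_query)
adj_combined = bh_adjust(first_query + added_genes)

# The third gene has the same raw p-value in both runs,
# but its adjusted value depends on which set was adjusted.
print(adj_first[2], adj_combined[2])
```

In this toy example the third gene falls below 0.05 when only the first four genes are adjusted and rises above 0.05 once the two extra genes are included, so the answer a user sees depends on the query history.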

I also know that these procedures do not adjust confidence intervals, so that's another thing...

Does this make sense? I have a feeling that it is not ideal, but I'm not sure whether there's a way to properly control the error rates using this dashboard. I'm also assuming that the user would be honest and tell the software which lists were already retrieved, or that the system would have some sort of caching and editing of past queries.

My question probably does not have an easy answer, but I'd like to hear from statisticians and more experienced peers.

Thank you.

FDR error control statistics genome-wide inference • 1.2k views

ADD COMMENT • link updated 3.4 years ago by rpolicastro 13k • written 3.4 years ago by thyleal ▴ 160

score 3 · Accepted Answer · 2020-12-11

3.4 years ago

rpolicastro 13k

It's generally not advised to run FDR correction on subsets of RNA-seq results. FDR procedures use the distribution of p-values for correction; importantly, the roughly uniform right tail of the distribution is used to find the cutoff value. Depending on how you sample p-values, you change the distribution of p-values going into the calculation, and that distribution may not be representative of the whole dataset.

A better approach would be to apply FDR correction on the whole dataset, and simply subset that precalculated table on your dashboard.
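As a rough sketch of that approach, the adjustment can be run once over every gene and the dashboard can then only filter the stored table. The gene symbols and P-values below are invented, and `bh_adjust` and `query` are hypothetical helpers written for this example; in practice the adjusted values would come from the existing limma output.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical genome-wide results: gene symbol -> raw p-value.
raw_p = {
    "TGFB1": 0.0004, "SMAD3": 0.02, "STAT1": 0.0001,
    "STAT3": 0.3, "GAPDH": 0.8, "ACTB": 0.6,
}

# Adjust once, over all genes, and store the result.
genes = list(raw_p)
adj = bh_adjust([raw_p[g] for g in genes])
de_table = {g: {"p": raw_p[g], "adj_p": a} for g, a in zip(genes, adj)}


def query(table, requested):
    """Return stored rows for the requested genes; nothing is re-adjusted."""
    return {g: table[g] for g in requested if g in table}


small = query(de_table, ["TGFB1", "SMAD3"])
large = query(de_table, ["TGFB1", "SMAD3", "STAT1", "STAT3"])
```

Because the adjusted values are fixed before any query is made, a gene reports the same adjusted P-value no matter which other genes are requested alongside it.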

ADD COMMENT • link 3.4 years ago by rpolicastro 13k

Thanks! It makes sense and it is very straightforward...

ADD REPLY • link 3.4 years ago by thyleal ▴ 160

Do you know of a procedure that does not use the p-value distribution, other than FWER methods such as Bonferroni? Thank you again.

ADD REPLY • link 3.4 years ago by thyleal ▴ 160

Also, if I see no deviation between the FDR estimated from the subset and the FDR estimated from the full set, do you believe it is reasonable to use the subset-based adjustment? I mean, the FDR estimates from the Storey method (qvalue) look similar. E.g.:

Full - 16000 p-values https://ibb.co/QJTJJkh

Subset - 600 p-values https://ibb.co/mFZxHtN

ADD REPLY • link 3.4 years ago by thyleal ▴ 160

Is that a random sample of genes? It sounds like you want to let people pick their own subset of genes, which is not random sampling.

ADD REPLY • link 3.4 years ago by rpolicastro 13k

Not random; two specific pathways, e.g., the TGFB pathway and the STAT cascade.

ADD REPLY • link 3.4 years ago by thyleal ▴ 160

I've been studying Storey's method, and I also tried the following: I estimated the q-values using the whole dataset. Then I took the genome-wide estimate of the proportion of true null tests (pi0) and used it when computing q-values for the subset, so that the subset q-values depend on the genome-wide p-value distribution. What do you think of this? The estimates are more liberal than BH, but more conservative than those obtained by estimating pi0 from the subset.
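A minimal sketch of that idea, assuming a fixed tuning parameter (lambda = 0.5) for the pi0 estimate rather than the smoothing approach the qvalue package can also use. The P-values are made up, and `estimate_pi0` and `qvalues` are hand-written helpers for illustration, not the qvalue package API.

```python
def estimate_pi0(pvalues, lam=0.5):
    """Fixed-lambda estimate of the proportion of true nulls, capped at 1."""
    m = len(pvalues)
    above = sum(1 for p in pvalues if p > lam)
    return min(1.0, above / (m * (1.0 - lam)))


def qvalues(pvalues, pi0):
    """q-values for the given p-values using a supplied pi0 (pi0 = 1 gives BH)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pi0 * pvalues[idx] * m / rank)
        q[idx] = running_min
    return q


# Hypothetical genome-wide p-values and a user-chosen subset of them.
genome = [0.0002, 0.001, 0.004, 0.01, 0.03, 0.2,
          0.35, 0.55, 0.6, 0.7, 0.8, 0.95]
subset = [0.0002, 0.004, 0.03, 0.7]

pi0_genome = estimate_pi0(genome)   # estimated from all genes
pi0_subset = estimate_pi0(subset)   # estimated from the subset only

q_bh = qvalues(subset, 1.0)                  # plain BH on the subset
q_genome_pi0 = qvalues(subset, pi0_genome)   # subset with genome-wide pi0
q_subset_pi0 = qvalues(subset, pi0_subset)   # subset with its own pi0
```

In this toy example the subset is enriched for small P-values, so its own pi0 is lower than the genome-wide one and the three sets of q-values are ordered as described above. That ordering follows from the made-up numbers and is not guaranteed for every subset.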

ADD REPLY • link 3.4 years ago by thyleal ▴ 160