Differential expression using the whole dataset vs. using subset
6.2 years ago

Hi all,

I have a statistical question. Over and over, I've been asked to run differential expression analyses on subsets of genes instead of the whole dataset, so that genes which miss the cutoff after p-value adjustment suddenly pass it. Instinct tells me this is cheating, and I always say so, but the "formal" reason eludes me. I understand why we correct for multiple testing, and I understand the simple math for calculating adjusted p-values. I just don't get the concept of a test being more reliable when you run it 200 times than when you run it 20000, even when the initial values are exactly the same. I would like to offer my colleagues an explanation other than "just because" (and, of course, understand it myself). Can anyone explain this in a way that I can pass on properly next time?

Sorry for the basic question (I don't have a strong statistics background) and thank you all in advance.

RNA-Seq statistics • 1.5k views

I'd say that if your filtering method is independent of the test you perform, then it's okay. Read about "independent filtering".
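To make the distinction concrete, here is a minimal simulation sketch, not taken from any particular package. It assumes 20000 genes with no true differential expression (uniform p-values), and the names `bh_adjust` and `n_hits` are hypothetical helpers written for this illustration. It compares Benjamini-Hochberg results on the full set, on a subset chosen by looking at the p-values, and on a subset chosen without looking at them.

```python
import random


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def n_hits(pvalues, alpha=0.05):
    """Number of genes called significant after BH adjustment."""
    return sum(q < alpha for q in bh_adjust(pvalues))


rng = random.Random(0)            # fixed seed so the sketch is repeatable
n_genes, subset_size = 20000, 200

# Assumption: every gene is null, so its p-value is uniform on [0, 1).
null_p = [rng.random() for _ in range(n_genes)]

hits_full = n_hits(null_p)
# Subset chosen BECAUSE of the test result: the 200 smallest p-values.
hits_cherry_picked = n_hits(sorted(null_p)[:subset_size])
# Subset chosen without looking at the test result: 200 genes at random.
hits_independent = n_hits(rng.sample(null_p, subset_size))

print("full set:", hits_full)
print("subset picked by p-value:", hits_cherry_picked)
print("subset picked independently:", hits_independent)
```

Every call in this simulation is a false positive by construction. The correction assumes it sees all the tests that were candidates, so a subset picked by p-value hides most of the tests that were actually run, whereas a subset picked independently of the test does not.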


"Over and over, I've been asked to run differential expression analyses on subsets of genes"

Without a strong reason to do that, this is plain wrong! It is common to remove genes that are very lowly expressed across samples, but that is not equivalent to running the test on a hand-picked subset.

"I just don't get the concept of a test being more reliable when you run it 200 times than when you run it 20000, even when the initial values are exactly the same."

I am not sure I understand your concern here. Could you elaborate a bit?


As per Santosh, in many expression studies we end up discarding half of the dataset because of low expression (e.g. mean raw counts below 10), low variance, or some other reason such as too many missing or NA values. This practice is generally accepted, and it reduces the number of tests, so the false discovery rate correction applied later during differential expression analysis is less stringent.
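As a minimal sketch of such a filter, assuming a small made-up count table (the gene names, counts, and the `passes_filter` helper are hypothetical), the cutoff is applied to the mean across all samples without looking at which group each sample belongs to:

```python
# Hypothetical raw counts: gene -> counts across six samples (both groups pooled).
counts = {
    "geneA": [0, 2, 1, 3, 0, 1],
    "geneB": [120, 98, 143, 310, 280, 295],
    "geneC": [9, 12, 8, 11, 10, 7],
    "geneD": [15, 22, 18, 19, 25, 21],
}

MIN_MEAN_COUNT = 10  # the example cutoff mentioned above


def passes_filter(sample_counts, min_mean=MIN_MEAN_COUNT):
    """Keep a gene if its mean raw count over ALL samples reaches the cutoff."""
    return sum(sample_counts) / len(sample_counts) >= min_mean


# Group labels are never consulted, so the filter cannot peek at the comparison.
kept = [gene for gene, c in counts.items() if passes_filter(c)]
print(kept)
```

Only the genes in `kept` would then go forward to testing and multiple-testing correction.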

So, we already run our differential expression analyses on 'subsets' of our datasets, and this helps to avoid the increased 'harshness' of an FDR correction that comes with testing more variables. I do not regard this as 'cheating' in any way. As the analyst, it's up to you to feel confident in your results and to tell your colleague(s) what is right and wrong, and what the limitations of each method are. Granted, the confidence to rebuke a biologist comes with experience.
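The 'harshness' is plain arithmetic. Here is a rough sketch, assuming the Benjamini-Hochberg procedure is the correction in use; `bh_adjust` and `adjusted_top_hit` are hypothetical helper names, and the evenly spaced background p-values are made up for illustration. The same raw p-value of 1e-5 is adjusted among 200 tests and among 20000 tests.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def adjusted_top_hit(top_p, n_tests):
    """Adjusted p-value of one gene with raw p-value top_p among n_tests genes."""
    # Made-up background: the other genes get evenly spaced, unremarkable p-values.
    background = [i / n_tests for i in range(1, n_tests)]
    return bh_adjust([top_p] + background)[0]


adj_200 = adjusted_top_hit(1e-5, 200)
adj_20000 = adjusted_top_hit(1e-5, 20000)
print("adjusted among 200 tests:", adj_200)
print("adjusted among 20000 tests:", adj_20000)
```

The top gene's raw p-value is multiplied by the number of tests divided by its rank, so it comes out at about 0.002 among 200 tests and about 0.2 among 20000. The evidence for the gene is identical in both cases; what changes is how many chances there were to see a p-value that small by luck alone.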

Differential expression analysis is not the end of the line either; it just gives us clues. Many downstream methods, such as regression modelling, can be used to gauge how trustworthy the differential expression results are.

Also look up what Wouter mentions (independent filtering), as it can help to eliminate unreliable genes that would otherwise give misleading p-values.
