Suppose you have three experimental conditions in an RNA-seq experiment: control, compound A treatment, and compound B treatment, each with replicates. Your RNA-seq samples look like the following:
c1,c2,a1,a2,b1,b2
The group factor is (1,1,2,2,3,3).
You use edgeR for the analysis. You have two questions to answer: what are the DEGs for treatment A, and what are the DEGs for treatment B? Let's focus on the first question -- what are the DEGs for treatment A?
In terms of sample inclusion for analysis, we have two options:
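1) c1,c2, a1,a2. You fit only these four samples and compare (a1,a2) with (c1,c2), e.g., coef=2 in a design with an intercept.
2) c1,c2, a1,a2, b1,b2. You fit all six samples and still compare (a1,a2) with (c1,c2), but through contrast=c(-1,1,0) (this contrast assumes a design without an intercept, with one coefficient per group).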
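For concreteness, the two options could be set up in edgeR roughly as below. This is only a sketch: the count matrix is simulated with rnbinom() purely so the code runs, and the object names (counts, y1, y2, design1, design2, lrt1, lrt2) are made up for illustration. It assumes the GLM likelihood-ratio pipeline (glmFit/glmLRT); the original post does not say which edgeR test was used.

```r
library(edgeR)

set.seed(1)
# Simulated counts for illustration only: 1000 genes x 6 samples
counts <- matrix(rnbinom(6000, mu = 100, size = 10), ncol = 6,
                 dimnames = list(paste0("gene", 1:1000),
                                 c("c1", "c2", "a1", "a2", "b1", "b2")))
group <- factor(c(1, 1, 2, 2, 3, 3))

# Option 1: control and A samples only; design with intercept, test coef = 2
keep <- group %in% c(1, 2)
g1 <- droplevels(group[keep])
y1 <- DGEList(counts = counts[, keep], group = g1)
y1 <- calcNormFactors(y1)
design1 <- model.matrix(~ g1)
y1 <- estimateDisp(y1, design1)
fit1 <- glmFit(y1, design1)
lrt1 <- glmLRT(fit1, coef = 2)

# Option 2: all six samples; no-intercept design, contrast = A - control
y2 <- DGEList(counts = counts, group = group)
y2 <- calcNormFactors(y2)
design2 <- model.matrix(~ 0 + group)
y2 <- estimateDisp(y2, design2)
fit2 <- glmFit(y2, design2)
lrt2 <- glmLRT(fit2, contrast = c(-1, 1, 0))

# Top-ranked genes under each option
topTags(lrt1)
topTags(lrt2)
```

Both tests target the same A-versus-control log fold change. What differs between them is which samples feed the normalization and the dispersion estimates.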
The advantage of option 1) is that it's simple, because we do it all the time -- a pairwise comparison. The advantage of option 2) is that it uses more samples and should give you more statistical power; the disadvantage is that it's not as simple, and including the condition B samples could complicate the analysis and make it harder to explain to your biologists.
I compared the two options and found that they gave somewhat different results. Power-wise (judging by the minimum p-values), sometimes option 1) is better and sometimes option 2) is better.
Which is the better practice?
Thanks!
Unless you have a group that received both A and B, I'm not sure you're going to gain much from option 2.
I thought similarly. But would you get a better estimate of the variance with more samples?
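One way to put a number on the variance question is to count the residual degrees of freedom (samples minus fitted coefficients) available for estimating dispersion under each option. The base-R sketch below uses the group factor from the original post; the names design1, design2, df1 and df2 are made up for illustration. It assumes the condition B samples share the same dispersion as the control and A samples, which is what fitting them in one model implies.

```r
group <- factor(c(1, 1, 2, 2, 3, 3))

# Option 1: four samples, two coefficients (intercept + A effect)
keep <- group %in% c(1, 2)
design1 <- model.matrix(~ droplevels(group[keep]))
df1 <- nrow(design1) - ncol(design1)

# Option 2: six samples, three coefficients (one mean per group)
design2 <- model.matrix(~ 0 + group)
df2 <- nrow(design2) - ncol(design2)

# Residual degrees of freedom: 2 for option 1, 3 for option 2
c(option1 = df1, option2 = df2)
```

With two replicates per group, the B samples add only one residual degree of freedom, so the gain in the variance estimate is probably modest, and it relies on the shared-dispersion assumption holding.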