Question: edgeR analysis dilemma -- what samples to include
moxu440 wrote, 21 months ago:

Suppose you have three experimental conditions for your RNA-seq: control, compound A treatment, compound B treatment, each with replicates. Your RNA-seq samples look like the following:

c1,c2,a1,a2,b1,b2

And you use edgeR for the analysis. You have two questions to answer: what are the DEGs for treatment A, and what are the DEGs for treatment B. Let's just focus on the 1st question -- what are the DEGs for treatment A?

In terms of sample inclusion for analysis, we have two options:

1) c1,c2, a1,a2. You just compare (a1,a2) with (c1,c2), e.g., coef=2

2) c1,c2, a1,a2, b1,b2. You compare (a1,a2) with (c1,c2), but you use contrast=c(-1,1,0).
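
To make the two specifications concrete, here is a minimal sketch in plain Python (not edgeR itself). It assumes that `coef=2` refers to a design with an intercept where C is the baseline, and that `contrast=c(-1,1,0)` refers to a design without an intercept whose columns are ordered C, A, B; that column order is an assumption. The log2 values are made up for illustration, and ordinary group means stand in for the fitted coefficients of edgeR's negative binomial GLM.

```python
from statistics import mean

# Hypothetical log2 expression values for one gene (illustration only).
log_expr = {
    "c1": 5.0, "c2": 5.4,
    "a1": 6.9, "a2": 7.3,
    "b1": 5.1, "b2": 4.9,
}
groups = {"C": ["c1", "c2"], "A": ["a1", "a2"], "B": ["b1", "b2"]}


def group_means(levels):
    """Mean log2 expression per group, in the order given by `levels`."""
    return [mean(log_expr[s] for s in groups[g]) for g in levels]


# Option 1: only C and A samples, intercept design.
# Coefficient 1 is the baseline (C); coefficient 2 is A minus C.
mean_c, mean_a = group_means(["C", "A"])
coef2 = mean_a - mean_c

# Option 2: all six samples, no-intercept design with columns C, A, B.
# The contrast (-1, 1, 0) is a weighted sum of the three group means.
contrast = [-1, 1, 0]
effect = sum(w * m for w, m in zip(contrast, group_means(["C", "A", "B"])))

print(f"option 1 (coef 2):   {coef2:.2f}")
print(f"option 2 (contrast): {effect:.2f}")
```

In this simplified picture both options estimate the same A vs C difference; what changes between them is the normalization and the variance estimate that the test is built on.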

The advantage of option 1) is that it's simple, because we do pairwise comparisons all the time. The advantage of option 2) is that it uses more samples and should give you more statistical power. The disadvantage is that it's not as simple: including the condition B samples could complicate the analysis and usually makes the results harder to explain to your biologists.

I compared the above two options, and found they generated somewhat different results. Power-wise (min p-values) sometimes option 1) is better, sometimes 2) is better.

Which way is a better practice?

Thanks!


Unless you have a group that received both A and B, I'm not sure you're going to gain much from option 2.

I thought similarly. Would you get a better estimate of the variance with more samples?

Carlo Yague4.6k (Belgium) wrote, 21 months ago:

In my opinion, the second option is by far the best, for two reasons:

1. As you said, the estimate of the variance is better with more samples. By the way, this doesn't always mean more power (smaller minimum p-values). In some cases, the added samples increase the variance estimate, so the p-value increases, but for good reason.
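
As a rough illustration of this point, the sketch below computes a pooled within-group variance for one gene with and without the B samples. It is a Gaussian analogue only: edgeR estimates negative binomial dispersions and moderates them across genes, which this does not attempt. The log2 values are invented, with B chosen to be noisier than C and A.

```python
from statistics import mean


def pooled_variance(groups):
    """Pooled within-group variance and its residual degrees of freedom."""
    sum_sq = 0.0
    for values in groups:
        centre = mean(values)
        sum_sq += sum((x - centre) ** 2 for x in values)
    resid_df = sum(len(values) - 1 for values in groups)
    return sum_sq / resid_df, resid_df


# Hypothetical log2 expression values for one gene (illustration only).
c = [5.0, 5.4]
a = [6.9, 7.3]
b = [4.2, 5.8]  # deliberately noisier than C and A

var_opt1, df_opt1 = pooled_variance([c, a])     # four samples
var_opt2, df_opt2 = pooled_variance([c, a, b])  # six samples

print(f"option 1: variance {var_opt1:.2f} on {df_opt1} df")
print(f"option 2: variance {var_opt2:.2f} on {df_opt2} df")
```

Here the six-sample estimate rests on one more residual degree of freedom, but it is also larger, so the A vs C p-value for this gene would go up. With a quieter B group the estimate would go down instead.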

2. It makes sense to normalize and process all your samples together, since they are from the same study. The fact that you can answer both of your questions (the effect of A and the effect of B) from the same dataset, only changing the contrast, makes option 2 simpler than option 1. If you choose option 1, the size factors and the variance estimates would not be the same in datasets (c1,c2, a1,a2) and (c1,c2, b1,b2). Imagine showing a biologist two Excel sheets with the normalized counts, one for dataset (A vs C), the other for dataset (B vs C). Wouldn't it be confusing for them to see that the value for gene X in sample c1 is different in the two sheets?
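
The sketch below shows how the normalized value of the same sample can change with the sample set. It uses median-of-ratios size factors (the DESeq-style scheme) because it is compact; edgeR's own TMM normalization is a different method, so treat this as an analogy. The count table is invented, with the B libraries assumed to be sequenced about twice as deep.

```python
from math import exp, log
from statistics import mean, median

# Hypothetical raw counts for five genes per sample (illustration only).
counts = {
    "c1": [100, 200, 50, 400, 80],
    "c2": [110, 190, 55, 420, 75],
    "a1": [105, 600, 52, 410, 82],
    "a2": [95, 580, 48, 390, 78],
    "b1": [200, 400, 100, 800, 40],
    "b2": [210, 380, 110, 820, 38],
}


def size_factors(samples):
    """Median-of-ratios size factors computed within `samples` only."""
    n_genes = len(counts[samples[0]])
    # Per-gene log geometric mean across the included samples.
    log_ref = [mean(log(counts[s][g]) for s in samples) for g in range(n_genes)]
    return {
        s: exp(median(log(counts[s][g]) - log_ref[g] for g in range(n_genes)))
        for s in samples
    }


sf_ca = size_factors(["c1", "c2", "a1", "a2"])
sf_cb = size_factors(["c1", "c2", "b1", "b2"])

# Normalized count of the first gene ("gene X") in sample c1, per dataset.
norm_ca = counts["c1"][0] / sf_ca["c1"]
norm_cb = counts["c1"][0] / sf_cb["c1"]

print(f"c1 size factor: {sf_ca['c1']:.2f} (C+A) vs {sf_cb['c1']:.2f} (C+B)")
print(f"gene X in c1:   {norm_ca:.1f} (C+A) vs {norm_cb:.1f} (C+B)")
```

The raw count for gene X in c1 is identical in both datasets, yet the normalized values differ because the reference is built from whichever samples are included. Normalizing all six samples once avoids the mismatch.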

Well said. Thanks much!

igor8.3k (United States) wrote, 21 months ago:

Regardless of statistics, only option 2 is realistic. Eventually someone is going to ask you to:

• compare conditions A and B
• put gene X for all conditions on the same bar plot (need to be normalized together)
• add significance levels for each comparison (all pairwise comparison combinations will have to be done)

Yeah, sounds reasonable to me. One thing worth mentioning might be the heatmap. Most biologists still think in terms of pairwise comparisons. When you present your heatmap, it is probably better to show just (c1,c2,a1,a2) or (c1,c2,b1,b2) for your top hits. All six samples (c1,c2,a1,a2,b1,b2) can be shown if you display all genes or genes not selected by any pairwise comparison.

I agree with your options. However, it really depends on your question. For example, you can plot significant genes from C vs A for all samples to see how B samples cluster. Are they more similar to C or A?

Agreed.

A related technical question: on the heatmap, can you cluster genes by the DEGs between C & A and let B tag along, without B being involved in the gene clustering? It would be nice if this could be done.
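
One way this might be approached: compute the gene (row) ordering from the C and A columns only, then apply that ordering to the full six-column matrix before drawing the heatmap. The sketch below is a minimal plain-Python version with a naive average-linkage clustering written out by hand; the expression values, gene names, and the `leaf_order` helper are all hypothetical. In practice you would hand the precomputed row order or dendrogram to whatever heatmap tool you use.

```python
from itertools import combinations
from math import dist

samples = ["c1", "c2", "a1", "a2", "b1", "b2"]

# Hypothetical row-centered log expression values (illustration only).
expr = {
    "g1": [-1.0, -0.9, 1.0, 0.9, 0.1, 0.0],
    "g2": [1.0, 1.1, -1.0, -1.1, 2.0, 2.1],
    "g3": [-0.8, -1.0, 0.9, 1.1, -2.0, -1.9],
    "g4": [0.9, 1.0, -1.1, -0.9, -0.1, 0.2],
}

# Columns allowed to influence the gene clustering: C and A only.
cluster_on = [samples.index(s) for s in ("c1", "c2", "a1", "a2")]


def leaf_order(rows):
    """Row order from a naive average-linkage agglomerative clustering."""
    clusters = [[i] for i in range(len(rows))]

    def linkage(a, b):
        # Average Euclidean distance between all cross-cluster row pairs.
        return sum(dist(rows[i], rows[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append(a + b)  # merged cluster keeps its members adjacent
    return clusters[0]


genes = list(expr)
ca_only = [[expr[g][j] for j in cluster_on] for g in genes]
gene_order = [genes[i] for i in leaf_order(ca_only)]

# B columns are carried along in the reordered matrix but never clustered on.
ordered_rows = [expr[g] for g in gene_order]

for gene, row in zip(gene_order, ordered_rows):
    print(gene, row)
```

Genes with the same C vs A pattern end up next to each other regardless of what they do in B, so the B columns can be read against a gene order defined purely by the C vs A comparison.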