Question

edgeR analysis dilemma -- what samples to include

1

6.4 years ago

moxu ▴ 510

Suppose you have three experimental conditions for your RNA-seq: control, compound A treatment, compound B treatment, each with replicates. Your RNA-seq samples look like the following:

c1,c2,a1,a2,b1,b2

And your factors are (1,1,2,2,3,3).

And you use edgeR for the analysis. You have two questions to answer: what are the DEGs for treatment A, and what are the DEGs for treatment B. Let's just focus on the 1st question -- what are the DEGs for treatment A?

In terms of sample inclusion for analysis, we have two options:

1) c1,c2, a1,a2. You just compare (a1,a2) with (c1,c2), e.g., coef=2

2) c1,c2, a1,a2, b1,b2. You compare (a1,a2) with (c1,c2), but you use contrast=c(-1,1,0).
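
For concreteness, here is a minimal sketch of how the two options might be set up in edgeR. The simulated counts stand in for a real count matrix, and the names `counts`, `group` and `fit_edger` are assumptions made for illustration. Note that `coef=2` presumes a design with an intercept (`~group`), while `contrast=c(-1,1,0)` presumes a design without one (`~0+group`).

```r
library(edgeR)

set.seed(1)
# Simulated counts: 1000 genes x 6 samples (placeholder for real data)
counts <- matrix(rnbinom(6000, mu = 100, size = 10), ncol = 6,
                 dimnames = list(paste0("gene", 1:1000),
                                 c("c1", "c2", "a1", "a2", "b1", "b2")))
group <- factor(c(1, 1, 2, 2, 3, 3))

# Hypothetical helper: normalize, estimate dispersions, fit the GLM
fit_edger <- function(counts, design) {
  y <- DGEList(counts = counts)
  y <- calcNormFactors(y)
  y <- estimateDisp(y, design)
  glmFit(y, design)
}

# Option 1: control and A samples only.
# Intercept design, so coefficient 2 is "A minus control".
keep <- group %in% c(1, 2)
g1 <- droplevels(group[keep])
design1 <- model.matrix(~ g1)
res1 <- glmLRT(fit_edger(counts[, keep], design1), coef = 2)

# Option 2: all six samples.
# No-intercept design, so the contrast selects "A minus control".
design2 <- model.matrix(~ 0 + group)
res2 <- glmLRT(fit_edger(counts, design2), contrast = c(-1, 1, 0))

topTags(res1, n = 5)
topTags(res2, n = 5)
```

If option 2 were fitted with the intercept design `~group` on all six samples, `coef = 2` would test the same A versus control comparison as the contrast above.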

The advantage of option 1) is that it's simple, because we do it all the time -- a pairwise comparison. The advantage of option 2) is that it uses more samples and should give you more statistical power. Its disadvantage is that it's not as simple: including the condition B samples could complicate the analysis and usually makes the results harder to explain to your biologists.

I compared the two options and found that they gave somewhat different results. In terms of power (smallest p-values), sometimes option 1) does better and sometimes option 2) does.

Which way is a better practice?

Thanks!

RNA-Seq software error R next-gen • 2.1k views

updated 6.4 years ago by igor 13k • written 6.4 years ago by moxu ▴ 510

0

Unless you have a group that received both A and B, I'm not sure you're going to gain much from option 2.

Reply • 6.4 years ago by Jeremy Leipzig 22k

0

I thought similarly. But wouldn't you get a better estimate of the variance with more samples?

Reply • 6.4 years ago by moxu ▴ 510

2

6.4 years ago

igor 13k

Regardless of statistics, only option 2 is realistic. Eventually someone is going to ask you to:

- compare conditions A and B
- put gene X for all conditions on the same bar plot (the samples need to be normalized together)
- add significance levels for each comparison (all pairwise comparisons will have to be done)
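
A rough sketch of what that joint analysis could look like in edgeR, again on simulated counts. The group labels (`ctrl`, `A`, `B`), the contrast names and the object names are assumptions for illustration, and the quasi-likelihood pipeline is just one reasonable choice rather than the only way to do it.

```r
library(edgeR)  # edgeR depends on limma, which provides makeContrasts

set.seed(1)
# Simulated counts: 1000 genes x 6 samples (placeholder for real data)
counts <- matrix(rnbinom(6000, mu = 100, size = 10), ncol = 6,
                 dimnames = list(paste0("gene", 1:1000),
                                 c("c1", "c2", "a1", "a2", "b1", "b2")))
group <- factor(rep(c("ctrl", "A", "B"), each = 2),
                levels = c("ctrl", "A", "B"))

# One design, one normalization, one dispersion estimate for all samples
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)

# All pairwise comparisons come from the same fit
contr <- makeContrasts(AvsC = A - ctrl,
                       BvsC = B - ctrl,
                       AvsB = A - B,
                       levels = design)
results <- lapply(colnames(contr), function(k) {
  glmQLFTest(fit, contrast = contr[, k])
})
names(results) <- colnames(contr)

# Jointly normalized log2 CPM values, usable for per-gene plots
logcpm <- cpm(y, log = TRUE)
logcpm["gene1", ]
```

A row of `logcpm` can be passed straight to `barplot()`, and `topTags(results$AvsB)` gives the table for any one comparison.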

0

Yeah, sounds reasonable to me. One thing worth mentioning might be the heatmap. Most biologists still think in pairwise terms. When you present your heatmap of top hits, it is probably better to show just (c1,c2,a1,a2) or (c1,c2,b1,b2). All six samples (c1,c2,a1,a2,b1,b2) can be shown if you display all genes, or genes not selected by any pairwise comparison.

Reply • 6.4 years ago by moxu ▴ 510

0

I agree with your options. However, it really depends on your question. For example, you can plot significant genes from C vs A for all samples to see how B samples cluster. Are they more similar to C or A?

Reply • 6.4 years ago by igor 13k

0

Agreed.

A related technical question: on the heatmap, can you cluster the genes using only the C and A samples (for the DEGs between C and A) and let B tag along, without B being involved in the gene clustering? It would be nice if this could be done.

Reply • 6.4 years ago by moxu ▴ 510
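
One way this could be done in base R, as a sketch: build the gene dendrogram from the C and A columns only, then hand it to `heatmap()` together with the full matrix. The matrix `logexpr` is random placeholder data standing in for the log-expression values of the C vs A DEGs, and all object names are assumptions.

```r
set.seed(1)
# Placeholder log-expression matrix: 20 genes x 6 samples
logexpr <- matrix(rnorm(120), ncol = 6,
                  dimnames = list(paste0("gene", 1:20),
                                  c("c1", "c2", "a1", "a2", "b1", "b2")))

# Cluster genes using the control and A columns only; b1 and b2 are ignored
ca_cols <- c("c1", "c2", "a1", "a2")
gene_tree <- hclust(dist(logexpr[, ca_cols]))

# Full matrix, with rows in the order implied by the C/A-only clustering
ordered <- logexpr[gene_tree$order, ]

# Draw all six samples, reusing the C/A-only dendrogram for the rows
# and keeping the columns in their original order
heatmap(logexpr, Rowv = as.dendrogram(gene_tree), Colv = NA, scale = "row")
```

Because `Rowv` is given as a ready-made dendrogram, the B columns are displayed but have no influence on how the genes are grouped.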

score 4 · Accepted Answer · 2017-12-09

In my opinion, the second option is by far the best, for two reasons:

1) As you said, the variance is estimated better with more samples. Note that this doesn't always mean more power (smaller p-values). In some cases the added samples increase the variance estimate, so the p-value increases, but for good reason.
2) It makes sense to normalize and process all your samples together, since they come from the same study. Being able to answer both of your questions (effect of A and effect of B) from the same dataset, changing only the contrast, makes option 2 simpler than option 1. If you choose option 1, the normalization (size) factors and the variance estimates would not be the same in datasets (c1,c2,a1,a2) and (c1,c2,b1,b2). Imagine showing a biologist two Excel sheets of normalized counts, one for dataset (A vs C), the other for dataset (B vs C). Wouldn't it be confusing for them to see that the value for gene X in sample c1 differs between the two sheets?
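
The normalization point can be checked with a small sketch like the one below. It uses simulated counts in which the B samples are made compositionally different on purpose; the data and the object names `nf_all` and `nf_ca` are assumptions for illustration only.

```r
library(edgeR)

set.seed(1)
# Simulated counts: 1000 genes x 6 samples (placeholder for real data)
counts <- matrix(rnbinom(6000, mu = 100, size = 10), ncol = 6,
                 dimnames = list(paste0("gene", 1:1000),
                                 c("c1", "c2", "a1", "a2", "b1", "b2")))
# Inflate 100 genes in the B samples so that B differs in composition
b_cols <- c("b1", "b2")
counts[1:100, b_cols] <- counts[1:100, b_cols] * 10

# TMM normalization factors from all six samples ...
nf_all <- calcNormFactors(DGEList(counts))$samples$norm.factors
names(nf_all) <- colnames(counts)

# ... and from the control and A samples only
ca_cols <- c("c1", "c2", "a1", "a2")
nf_ca <- calcNormFactors(DGEList(counts[, ca_cols]))$samples$norm.factors
names(nf_ca) <- ca_cols

# The factors are rescaled to multiply to 1 within each dataset, so the
# factor for c1 generally differs between the two analyses
rbind(all_samples = nf_all[ca_cols], c_and_a_only = nf_ca)
```

Since `cpm()` uses these factors, the normalized value reported for c1 depends on which samples were analysed together. On the variance side, the one-way design leaves 3 residual degrees of freedom with all six samples (6 samples minus 3 groups) but only 2 with four samples (4 minus 2).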