Question: edgeR analysis dilemma -- what samples to include
moxu wrote (16 months ago):

Suppose you have three experimental conditions for your RNA-seq: control, compound A treatment, compound B treatment, each with replicates. Your RNA-seq samples look like the following:

c1,c2,a1,a2,b1,b2

And your factors are (1,1,2,2,3,3).

You use edgeR for the analysis and have two questions to answer: what are the differentially expressed genes (DEGs) for treatment A, and what are the DEGs for treatment B? Let's focus on the first question: what are the DEGs for treatment A?

In terms of sample inclusion for analysis, we have two options:

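1) c1,c2,a1,a2. You fit only these four samples and compare (a1,a2) with (c1,c2), e.g., coef=2 in a design with an intercept.

2) c1,c2,a1,a2,b1,b2. You fit all six samples and still compare (a1,a2) with (c1,c2), but through a contrast, e.g., contrast=c(-1,1,0) in a design without an intercept.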
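
To see that both options target the same effect, here is a minimal sketch in plain Python (not edgeR). It fits ordinary least squares to made-up log-expression values for one gene. It assumes option 1 uses an intercept design, so the second coefficient is A minus control, and option 2 uses a no-intercept design, so the contrast (-1,1,0) is also A minus control. The values and the helper names `solve` and `ols` are hypothetical.

```python
# Illustrative only: ordinary least squares on made-up log-expression values
# for a single gene. edgeR fits negative binomial GLMs, not OLS.

def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(design, y):
    """Least-squares coefficients via the normal equations X'X beta = X'y."""
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * v for row, v in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)

# Hypothetical log2 expression for c1, c2, a1, a2, b1, b2.
y = [5.0, 5.2, 7.1, 6.9, 5.5, 5.7]

# Option 1: four samples, intercept design (columns: intercept, A vs control).
design1 = [[1, 0], [1, 0], [1, 1], [1, 1]]
beta1 = ols(design1, y[:4])
effect1 = beta1[1]  # the quantity coef=2 would test

# Option 2: six samples, no-intercept design (columns: control, A, B).
design2 = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]]
beta2 = ols(design2, y)
contrast = [-1, 1, 0]
effect2 = sum(c * b for c, b in zip(contrast, beta2))

print(round(effect1, 6), round(effect2, 6))
```

In this simplified setting the estimated A-vs-control effect is the same either way. What can differ between the two options is everything around it: the variance estimate and the normalization, which depend on which samples are in the fit.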

The advantage of option 1) is that it's simple: it's the pairwise comparison we do all the time. The advantage of option 2) is that it uses more samples and should give you more statistical power. The disadvantage is that it's less simple: including the condition B samples could complicate the analysis and usually makes the results harder to explain to your biologists.

I compared the two options and found that they gave somewhat different results. In terms of power (judged by the smallest p-values), sometimes option 1) is better and sometimes option 2) is better.

Which is the better practice?

Thanks!


Unless you have a group that received both A and B, I'm not sure you're going to gain much from option 2.

written 16 months ago by Jeremy Leipzig

I thought similarly. Would you get a better estimate of the variance with more samples?
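
A rough way to picture this, sketched in plain Python with made-up log-expression values for one gene: if all groups are assumed to share one variance, the pooled estimate uses every group, so going from four samples in two groups to six samples in three groups raises the residual degrees of freedom from 2 to 3. This is only an analogy. edgeR estimates negative binomial dispersions and shares information across genes, which this sketch does not attempt. The function name `pooled_variance` is hypothetical.

```python
# Illustrative only: pooled residual variance for one gene on made-up
# log-expression values. Assumes every group shares one variance.

def pooled_variance(groups):
    """Return (pooled variance, residual degrees of freedom) across groups."""
    ss = 0.0
    df = 0
    for values in groups:
        mean = sum(values) / len(values)
        ss += sum((v - mean) ** 2 for v in values)
        df += len(values) - 1
    return ss / df, df

control = [5.0, 5.2]
treat_a = [7.1, 6.9]
treat_b = [5.5, 6.1]  # hypothetical, noisier than the other two groups

var_two, df_two = pooled_variance([control, treat_a])
var_three, df_three = pooled_variance([control, treat_a, treat_b])

print(df_two, df_three)
print(var_two, var_three)
```

With these particular numbers the extra group adds a degree of freedom but also raises the pooled variance, because the B replicates disagree more than the others. More samples give a better-informed estimate, not necessarily a smaller one.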

written 16 months ago by moxu

Carlo Yague wrote (16 months ago):

In my opinion, the second option is by far the better one, for two reasons:

  1. As you said, the variance estimate is better with more samples. By the way, this doesn't always mean more power (smaller p-values). In some cases, the added samples increase the estimated variance, so the p-value increases, but for good reason.

  2. It makes sense to normalize and process all your samples together, since they are from the same study. The fact that you can answer both your questions (effect of A and effect of B) from the same dataset, changing only the contrast, makes option 2 simpler than option 1. If you choose option 1, the size factors and the variance estimates would not be the same in dataset (c1,c2,a1,a2) and dataset (c1,c2,b1,b2). Imagine showing a biologist two Excel sheets with the normalized counts, one for dataset (A vs C), the other for dataset (B vs C). Wouldn't it be confusing for them to see that the value for gene X in sample c1 differs between the two sheets?
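
The point about normalization can be made concrete with a small sketch. Note the swap: this uses DESeq-style median-of-ratios size factors in plain Python, not edgeR's own TMM normalization, simply because it fits in a few lines. The counts are made up and the function name `size_factors` is hypothetical. The only thing it is meant to show is that the factor for the same sample, c1, changes with the set of samples it is normalized with.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors; counts maps sample name -> list of gene counts.

    Assumes every count is positive so that logs are defined.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    # Log geometric mean of each gene across the included samples.
    log_geo = [
        sum(math.log(counts[s][g]) for s in samples) / len(samples)
        for g in range(n_genes)
    ]
    return {
        s: math.exp(median(math.log(counts[s][g]) - log_geo[g] for g in range(n_genes)))
        for s in samples
    }

# Made-up counts for four genes in six samples.
counts = {
    "c1": [100, 200, 50, 400],
    "c2": [110, 190, 55, 420],
    "a1": [300, 210, 52, 390],
    "a2": [280, 205, 48, 410],
    "b1": [90, 600, 150, 405],
    "b2": [95, 580, 160, 395],
}

subset_a = {s: counts[s] for s in ("c1", "c2", "a1", "a2")}
subset_b = {s: counts[s] for s in ("c1", "c2", "b1", "b2")}

sf_a = size_factors(subset_a)["c1"]    # c1 normalized with the A samples
sf_b = size_factors(subset_b)["c1"]    # c1 normalized with the B samples
sf_all = size_factors(counts)["c1"]    # c1 normalized with everything

print(sf_a, sf_b, sf_all)
```

Dividing c1's raw counts by `sf_a` or by `sf_b` gives two different "normalized" values for the same gene in the same sample, which is exactly the two-spreadsheet confusion described above. Normalizing all six samples once gives a single value per gene and sample.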


Well said. Thanks much!

written 16 months ago by moxu

igor wrote (16 months ago):

Regardless of statistics, only option 2 is realistic. Eventually someone is going to ask you to:

  • compare conditions A and B
  • put gene X for all conditions on the same bar plot (the samples need to be normalized together)
  • add significance levels for each comparison (all pairwise comparisons will have to be run)
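
If all six samples are fitted together with one coefficient per group, each of those pairwise comparisons is just a different contrast vector over the same fit. A minimal sketch in plain Python, assuming a no-intercept design whose columns are ordered control, A, B (the group labels and the function name `pairwise_contrasts` are hypothetical):

```python
from itertools import combinations

# Assumed column order of a no-intercept design: one coefficient per group.
groups = ["control", "A", "B"]

def pairwise_contrasts(levels):
    """Map 'x_vs_y' to a contrast vector that is +1 for x and -1 for y."""
    contrasts = {}
    for ref, other in combinations(levels, 2):
        vec = [0] * len(levels)
        vec[levels.index(ref)] = -1
        vec[levels.index(other)] = 1
        contrasts[f"{other}_vs_{ref}"] = vec
    return contrasts

all_contrasts = pairwise_contrasts(groups)
for name, vec in all_contrasts.items():
    print(name, vec)
```

The first vector, (-1, 1, 0), is the A-vs-control contrast from the question. The other two come for free from the same fit.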

Yeah, sounds reasonable to me. One thing worth mentioning might be the heatmap. Most biologists still think in pairwise terms. When you present your heatmap, it is probably better to show just (c1,c2,a1,a2) or (c1,c2,b1,b2) for your top hits. (c1,c2,a1,a2,b1,b2) can be shown if you display all genes or genes not selected by any pairwise comparison.

written 16 months ago by moxu

I agree with your options. However, it really depends on your question. For example, you can plot significant genes from C vs A for all samples to see how B samples cluster. Are they more similar to C or A?

written 16 months ago by igor

Agreed.

A related technical question: on the heatmap, can you cluster the genes that are differentially expressed between C and A using only the C and A samples, and let the B samples tag along without being involved in the gene clustering? It would be nice if this could be done.
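
One way this might work, sketched in plain Python rather than in any particular heatmap package: compute the gene order from the C and A columns only, then apply that order to the full matrix so the B columns are simply carried along. The clustering below is a naive average-linkage routine written for illustration, and the expression values, gene names, and the function name `cluster_order` are all hypothetical.

```python
import math

def cluster_order(rows):
    """Leaf order from naive average-linkage agglomerative clustering of rows."""
    def dist(i, j):
        return math.dist(rows[i], rows[j])

    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Average pairwise distance between members of the two clusters.
                d = sum(dist(i, j) for i in clusters[x] for j in clusters[y])
                d /= len(clusters[x]) * len(clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return clusters[0]

samples = ["c1", "c2", "a1", "a2", "b1", "b2"]
# Made-up centered log-expression values for four C-vs-A hits.
expr = {
    "gene1": [-1.0, -0.9, 1.0, 0.9, 0.8, 1.1],
    "gene2": [1.0, 1.1, -1.0, -1.1, 1.2, 0.9],
    "gene3": [-0.8, -1.1, 0.9, 1.0, -1.0, -0.9],
    "gene4": [0.9, 1.0, -0.9, -1.0, -1.1, -1.0],
}
genes = list(expr)

# Cluster on the C and A columns only; B never enters the distance.
ca_cols = [samples.index(s) for s in ("c1", "c2", "a1", "a2")]
ca_rows = [[expr[g][k] for k in ca_cols] for g in genes]
order = cluster_order(ca_rows)

# Reorder the full rows, B columns included, for plotting.
heatmap_rows = [(genes[i], expr[genes[i]]) for i in order]
for name, values in heatmap_rows:
    print(name, values)
```

Here the genes that go down in A end up next to each other, as do the genes that go up in A, whatever the B samples do. The reordered six-column rows can then be drawn with row clustering switched off in the plotting tool.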

written 16 months ago by moxu