Question: edgeR batch effect
gravatar for NielQC
5 months ago by
NielQC10 wrote:

Hi all,

although this topic has been much discussed, I'm not sure how to proceed in my situation. I'm analyzing an RNA-seq experiment looking for differentially expressed genes using edgeR. The experiment was designed as follows:

Sample  Treatment   Batch
TreatA-1    TreatA  Date1
TreatA-2    TreatA  Date1
TreatA-3    TreatA  Date2
TreatB-1    TreatB  Date3
TreatB-2    TreatB  Date3
TreatB-3    TreatB  Date4
TreatC-1    TreatC  Date1
TreatC-2    TreatC  Date1
TreatC-3    TreatC  Date2
TreatD-1    TreatD  Date3
TreatD-2    TreatD  Date3
TreatD-3    TreatD  Date4

Here the MDS plot:


I can see clearly the batch effect between replicates, where the third one (that is from a different batch) doesn't cluster correctly with the other two. Moreover, I wonder if the big differences observed between treatments AC and BD (explained mainly by dimension 1) is due to the different batches (Date1-Date2 vs Date3-Date4).

I have been trying to include these batch considerations in my design, but it seems the matrix is not of full rank.

group <- as.factor(c("TreatA","TreatA","TreatA",
batch <- factor(c("Date1","Date1","Date2","Date3","Date3","Date4", 
design <- model.matrix(~0+group+batch)

      groupTreatA groupTreatB groupTreatC groupTreatD batchDate2 batchDate3 batchDate4
1            1           0           0           0          0          0          0
2            1           0           0           0          0          0          0
3            1           0           0           0          1          0          0
4            0           1           0           0          0          1          0
5            0           1           0           0          0          1          0
6            0           1           0           0          0          0          1
7            0           0           1           0          0          0          0
8            0           0           1           0          0          0          0
9            0           0           1           0          1          0          0
10           0           0           0           1          0          1          0
11           0           0           0           1          0          1          0
12           0           0           0           1          0          0          1
[1] 1 1 1 1 2 2 2
[1] "contr.treatment"

[1] "contr.treatment"

First something that I still do not quite understand: where is the batch Date 1? I didn't use any intercept so I don't know how to interpret that is not in the design. Second, is there any way to include the batch effect avoiding the not full rank issue?

Thanks in advance.

ADD COMMENTlink modified 5 months ago by Charles Warden6.9k • written 5 months ago by NielQC10
gravatar for h.mon
5 months ago by
h.mon26k wrote:

Answering just half of your questions:

is there any way to include the batch effect avoiding the not full rank issue?

Your treatments and batch are confounded - treatments A and C are all from date 1 and 2, and treatments B and D are all from dates 3 and 4 - so no, you can't include batch in your model. You can compare A vs C, and B vs D.

ADD COMMENTlink written 5 months ago by h.mon26k
gravatar for Benn
5 months ago by
Benn6.9k wrote:

To answer the other half of your question. With ~0+ in your design you remove the intercept for the group factor, but you'll still get the intercept for batch. So batchDate1 is the intercept for factor batch. You interpret this by where batchDate2, 3, and 4 are all 0, that's where batchDate1 == 1.

ADD COMMENTlink written 5 months ago by Benn6.9k
gravatar for Charles Warden
5 months ago by
Charles Warden6.9k
Duarte, CA
Charles Warden6.9k wrote:

Maybe you should check out this thread?

How to see if adjusting batch effect in RNA-seq is working or not

In general, I think it would be best if you used something other than the MDS plot to be confident the batch was appropriately adjusted, without over-fitting and/or over-correction. However, I think the individual feedback (about things like separate subsets of comparisons) is also important for your specific question. Or, if your treatments are defined based upon two conditions (where one may have more of an effect on gene expression than the other), does it help to have 2 groups to compare instead of 4 groups?

If it isn't possible to randomize the batch effects, perhaps you can check the functional enrichment? If it looks reasonable, maybe you can then perform medium-throughput or low-throughput validation of a subset of markers in more samples (if there were cost limitations for full randomization)?

ADD COMMENTlink written 5 months ago by Charles Warden6.9k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1084 users visited in the last hour