edgeR batch effect
3
0
Entering edit mode
2.3 years ago
NielQC ▴ 10

Hi all,

although this topic has been much discussed, I'm not sure how to proceed in my situation. I'm analyzing an RNA-seq experiment looking for differentially expressed genes using edgeR. The experiment was designed as follows:

Sample  Treatment   Batch
TreatA-1    TreatA  Date1
TreatA-2    TreatA  Date1
TreatA-3    TreatA  Date2
TreatB-1    TreatB  Date3
TreatB-2    TreatB  Date3
TreatB-3    TreatB  Date4
TreatC-1    TreatC  Date1
TreatC-2    TreatC  Date1
TreatC-3    TreatC  Date2
TreatD-1    TreatD  Date3
TreatD-2    TreatD  Date3
TreatD-3    TreatD  Date4


Here the MDS plot:

I can see clearly the batch effect between replicates, where the third one (that is from a different batch) doesn't cluster correctly with the other two. Moreover, I wonder if the big differences observed between treatments AC and BD (explained mainly by dimension 1) is due to the different batches (Date1-Date2 vs Date3-Date4).

I have been trying to include these batch considerations in my design, but it seems the matrix is not of full rank.

group <- as.factor(c("TreatA","TreatA","TreatA",
"TreatB","TreatB","TreatB",
"TreatC","TreatC","TreatC",
"TreatD","TreatD","TreatD"))
batch <- factor(c("Date1","Date1","Date2","Date3","Date3","Date4",
"Date1","Date1","Date2","Date3","Date3","Date4"))
design <- model.matrix(~0+group+batch)

design
groupTreatA groupTreatB groupTreatC groupTreatD batchDate2 batchDate3 batchDate4
1            1           0           0           0          0          0          0
2            1           0           0           0          0          0          0
3            1           0           0           0          1          0          0
4            0           1           0           0          0          1          0
5            0           1           0           0          0          1          0
6            0           1           0           0          0          0          1
7            0           0           1           0          0          0          0
8            0           0           1           0          0          0          0
9            0           0           1           0          1          0          0
10           0           0           0           1          0          1          0
11           0           0           0           1          0          1          0
12           0           0           0           1          0          0          1
attr(,"assign")
[1] 1 1 1 1 2 2 2
attr(,"contrasts")
attr(,"contrasts")$group [1] "contr.treatment" attr(,"contrasts")$batch
[1] "contr.treatment"


First something that I still do not quite understand: where is the batch Date 1? I didn't use any intercept so I don't know how to interpret that is not in the design. Second, is there any way to include the batch effect avoiding the not full rank issue?

bactch effect design RNA-Seq edgeR R • 2.7k views
3
Entering edit mode
2.3 years ago
h.mon 32k

is there any way to include the batch effect avoiding the not full rank issue?

Your treatments and batch are confounded - treatments A and C are all from date 1 and 2, and treatments B and D are all from dates 3 and 4 - so no, you can't include batch in your model. You can compare A vs C, and B vs D.

3
Entering edit mode
2.3 years ago
Benn 8.1k

To answer the other half of your question. With ~0+ in your design you remove the intercept for the group factor, but you'll still get the intercept for batch. So batchDate1 is the intercept for factor batch. You interpret this by where batchDate2, 3, and 4 are all 0, that's where batchDate1 == 1.

1
Entering edit mode
2.3 years ago

Maybe you should check out this thread?

How to see if adjusting batch effect in RNA-seq is working or not

In general, I think it would be best if you used something other than the MDS plot to be confident the batch was appropriately adjusted, without over-fitting and/or over-correction. However, I think the individual feedback (about things like separate subsets of comparisons) is also important for your specific question. Or, if your treatments are defined based upon two conditions (where one may have more of an effect on gene expression than the other), does it help to have 2 groups to compare instead of 4 groups?

If it isn't possible to randomize the batch effects, perhaps you can check the functional enrichment? If it looks reasonable, maybe you can then perform medium-throughput or low-throughput validation of a subset of markers in more samples (if there were cost limitations for full randomization)?