Question

Closed:edgeR analysis. Is my design matrix correct, or am I missing a batch effect?

0

Entering edit mode

3.5 years ago

oakhamwolf ▴ 20

Hi all,

Hope someone can help me.

I have an RNA-Seq dataset I am analysing using edgeR. The data relate to a drug treatment at different concentrations (5uM, 10uM and 20uM) vs untreated (UT). I have three replicates for each concentration and three for UT.

My initial goal is simply to look for DE genes between UT vs 5uM and I've gone about this in two ways. In the first I only ingested the count data for the 3 5uM samples and the 3 UT samples, performed the analysis and got ~2K upregulated and ~2K downregulated genes. In the second, I ingested all data for all concentrations, performed the analysis with an altered design matrix and contrast to focus on the UT vs 5uM comparison and got 34 up and 69 downregulated genes. I'm obviously confused as to why I get vastly different numbers of DE genes and obviously must be doing something wrong.

In case my problem lies in my design matrix and contrasts, these are as follows:

First way:
  UT.0  DRUG.5
1    1       0
2    1       0
3    1       0
4    0       1
5    0       1
6    0       1

Second way:
   UT.0  DRUG.10  DRUG.20  DRUG.5
1     1        0        0       0
2     1        0        0       0
3     1        0        0       0
4     0        0        0       1
5     0        0        0       1
6     0        0        0       1
7     0        1        0       0
8     0        1        0       0
9     0        1        0       0
10    0        0        1       0
11    0        0        1       0
12    0        0        1       0

I used these as follows:

UTvs5 <- makeContrasts(DRUG.5-UT.0, levels=design)
fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit, contrast=UTvs5)

In the second way, I assumed that as the contrast should ignore the 10uM and 20uM data, the results for both methods should be the same in terms of DE.

I would really appreciate it if someone could tell me if I am doing something wrong here or if including the other datasets affects other parts of the standard edgeR pipeline (normalisation, dispersion estimates etc), or might it be something else!

Many thanks in advance.

EDIT - including additional analysis code as requested

At this point the count data have been ingested into countdata and I have a phenotype file called phendat, which contains

  SampleID   Conc  Treatment    
 1 UT_1       0     UT       
 2 UT_2       0     UT       
 3 UT_3       0     UT       
 4 DRUG_5_31  5     DRUG     
 5 DRUG_5_32  5     DRUG     
 6 DRUG_5_33  5     DRUG     
 7 DRUG_10_34 10    DRUG     
 8 DRUG_10_35 10    DRUG     
 9 DRUG_10_36 10    DRUG     
10 DRUG_20_37 20    DRUG     
11 DRUG_20_38 20    DRUG     
12 DRUG_20_39 20    DRUG

Retain genes if they are expressed at a counts-per-million (CPM) above 0.5 in at least three samples

myCPM <- cpm(countdata)
thresh <- myCPM > 0.5
keep <- rowSums(thresh) >= 3
counts.keep <- countdata[keep,]
y <- DGEList(counts.keep)

normalise and produce an MDS plot

mypalette.conc <- brewer.pal(4,"Set1")
col.conc <- mypalette.conc[phendat$Conc]
y <- calcNormFactors(y)
plotMDS(y, col=col.conc, dim.plot=c(1,2), xlim=c(-1.5,1.5), ylim=c(-1.5,1), cex=0.8)

MDSPlot

I admit, I should have looked at this more closely before proceeding. Many thanks for setting me straight. To me it looks like there are definite complications with this data. I have checked with the person who generated this data and they are confident that due to the procedures and equipment used it was very unlikely there was a sample swap between DRUG_5_32 and DRUG_10_36. Also I would certainly not expect the 20uM data to be anywhere near the UT. I assume this points to some other hidden batch effect. Please let me know what you think, and many thanks again for your help.

RNA-Seq edgeR design matrix • 163 views

ADD COMMENT • link 3.5 years ago by oakhamwolf ▴ 20