Hi all,
Hope someone can help me.
I have an RNA-Seq dataset I am analysing using edgeR. The data relate to a drug treatment at different concentrations (5uM, 10uM and 20uM) vs untreated (UT). I have three replicates for each concentration and three for UT.
My initial goal is simply to look for DE genes between UT vs 5uM and I've gone about this in two ways. In the first I only ingested the count data for the 3 5uM samples and the 3 UT samples, performed the analysis and got ~2K upregulated and ~2K downregulated genes. In the second, I ingested all data for all concentrations, performed the analysis with an altered design matrix and contrast to focus on the UT vs 5uM comparison and got 34 up and 69 downregulated genes. I'm obviously confused as to why I get vastly different numbers of DE genes and obviously must be doing something wrong.
In case my problem lies in my design matrix and contrasts, these are as follows:
First way: UT.0 DRUG.5 1 1 0 2 1 0 3 1 0 4 0 1 5 0 1 6 0 1 Second way: UT.0 DRUG.10 DRUG.20 DRUG.5 1 1 0 0 0 2 1 0 0 0 3 1 0 0 0 4 0 0 0 1 5 0 0 0 1 6 0 0 0 1 7 0 1 0 0 8 0 1 0 0 9 0 1 0 0 10 0 0 1 0 11 0 0 1 0 12 0 0 1 0
I used these as follows:
UTvs5 <- makeContrasts(DRUG.5-UT.0, levels=design)
fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit, contrast=UTvs5)
In the second way, I assumed that as the contrast should ignore the 10uM and 20uM data, the results for both methods should be the same in terms of DE.
I would really appreciate it if someone could tell me if I am doing something wrong here or if including the other datasets affects other parts of the standard edgeR pipeline (normalisation, dispersion estimates etc), or might it be something else!
Many thanks in advance.
EDIT - including additional analysis code as requested
At this point the count data have been ingested into countdata
and I have a phenotype file called phendat
, which contains
SampleID Conc Treatment 1 UT_1 0 UT 2 UT_2 0 UT 3 UT_3 0 UT 4 DRUG_5_31 5 DRUG 5 DRUG_5_32 5 DRUG 6 DRUG_5_33 5 DRUG 7 DRUG_10_34 10 DRUG 8 DRUG_10_35 10 DRUG 9 DRUG_10_36 10 DRUG 10 DRUG_20_37 20 DRUG 11 DRUG_20_38 20 DRUG 12 DRUG_20_39 20 DRUG
Retain genes if they are expressed at a counts-per-million (CPM) above 0.5 in at least three samples
myCPM <- cpm(countdata)
thresh <- myCPM > 0.5
keep <- rowSums(thresh) >= 3
counts.keep <- countdata[keep,]
y <- DGEList(counts.keep)
normalise and produce an MDS plot
mypalette.conc <- brewer.pal(4,"Set1")
col.conc <- mypalette.conc[phendat$Conc]
y <- calcNormFactors(y)
plotMDS(y, col=col.conc, dim.plot=c(1,2), xlim=c(-1.5,1.5), ylim=c(-1.5,1), cex=0.8)
I admit, I should have looked at this more closely before proceeding. Many thanks for setting me straight. To me it looks like there are definite complications with this data. I have checked with the person who generated this data and they are confident that due to the procedures and equipment used it was very unlikely there was a sample swap between DRUG_5_32 and DRUG_10_36. Also I would certainly not expect the 20uM data to be anywhere near the UT. I assume this points to some other hidden batch effect. Please let me know what you think, and many thanks again for your help.