Differential Expression Analysis using Bioconductor (RStudio) and GEO2R (GEO)
Entering edit mode
13 months ago
a.lathifbt ▴ 20

I was doing a differential gene expression analysis on a microarray dataset with three controls and 7 case samples. I used the limma package in RStudio to do the analysis. Wrote the following code based on Bioconductor vignettes and online tutorials:


edata<-exprs(group) #EXPRESSION MATRIX

mod<-model.matrix(~group_pheno$status)  #group_pheno contains data on the case/control phenotype of the samples



res<-topTable(ebay,sort.by = "none",number = dim(edata)[1])

I got around 91 differentially expressed genes from the above analysis (adj.P.val < 0.05).

However, when I tried to do the same analysis in the GEO2R option in the GEO database, I got no DEGs at the significance level adj.P.Val < 0.05. The lowest reported adj.P.value was around 0.079. I checked the Rscript given in the webpage for this analysis (given below)

# load series and platform data from GEO

gset <- getGEO("GSE34526", GSEMatrix =TRUE, AnnotGPL=TRUE)
if (length(gset) > 1) idx <- grep("GPL570", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]

# make proper column names to match toptable 
fvarLabels(gset) <- make.names(fvarLabels(gset))

# group names for all samples
gsms <- "0001111111"
sml <- c()
for (i in 1:nchar(gsms)) { sml[i] <- substr(gsms,i,i) }

# log2 transform
ex <- exprs(gset)
qx <- as.numeric(quantile(ex, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T))
LogC <- (qx[5] > 100) ||
          (qx[6]-qx[1] > 50 && qx[2] > 0) ||
          (qx[2] > 0 && qx[2] < 1 && qx[4] > 1 && qx[4] < 2)
if (LogC) { ex[which(ex <= 0)] <- NaN
  exprs(gset) <- log2(ex) }

# set up the data and proceed with analysis
sml <- paste("G", sml, sep="")    # set group names
fl <- as.factor(sml)
gset$description <- fl
design <- model.matrix(~ description + 0, gset)
colnames(design) <- levels(fl)
fit <- lmFit(gset, design)
cont.matrix <- makeContrasts(G1-G0, levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2, 0.01)
tT <- topTable(fit2, adjust="fdr", sort.by="B", number=250)

tT <- subset(tT, select=c("ID","adj.P.Val","P.Value","t","B","logFC","Gene.symbol","Gene.title"))
write.table(tT, file=stdout(), row.names=F, sep="\t")

It seems that here they have used contrast matrix designs to calculate the DEGs. Can you please explain the correct approach to this problem and Where the mistake is in my code?

Thanks in Advance!

R gene DEG Bioconductor microarray • 721 views
Entering edit mode
13 months ago

You should get into the habit of checking the defaults for functions. For example, the topTable() function calls in your code chunks are going to behave differently due to default parameters not being set.

Also, why are these different?


design <- model.matrix(~ description + 0, gset)

What is contained in group_pheno$status?


The use of makeContrasts() allows you to specify a specific comparison ('contrast'), that is, for differential expression. In the bottom code chunk, G1 will be compared to G0, and the fold-changes will be reported as G1/G0 (G1 divided by G0). G1 and G0 are undoubtdedly factor levels contained in the gset$description variable. The use of these commands, i.e., makeContrasts and contrasts.fit, just gives you one way to perform specific differential expression comparisons.

One can avoid the use of makeContrasts and contrasts.fit by doing it the way that you have done it in your top code chunk; however, in your code, you have not set any value for coef in topTable(), so, limma does not know which groups you want to compare. You should set a value for coef when calling topTable(). and the value is typically a column name from mod.


Entering edit mode

Thank you, Kevin!

The group_pheno is a data frame (which I created manually) with two columns, the first one being the sample name and the other is the disease status (case or control).

> group_pheno
      Sample                    status
1  GSM850527                    normal
2  GSM850528                    normal
3  GSM850529                    normal
4  GSM850530 Polycystic ovary Syndrome
5  GSM850531 Polycystic ovary Syndrome
6  GSM850532 Polycystic ovary Syndrome
7  GSM850533 Polycystic ovary Syndrome
8  GSM850534 Polycystic ovary Syndrome
9  GSM850535 Polycystic ovary Syndrome
10 GSM850536 Polycystic ovary Syndrome

You should set a value for coef when calling topTable(). and the value is typically a column name from mod.

I tried this as well by adding the argument coef ="group_pheno$statusPolycystic ovary Syndrome" which is the second column in mod --- res<-topTable(ebay,sort.by = "B",number = dim(edata)[1],coef = "group_pheno$statusPolycystic ovary Syndrome",adjust.method = "fdr")

 (Intercept) group_pheno$statusPolycystic ovary Syndrome
1            1                                           0
2            1                                           0
3            1                                           0
4            1                                           1
5            1                                           1
6            1                                           1
7            1                                           1
8            1                                           1
9            1                                           1
10           1                                           1
[1] 0 1
[1] "contr.treatment"

Still, I am getting the same result of 91 DEGs as opposed to what I got from the GEO2R (0 DEG). Any ideas on where it might have gone wrong?

Entering edit mode

Thanks for pasting that data, but how does your design matrix relate to the G1 and G0 of the bottom code chunk? On face value, it looks like you are comparing 2 different traits in your top and bottom code chunks. That will explain the difference in end results.

Just take your time, and go through it step by step.

By the way, based on topTable's default behaviour, I think that your results are those of a differential expression comparison between Polycystic Ovary Syndrome status 1 versus 0, i.e., the final column in your design matrix, mod. However, in the bottom code chunk, this is not what is being compared.

Entering edit mode

PS - group_pheno$statusPolycystic ovary Syndrome is an ugly name for something like this. You should rename either the columns of the design matrix, or the original values in group_pheno.

colnames(mod) <- c('Intercept', 'POS')

Then topTable would be:

res <- topTable(ebay, sort.by = "B", coef = 'POS', adjust.method = "fdr")

After that, please do not even compare the results to those of the bottom code chunk, because different traits are being compared.

Entering edit mode

Thank you for the suggestion, Kevin! Will make that change in the colnames.

My ultimate goal is to identify the genes that are differentially expressed in the seven POS case compared to the three normal healthy controls, so is the code chunk (top) I used does this comparison? or should I go with the GEO2Rs (bottom chunk) for my analysis?

how does your design matrix relate to the G1 and G0 of the bottom code chunk?

I am too struggling with understanding how they can be related. I understand in the bottom chunk that they are assigning a G0/G1 (control/case) factor for the samples in the Expression Set and later use that to compare the groups via contrast matrix. But it does not seem straight forward in how the groups are being compared...

Entering edit mode

Hello Kevin!

Could you kindly please clarify my doubt on which code chunk to use for the type of analysis I am working on.

I'm quite new to this and really clueless about which is the right approach for my work...

Thank you

Entering edit mode

Hey Dude / Dudette. I would use the first code chunk. The second one, which is [I believe] automatically generated by GEO, can be misleading and sometimes incorrect / unsuitable for the study.

edata <- exprs(group)

# sort out the metadata names and set 'normal' as reference level
group_pheno$status <- as.character(group_pheno$status)
group_pheno$status <- gsub('Polycystic ovary Syndrome', 'POS', group_pheno$status)
group_pheno$status <- factor(group_pheno$status, levels = c('normal', 'POS'))

mod <- model.matrix( ~ group_pheno$status)
colnames(mod) <- c('Intercept', 'POS')

fit <- lmFit(edata, mod)
ebay <- eBayes(fit)

res <- topTable(ebay, coef = 'POS', number = Inf, p.value = 1, adjust = 'BH')

I believe that should work.


Login before adding your answer.

Traffic: 2279 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6