Question

Differential Expression Analysis using Bioconductor (RStudio) and GEO2R (GEO)

0

Entering edit mode

4.1 years ago

a.lathifbt ▴ 30

I was doing a differential gene expression analysis on a microarray dataset with three controls and 7 case samples. I used the limma package in RStudio to do the analysis. Wrote the following code based on Bioconductor vignettes and online tutorials:

library(limma)

edata<-exprs(group) #EXPRESSION MATRIX

mod<-model.matrix(~group_pheno$status)  #group_pheno contains data on the case/control phenotype of the samples

fit<-lmFit(edata,mod)

ebay<-eBayes(fit)

res<-topTable(ebay,sort.by = "none",number = dim(edata)[1])

I got around 91 differentially expressed genes from the above analysis (adj.P.val < 0.05).

However, when I tried to do the same analysis in the GEO2R option in the GEO database, I got no DEGs at the significance level adj.P.Val < 0.05. The lowest reported adj.P.value was around 0.079. I checked the Rscript given in the webpage for this analysis (given below)

# load series and platform data from GEO

gset <- getGEO("GSE34526", GSEMatrix =TRUE, AnnotGPL=TRUE)
if (length(gset) > 1) idx <- grep("GPL570", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]

# make proper column names to match toptable 
fvarLabels(gset) <- make.names(fvarLabels(gset))

# group names for all samples
gsms <- "0001111111"
sml <- c()
for (i in 1:nchar(gsms)) { sml[i] <- substr(gsms,i,i) }

# log2 transform
ex <- exprs(gset)
qx <- as.numeric(quantile(ex, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T))
LogC <- (qx[5] > 100) ||
          (qx[6]-qx[1] > 50 && qx[2] > 0) ||
          (qx[2] > 0 && qx[2] < 1 && qx[4] > 1 && qx[4] < 2)
if (LogC) { ex[which(ex <= 0)] <- NaN
  exprs(gset) <- log2(ex) }

# set up the data and proceed with analysis
sml <- paste("G", sml, sep="")    # set group names
fl <- as.factor(sml)
gset$description <- fl
design <- model.matrix(~ description + 0, gset)
colnames(design) <- levels(fl)
fit <- lmFit(gset, design)
cont.matrix <- makeContrasts(G1-G0, levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2, 0.01)
tT <- topTable(fit2, adjust="fdr", sort.by="B", number=250)

tT <- subset(tT, select=c("ID","adj.P.Val","P.Value","t","B","logFC","Gene.symbol","Gene.title"))
write.table(tT, file=stdout(), row.names=F, sep="\t")

It seems that here they have used contrast matrix designs to calculate the DEGs. Can you please explain the correct approach to this problem and Where the mistake is in my code?

Thanks in Advance!

R gene DEG Bioconductor microarray • 2.5k views

ADD COMMENT • link updated 5 months ago by Bruno • 0 • written 4.1 years ago by a.lathifbt ▴ 30

score 0 · Answer 1 · 2020-03-04

0

Entering edit mode

4.1 years ago

Kevin Blighe 87k

You should get into the habit of checking the defaults for functions. For example, the topTable() function calls in your code chunks are going to behave differently due to default parameters not being set.

Also, why are these different?

mod<-model.matrix(~group_pheno$status)

design <- model.matrix(~ description + 0, gset)

What is contained in group_pheno$status?

----------------

The use of makeContrasts() allows you to specify a specific comparison ('contrast'), that is, for differential expression. In the bottom code chunk, G1 will be compared to G0, and the fold-changes will be reported as G1/G0 (G1 divided by G0). G1 and G0 are undoubtdedly factor levels contained in the gset$description variable. The use of these commands, i.e., makeContrasts and contrasts.fit, just gives you one way to perform specific differential expression comparisons.

One can avoid the use of makeContrasts and contrasts.fit by doing it the way that you have done it in your top code chunk; however, in your code, you have not set any value for coef in topTable(), so, limma does not know which groups you want to compare. You should set a value for coef when calling topTable(). and the value is typically a column name from mod.

Kevin

ADD COMMENT • link 4.1 years ago by Kevin Blighe 87k

0

Entering edit mode

Thank you, Kevin!

The group_pheno is a data frame (which I created manually) with two columns, the first one being the sample name and the other is the disease status (case or control).

> group_pheno
      Sample                    status
1  GSM850527                    normal
2  GSM850528                    normal
3  GSM850529                    normal
4  GSM850530 Polycystic ovary Syndrome
5  GSM850531 Polycystic ovary Syndrome
6  GSM850532 Polycystic ovary Syndrome
7  GSM850533 Polycystic ovary Syndrome
8  GSM850534 Polycystic ovary Syndrome
9  GSM850535 Polycystic ovary Syndrome
10 GSM850536 Polycystic ovary Syndrome

You should set a value for coef when calling topTable(). and the value is typically a column name from mod.

I tried this as well by adding the argument coef ="group_pheno$statusPolycystic ovary Syndrome" which is the second column in mod --- res<-topTable(ebay,sort.by = "B",number = dim(edata)[1],coef = "group_pheno$statusPolycystic ovary Syndrome",adjust.method = "fdr")

>mod
 (Intercept) group_pheno$statusPolycystic ovary Syndrome
1            1                                           0
2            1                                           0
3            1                                           0
4            1                                           1
5            1                                           1
6            1                                           1
7            1                                           1
8            1                                           1
9            1                                           1
10           1                                           1
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$`group_pheno$status`
[1] "contr.treatment"

Still, I am getting the same result of 91 DEGs as opposed to what I got from the GEO2R (0 DEG). Any ideas on where it might have gone wrong?

ADD REPLY • link 4.1 years ago by a.lathifbt ▴ 30

1

Entering edit mode

Thanks for pasting that data, but how does your design matrix relate to the G1 and G0 of the bottom code chunk? On face value, it looks like you are comparing 2 different traits in your top and bottom code chunks. That will explain the difference in end results.

Just take your time, and go through it step by step.

By the way, based on topTable's default behaviour, I think that your results are those of a differential expression comparison between Polycystic Ovary Syndrome status 1 versus 0, i.e., the final column in your design matrix, mod. However, in the bottom code chunk, this is not what is being compared.

ADD REPLY • link 4.1 years ago by Kevin Blighe 87k

0

Entering edit mode

PS - group_pheno$statusPolycystic ovary Syndrome is an ugly name for something like this. You should rename either the columns of the design matrix, or the original values in group_pheno.

colnames(mod) <- c('Intercept', 'POS')

Then topTable would be:

res <- topTable(ebay, sort.by = "B", coef = 'POS', adjust.method = "fdr")

After that, please do not even compare the results to those of the bottom code chunk, because different traits are being compared.

ADD REPLY • link 4.1 years ago by Kevin Blighe 87k

0

Entering edit mode

Thank you for the suggestion, Kevin! Will make that change in the colnames.

My ultimate goal is to identify the genes that are differentially expressed in the seven POS case compared to the three normal healthy controls, so is the code chunk (top) I used does this comparison? or should I go with the GEO2Rs (bottom chunk) for my analysis?

how does your design matrix relate to the G1 and G0 of the bottom code chunk?

I am too struggling with understanding how they can be related. I understand in the bottom chunk that they are assigning a G0/G1 (control/case) factor for the samples in the Expression Set and later use that to compare the groups via contrast matrix. But it does not seem straight forward in how the groups are being compared...

ADD REPLY • link 4.1 years ago by a.lathifbt ▴ 30

0

Entering edit mode

Hello Kevin!

Could you kindly please clarify my doubt on which code chunk to use for the type of analysis I am working on.

I'm quite new to this and really clueless about which is the right approach for my work...

Thank you

ADD REPLY • link 4.1 years ago by a.lathifbt ▴ 30

1

Entering edit mode

Hey Dude / Dudette. I would use the first code chunk. The second one, which is [I believe] automatically generated by GEO, can be misleading and sometimes incorrect / unsuitable for the study.

library(limma)
edata <- exprs(group)

# sort out the metadata names and set 'normal' as reference level
group_pheno$status <- as.character(group_pheno$status)
group_pheno$status <- gsub('Polycystic ovary Syndrome', 'POS', group_pheno$status)
group_pheno$status <- factor(group_pheno$status, levels = c('normal', 'POS'))

mod <- model.matrix( ~ group_pheno$status)
colnames(mod) <- c('Intercept', 'POS')

fit <- lmFit(edata, mod)
ebay <- eBayes(fit)

res <- topTable(ebay, coef = 'POS', number = Inf, p.value = 1, adjust = 'BH')

I believe that should work.

ADD REPLY • link 4.1 years ago by Kevin Blighe 87k

score 0 · Answer 2 · 2023-11-04

Hello everyone, I've been having the same question for a while now. I'm also conducting my own analysis of differential expression on a microarray dataset in R. However, the data is different from the results obtained using GEO2R. Here's my line of code:

my_id <- "GSE80178"
gse <- getGEO(my_id, GSEMatrix = TRUE, AnnotGPL = TRUE)

length(gse)
gse <- gse[[1]]
gse


sampleInfo

sample  treat

GSM2114231  tissue: foot ulcer  diabetic
GSM2114232  tissue: foot ulcer  diabetic
GSM2114233  tissue: foot ulcer  diabetic
GSM2114234  tissue: foot ulcer  diabetic
GSM2114235  tissue: foot ulcer  diabetic
GSM2114236  tissue: foot ulcer  diabetic
GSM2114240  tissue: foot skin   non-diabetic
GSM2114241  tissue: foot skin   non-diabetic
GSM2114242  tissue: foot skin   non-diabetic


design <- model.matrix(~0 + sampleInfo$sample)
design

colnames(design) <- c("NDM", "DFU")
design



NDM DFU

    0   1
    0   1
    0   1
    0   1
    0   1
    0   1
    1   0
    1   0
    1   0

cutoff <- median(exprs(gse[,-c(7:9)]))

is_expressed <- exprs(gse[,-c(7:9)]) > cutoff

keep <- rowSums(is_expressed) > 2

gse <- gse[keep, ]

fit <- lmFit(exprs(gse[,-c(7:9)]), design)

contrasts <- makeContrasts(DFU - NDM,
                           levels=design)

fit2 <- contrasts.fit(fit, contrasts)

fit2 <- eBayes(fit2)

top_table <- topTable(fit2, n = Inf, coef = 1)

logFC   AveExpr t   P.Value adj.P.Val   B
<dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
17121740    4.332877    12.256391   26.39945    1.618705e-10    2.677875e-06    13.55329
16671139    6.357127    10.556574   25.89515    1.954927e-10    2.677875e-06    13.42962
16693409    6.747750    9.681567    25.01266    2.743913e-10    2.677875e-06    13.20273
16693406    5.705290    9.171220    24.38503    3.517157e-10    2.677875e-06    13.03277
16990767    7.658577    9.380941    22.34467    8.246736e-10    5.023087e-06    12.42553
16848403    -5.384042   6.735076    -19.72763   2.765824e-09    1.403886e-05    11.50286

To find the anotaçõehugene20sttranscriptcluster.db, I encountered many values with expression among the samples, but GeneName, Symbol, or Entrez information was not available. After filtering out these NA values and applying a cutoff of adjust < 0.05 and LogFC > 1 or LogFC < -1, I obtained a list with 3,315 downregulated genes, 883 upregulated genes, and 16,941 showing no significance.

However, it seems that these values do not align with the analyses conducted by GEO2R. Can I trust my results based on the analysis I performed?

GEO2R Script:

#   Differential expression analysis with limma
library(GEOquery)

library(limma)

library(umap)

# load series and platform data from GEO

gset <- getGEO("GSE80178", GSEMatrix =TRUE, AnnotGPL=FALSE)

if (length(gset) > 1) idx <- grep("GPL16686", attr(gset, "names")) else idx <- 1

gset <- gset[[idx]]

# make proper column names to match toptable 

fvarLabels(gset) <- make.names(fvarLabels(gset))

# group membership for all samples

gsms <- "111111XXX000"

sml <- strsplit(gsms, split="")[[1]]

# filter out excluded samples (marked as "X")

sel <- which(sml != "X")

sml <- sml[sel]

gset <- gset[ ,sel]

# log2 transformation

ex <- exprs(gset)

qx <- as.numeric(quantile(ex, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T))

LogC <- (qx[5] > 100) ||
          (qx[6]-qx[1] > 50 && qx[2] > 0)
if (LogC) { ex[which(ex <= 0)] <- NaN
  exprs(gset) <- log2(ex) }

# assign samples to groups and set up design matrix
gs <- factor(sml)
groups <- make.names(c("control","dfu"))
levels(gs) <- groups
gset$group <- gs
design <- model.matrix(~group + 0, gset)
colnames(design) <- levels(gs)

gset <- gset[complete.cases(exprs(gset)), ] # skip missing values

fit <- lmFit(gset, design)  # fit linear model

# set up contrasts of interest and recalculate model coefficients

cts <- paste(groups[1], groups[2], sep="-")

cont.matrix <- makeContrasts(contrasts=cts, levels=design)

fit2 <- contrasts.fit(fit, cont.matrix)

# compute statistics and table of top significant genes

fit2 <- eBayes(fit2, 0.01)

tT <- topTable(fit2, adjust="fdr", sort.by="B", number=250)

tT <- subset(tT, select=c("ID","adj.P.Val","P.Value","t","B","logFC","RANGE_STRAND","RANGE_START","RANGE_END","GB_ACC","SPOT_ID","RANGE_GB"))
write.table(tT, file=stdout(), row.names=F, sep="\t")