Question

GEO2R Analysis on Cel files downloaded from ArrayExpress

0

Entering edit mode

10.0 years ago

Bioinformatist Newbie ▴ 270

Hello,

I want to apply NCBI GEO's GEO2R analysis on cel files downloaded from ArrayExpress. Actually I am interested in finding differentially expressed genes in healthy vs diseased samples. If anybody has tried such a way can you share your code?

I have tried basic limma analysis by GCRMA or RMA normalization on data available in NCBI GEO and compared my result with GEO2R results and the results are very different (for example, genes above lfc 1), perhaps because GEO2R just apply log2 normalization. I want to use same approach used by GEO2R but by reading cel files (obtained from ArrayExpress) from my computer.

Kindly share your experience. Thanks

Limma ArrayExpress differential-expression R • 5.2k views

ADD COMMENT • link updated 2.7 years ago by Ram 45k • written 10.0 years ago by Bioinformatist Newbie ▴ 270

Ram · Answer 1 · 2015-07-27

As far as I know, and according to the documentation on GEO2R, GEO2R does not do any QC or normalization. It specifically states, "Submitters are asked to supply normalized data." but in my experience, this is rarely the case. Therefore, the GEO2R results are often based on non-QC and non-normalized data. If you you are using GCRMA, you will likely never get the same number of DE Genes as GEO2R.

You might consider trying iPathwayGuide. It allows you to upload raw CEL files (for certain platforms) and will perform the QC and normalization for you on the fly. It's free to use. But this is a web application - not an R script.

You can find out more on our website or watch this short 2 min video:

Ram · Answer 2 · 2015-07-24

Take GSE53045 as the example: GSE53045:Epigenome analysis of smoking in peripheral blood mononuclear cells (PBMC) samples:50 smokers and 61 non-smokers.

# Version info: R 2.14.1, Biobase 2.15.3, GEOquery 2.23.2, limma 3.10.1
# R scripts generated  Fri Jul 24 16:45:30 EDT 2015

################################################################
#   Differential expression analysis with limma
library(Biobase)
library(GEOquery)
library(limma)

# load series and platform data from GEO

gset <- getGEO("GSE53045", GSEMatrix =TRUE)
if (length(gset) > 1) idx <- grep("GPL13534", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]

# make proper column names to match toptable 
fvarLabels(gset) <- make.names(fvarLabels(gset))

# group names for all samples
sml <- c("G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","X","X","X");

# eliminate samples marked as "X"
sel <- which(sml != "X")
sml <- sml[sel]
gset <- gset[ ,sel]

# log2 transform
ex <- exprs(gset)
qx <- as.numeric(quantile(ex, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T))
LogC <- (qx[5] > 100) ||
          (qx[6]-qx[1] > 50 && qx[2] > 0) ||
          (qx[2] > 0 && qx[2] < 1 && qx[4] > 1 && qx[4] < 2)
if (LogC) { ex[which(ex <= 0)] <- NaN
  exprs(gset) <- log2(ex) }

# set up the data and proceed with analysis
fl <- as.factor(sml)
gset$description <- fl
design <- model.matrix(~ description + 0, gset)
colnames(design) <- levels(fl)
fit <- lmFit(gset, design)
cont.matrix <- makeContrasts(G1-G0, levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2, 0.01)
tT <- topTable(fit2, adjust="fdr", sort.by="B", number=250)

tT <- subset(tT, select=c("ID","adj.P.Val","P.Value","t","B","logFC","RANGE_START","RANGE_END","RANGE_GB","SPOT_ID"))
write.table(tT, file=stdout(), row.names=F, sep="\t")

################################################################
#   Boxplot for selected GEO samples
library(Biobase)
library(GEOquery)

# load series and platform data from GEO

gset <- getGEO("GSE53045", GSEMatrix =TRUE)
if (length(gset) > 1) idx <- grep("GPL13534", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]

# group names for all samples in a series
sml <- c("G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","X","X","X")

# eliminate samples marked as "X"
sel <- which(sml != "X")
sml <- sml[sel]
gset <- gset[ ,sel]

# order samples by group
ex <- exprs(gset)[ , order(sml)]
sml <- sml[order(sml)]
fl <- as.factor(sml)
labels <- c("Smoker","Control")

# set parameters and draw the plot
palette(c("#dfeaf4","#f4dfdf", "#AABBCC"))
dev.new(width=4+dim(gset)[[2]]/5, height=6)
par(mar=c(2+round(max(nchar(sampleNames(gset)))/2),4,2,1))
title <- paste ("GSE53045", '/', annotation(gset), " selected samples", sep ='')
boxplot(ex, boxwex=0.6, notch=T, main=title, outline=FALSE, las=2, col=fl)
legend("topleft", labels, fill=palette(), bty="n")

Ram · Answer 3 · 2015-07-24

0

Entering edit mode

10.0 years ago

Zhilong Jia ★ 2.2k

geo2r supplies code actually after analyzing a GSE dataset. You can check the code.

ADD COMMENT • link updated 3.2 years ago by Ram 45k • written 10.0 years ago by Zhilong Jia ★ 2.2k