GEO2R Analysis on Cel files downloaded from ArrayExpress
3
0
Entering edit mode
7.2 years ago

Hello,

 

I want to apply NCBI GEO's GEO2R analysis on cel files downloaded from ArrayExpress. Actually I am interested in finding differentially expressed genes in healthy vs diseased samples. If anybody has tried such a way can you share your code?

I have tried basic limma analysis by GCRMA or RMA normalization on data available in NCBI GEO and compared my result with GEO2R results and the results are very different (for example, genes above lfc 1), perhaps because GEO2R just apply log2 normalization. I want to use same approach used by GEO2R but by reading cel files (obtained from ArrayExpress) from my computer.

Kindly share your experience. Thanks

Limma differential expression ArrayExpress R • 3.6k views
ADD COMMENT
1
Entering edit mode
7.2 years ago
Shicheng Guo ★ 9.2k

Take GSE53045 as the example: GSE53045:Epigenome analysis of smoking in peripheral blood mononuclear cells (PBMC) samples:50 smokers and 61 non-smokers.

# Version info: R 2.14.1, Biobase 2.15.3, GEOquery 2.23.2, limma 3.10.1
# R scripts generated  Fri Jul 24 16:45:30 EDT 2015

################################################################
#   Differential expression analysis with limma
library(Biobase)
library(GEOquery)
library(limma)

# load series and platform data from GEO

gset <- getGEO("GSE53045", GSEMatrix =TRUE)
if (length(gset) > 1) idx <- grep("GPL13534", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]

# make proper column names to match toptable 
fvarLabels(gset) <- make.names(fvarLabels(gset))

# group names for all samples
sml <- c("G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","X","X","X");

# eliminate samples marked as "X"
sel <- which(sml != "X")
sml <- sml[sel]
gset <- gset[ ,sel]

# log2 transform
ex <- exprs(gset)
qx <- as.numeric(quantile(ex, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T))
LogC <- (qx[5] > 100) ||
          (qx[6]-qx[1] > 50 && qx[2] > 0) ||
          (qx[2] > 0 && qx[2] < 1 && qx[4] > 1 && qx[4] < 2)
if (LogC) { ex[which(ex <= 0)] <- NaN
  exprs(gset) <- log2(ex) }

# set up the data and proceed with analysis
fl <- as.factor(sml)
gset$description <- fl
design <- model.matrix(~ description + 0, gset)
colnames(design) <- levels(fl)
fit <- lmFit(gset, design)
cont.matrix <- makeContrasts(G1-G0, levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2, 0.01)
tT <- topTable(fit2, adjust="fdr", sort.by="B", number=250)

tT <- subset(tT, select=c("ID","adj.P.Val","P.Value","t","B","logFC","RANGE_START","RANGE_END","RANGE_GB","SPOT_ID"))
write.table(tT, file=stdout(), row.names=F, sep="\t")

################################################################
#   Boxplot for selected GEO samples
library(Biobase)
library(GEOquery)

# load series and platform data from GEO

gset <- getGEO("GSE53045", GSEMatrix =TRUE)
if (length(gset) > 1) idx <- grep("GPL13534", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]

# group names for all samples in a series
sml <- c("G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","X","X","X")

# eliminate samples marked as "X"
sel <- which(sml != "X")
sml <- sml[sel]
gset <- gset[ ,sel]

# order samples by group
ex <- exprs(gset)[ , order(sml)]
sml <- sml[order(sml)]
fl <- as.factor(sml)
labels <- c("Smoker","Control")

# set parameters and draw the plot
palette(c("#dfeaf4","#f4dfdf", "#AABBCC"))
dev.new(width=4+dim(gset)[[2]]/5, height=6)
par(mar=c(2+round(max(nchar(sampleNames(gset)))/2),4,2,1))
title <- paste ("GSE53045", '/', annotation(gset), " selected samples", sep ='')
boxplot(ex, boxwex=0.6, notch=T, main=title, outline=FALSE, las=2, col=fl)
legend("topleft", labels, fill=palette(), bty="n")
ADD COMMENT
0
Entering edit mode

Dear Shicheng,

I am aware that GEO provides the R code for the analysis. But what if I want to compare two groups from different NCBI submissions. Let's say "PBMC 1 Smoker" from GSE53045 and "Tretinoin" treated samples from GSE5007. I assume the one possible way (which I know) is to download the celfiles for these groups and read them from my computer. In this case how to read files and make an experssion set and apply limma so that I get the same results as will be provided by the GEO2R analysis (in case if it could has provide the cross-submission comparison)?

My problem is even if I analyze the experiment you mentioneded (GSE53045) by this approach I end up having different number of DEGs above a certain threshold. I want to get the same result as achieved by GEO2R. Kindly help me on this.

What I do when I read files from my computer:

  1. Read phenodata.txt file (describing the samples and groups)
  2. Apply GCRMA on them
  3. exprs(celfiles.gcrma)
  4. Make design of experiment
  5. Make contrast.matrix
  6. Apply Limma
  7. Save toptable with lfc more than 1.
ADD REPLY
0
Entering edit mode

In such situation, I suggest you to download both of these dataset in idat format and then call the signal and then do the differential analysis by costumed R or python script. Such jobs are belonged to bioinformatic or statistician.

ADD REPLY
1
Entering edit mode
7.2 years ago
andrew ▴ 550

As far as I know, and according to the documentation on GEO2R, GEO2R does not do any QC or normalization. It specifically states, "Submitters are asked to supply normalized data." but in my experience, this is rarely the case. Therefore, the GEO2R results are often based on non-QC and non-normalized data. If you you are using GCRMA, you will likely never get the same number of DE Genes as GEO2R.

You might consider trying iPathwayGuide. It allows you to upload raw CEL files (for certain platforms) and will perform the QC and normalization for you on the fly. It's free to use. But this is a web application - not an R script.

You can find out more on our website or watch this short 2 min video:

ADD COMMENT
0
Entering edit mode
7.2 years ago
Zhilong Jia ★ 2.1k

g​eo2r supplies code actually after analyzing a GSE dataset. You can check the code.

ADD COMMENT

Login before adding your answer.

Traffic: 1094 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6