Question: GEO2R Analysis on Cel files downloaded from ArrayExpress
0
gravatar for Bioinformatist Newbie
3.7 years ago by
Germany
Bioinformatist Newbie230 wrote:

Hello,

 

I want to apply NCBI GEO's GEO2R analysis on cel files downloaded from ArrayExpress. Actually I am interested in finding differentially expressed genes in healthy vs diseased samples. If anybody has tried such a way can you share your code?

I have tried basic limma analysis by GCRMA or RMA normalization on data available in NCBI GEO and compared my result with GEO2R results and the results are very different (for example, genes above lfc 1), perhaps because GEO2R just apply log2 normalization. I want to use same approach used by GEO2R but by reading cel files (obtained from ArrayExpress) from my computer.

Kindly share your experience. Thanks

ADD COMMENTlink modified 3.7 years ago by andrew460 • written 3.7 years ago by Bioinformatist Newbie230
1
gravatar for Shicheng Guo
3.7 years ago by
Shicheng Guo7.4k
Shicheng Guo7.4k wrote:
Take GSE53045 as the example: GSE53045:Epigenome analysis of smoking in peripheral blood mononuclear cells (PBMC) samples:50 smokers and 61 non-smokers.
# Version info: R 2.14.1, Biobase 2.15.3, GEOquery 2.23.2, limma 3.10.1
# R scripts generated  Fri Jul 24 16:45:30 EDT 2015

################################################################
#   Differential expression analysis with limma
library(Biobase)
library(GEOquery)
library(limma)

# load series and platform data from GEO

gset <- getGEO("GSE53045", GSEMatrix =TRUE)
if (length(gset) > 1) idx <- grep("GPL13534", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]

# make proper column names to match toptable 
fvarLabels(gset) <- make.names(fvarLabels(gset))

# group names for all samples
sml <- c("G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","X","X","X");

# eliminate samples marked as "X"
sel <- which(sml != "X")
sml <- sml[sel]
gset <- gset[ ,sel]

# log2 transform
ex <- exprs(gset)
qx <- as.numeric(quantile(ex, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T))
LogC <- (qx[5] > 100) ||
          (qx[6]-qx[1] > 50 && qx[2] > 0) ||
          (qx[2] > 0 && qx[2] < 1 && qx[4] > 1 && qx[4] < 2)
if (LogC) { ex[which(ex <= 0)] <- NaN
  exprs(gset) <- log2(ex) }

# set up the data and proceed with analysis
fl <- as.factor(sml)
gset$description <- fl
design <- model.matrix(~ description + 0, gset)
colnames(design) <- levels(fl)
fit <- lmFit(gset, design)
cont.matrix <- makeContrasts(G1-G0, levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2, 0.01)
tT <- topTable(fit2, adjust="fdr", sort.by="B", number=250)

tT <- subset(tT, select=c("ID","adj.P.Val","P.Value","t","B","logFC","RANGE_START","RANGE_END","RANGE_GB","SPOT_ID"))
write.table(tT, file=stdout(), row.names=F, sep="\t")

################################################################
#   Boxplot for selected GEO samples
library(Biobase)
library(GEOquery)

# load series and platform data from GEO

gset <- getGEO("GSE53045", GSEMatrix =TRUE)
if (length(gset) > 1) idx <- grep("GPL13534", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]

# group names for all samples in a series
sml <- c("G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G0","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","G1","X","X","X")

# eliminate samples marked as "X"
sel <- which(sml != "X")
sml <- sml[sel]
gset <- gset[ ,sel]

# order samples by group
ex <- exprs(gset)[ , order(sml)]
sml <- sml[order(sml)]
fl <- as.factor(sml)
labels <- c("Smoker","Control")

# set parameters and draw the plot
palette(c("#dfeaf4","#f4dfdf", "#AABBCC"))
dev.new(width=4+dim(gset)[[2]]/5, height=6)
par(mar=c(2+round(max(nchar(sampleNames(gset)))/2),4,2,1))
title <- paste ("GSE53045", '/', annotation(gset), " selected samples", sep ='')
boxplot(ex, boxwex=0.6, notch=T, main=title, outline=FALSE, las=2, col=fl)
legend("topleft", labels, fill=palette(), bty="n")
ADD COMMENTlink written 3.7 years ago by Shicheng Guo7.4k

Dear Shicheng,

I am aware that GEO provides the R code for the analysis. But what if I want to compare two groups from different NCBI submissions. Let's say " PBMC 1 Smoker" from GSE53045 and "Tretinoin" treated samples from GSE5007. I assume the one possible way (which I know) is to download the celfiles for these groups and read them from my computer. In this case how to read files and make an experssion set and apply limma so that I get the same results as will be provided by the GEO2R analysis (in case if it could has provide the cross-submission comparison)?

My problem is even if I analyze the experiment you mentioneded (GSE53045) by this approach I end up having different number of DEGs above a certain threshold. I want to get the same result as achieved by GEO2R. Kindly help me on this.

What I do when I read files from my computer:

1- Read phenodata.txt file (describing the samples and groups)

2- Apply GCRMA on them

3- exprs(celfiles.gcrma)

4- Make design of experiment

5- Make contrast.matrix

6- Apply Limma

7- Save toptable with lfc more than 1.

ADD REPLYlink written 3.7 years ago by Bioinformatist Newbie230

In such situation, I suggest you to download both of these dataset in idat format and then call the signal and then do the differential analysis by costumed R or python script. Such jobs are belonged to bioinformatic or statistician. 

ADD REPLYlink written 3.7 years ago by Shicheng Guo7.4k
0
gravatar for Zhilong Jia
3.7 years ago by
Zhilong Jia1.4k
London
Zhilong Jia1.4k wrote:

g​eo2r supplies code actually after analyzing a GSE dataset. You can check the code.

ADD COMMENTlink written 3.7 years ago by Zhilong Jia1.4k
0
gravatar for andrew
3.7 years ago by
andrew460
United States
andrew460 wrote:

As far as I know, and according to the documentation on GEO2R, GEO2R does not do any QC or normalization.  It specifically states, "Submitters are asked to supply normalized data." but in my experience, this is rarely the case.  Therefore, the GEO2R results are often based on non-QC and non-normalized data.  If you you are using GCRMA, you will likely never get the same number of DE Genes as GEO2R.

You might consider trying iPathwayGuide.  It allows you to upload raw CEL files (for certain platforms) and will perform the QC and normalization for you on the fly.  It's free to use.  But this is a web application - not an R script.

You can find out more on our website or watch this short 2 min video. https://www.youtube.com/watch?v=ocWHyUWYpbA

 

ADD COMMENTlink written 3.7 years ago by andrew460
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 676 users visited in the last hour