Question: In the GEO2R R script, which lines are responsible for background correction and for replacing replicated probes with the mean?
Sib20 wrote (12 months ago):

The commands below are the GEO2R R script used to analyze my microarray data. I want to know which lines are responsible for (1) replacing replicated probes with their mean and (2) background correction.

#   Differential expression analysis with limma

library(Biobase)   # exprs(), fvarLabels()
library(GEOquery)  # getGEO()
library(limma)     # lmFit(), contrasts.fit(), eBayes(), topTable()

# load series and platform data from GEO

gset <- getGEO("GSE116959", GSEMatrix =TRUE, AnnotGPL=FALSE)
if (length(gset) > 1) idx <- grep("GPL17077", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]

# make proper column names to match toptable 
fvarLabels(gset) <- make.names(fvarLabels(gset))

# group names for all samples
gsms <- "00010000000100000000000000000000011110000010000000000000000011010010"
sml <- c()
for (i in 1:nchar(gsms)) { sml[i] <- substr(gsms,i,i) }

# log2 transform
ex <- exprs(gset)
qx <- as.numeric(quantile(ex, c(0., 0.25, 0.5, 0.75, 0.99, 1.0), na.rm=T))
LogC <- (qx[5] > 100) ||
          (qx[6]-qx[1] > 50 && qx[2] > 0) ||
          (qx[2] > 0 && qx[2] < 1 && qx[4] > 1 && qx[4] < 2)
if (LogC) { ex[which(ex <= 0)] <- NaN
  exprs(gset) <- log2(ex) }

# set up the data and proceed with analysis
sml <- paste("G", sml, sep="")    # set group names
fl <- as.factor(sml)
gset$description <- fl
design <- model.matrix(~ description + 0, gset)
colnames(design) <- levels(fl)
fit <- lmFit(gset, design)
cont.matrix <- makeContrasts(G1-G0, levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2, 0.01)
tT <- topTable(fit2, adjust="fdr", sort.by="B", number=250)
Ahill1.9k (United States) wrote (12 months ago):

From a quick look at the script and at the GEO record for GSE116959, I'd say there are no lines in this code that do either replacement of replicated probes or background correction. The source GEO data appear to be log2-scale, quantile-normalized Agilent array intensities, so in your script I expect LogC will be FALSE. The final lines of the script set up a linear model, fit it, and compute differential expression between your two factor levels. If replacement of replicated probes or background correction was done, it is not in this script and is not mentioned in the GEO data-processing description. Something like background correction may have been done as part of the primary Agilent data reduction, but I can't tell that from the above alone.
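
If you want to confirm that the data are already on the log2 scale (and hence that LogC will be FALSE), you can inspect the intensity quantiles yourself. A quick sketch, assuming gset has already been loaded as in the script above:

```r
ex <- exprs(gset)
qx <- as.numeric(quantile(ex, c(0, 0.25, 0.5, 0.75, 0.99, 1), na.rm = TRUE))
qx
# If the 99th percentile is small (well below 100) and the values span a
# narrow range, the matrix is almost certainly already log2-transformed,
# and the script's LogC test will evaluate to FALSE.
```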


Thanks for your answer. Can we say that whenever we obtain a dataset with the getGEO function, the data are already background corrected and replicated probes have been replaced with their mean? And if we want to do our own analysis in R (without GEO2R), does getting the data through this function mean there is no need for background correction or for averaging replicated probes?

(reply by Sib20, 12 months ago)

It would not be safe to say that background correction and duplicate substitution have been done. That is determined by the original authors who submitted the data to GEO; GEO does not, as a matter of practice, require specific data-processing methods. For many common expression platforms background subtraction has likely been done, and for some GEO submissions duplicate processing may have been done as well. But to determine with certainty whether that processing was applied to any specific GEO dataset, you need to read the associated manuscript or the data-processing description on the GEO record.
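
The submitter's processing description is also available programmatically, in the phenotype data of the ExpressionSet that getGEO() returns. A sketch (the exact column names vary between series, so grep for them first):

```r
library(GEOquery)

gset <- getGEO("GSE116959", GSEMatrix = TRUE)[[1]]
pd <- pData(gset)

# find the submitter-supplied processing columns for this series
grep("data_processing", colnames(pd), value = TRUE)

# the description itself (usually identical across samples)
unique(pd$data_processing)
```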

(reply by Ahill1.9k, 12 months ago)
ATpoint44k wrote (12 months ago):

I think GEO2R assumes the data are already normalized when it pulls them via getGEO() (correct me if I am wrong). That means it uses whatever the authors uploaded when submitting the data. This is exactly the problem, and the reason I never use this application: you have essentially no control over how the data have been processed or which probes have been removed (e.g. control probes). This can have quite an impact, because control probes can be numerous (thousands on some arrays), and if they are not removed they inflate the multiple-testing burden and therefore influence the multiple-testing correction. If possible I always start from the raw CEL files. If these are not available, be super careful with the results.
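
For Agilent single-channel arrays such as GPL17077, a typical limma workflow starting from the raw feature-extraction files makes both of the steps you asked about explicit: backgroundCorrect() performs the background correction, and avereps() replaces replicated probes with their mean. A sketch only; the file names and the probe-ID column are assumptions, so check your own files and annotation:

```r
library(limma)

# raw Agilent feature-extraction files (names are placeholders)
files <- c("sample1.txt", "sample2.txt")
x <- read.maimages(files, source = "agilent", green.only = TRUE)

# background correction (normexp is a common choice for Agilent)
x <- backgroundCorrect(x, method = "normexp")

# between-array quantile normalization (values are log2 afterwards)
x <- normalizeBetweenArrays(x, method = "quantile")

# drop control probes before averaging (ControlType 0 = regular probes)
x <- x[x$genes$ControlType == 0, ]

# replace replicated probes with their mean
x.ave <- avereps(x, ID = x$genes$ProbeName)
```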



Powered by Biostar version 2.3.0