Question: Where to set intensity filtering threshold if oligo::rma returns large peak of very low intensities for Affymetrix microarrays?
gravatar for Mari
18 months ago by
Mari0 wrote:


I am trying to analyse a few Affymetrix microarrays deposited on NCBI GEO to determine if they have any DE genes in common. I am a newbie and am roughly following the End to end workflow by Klaus & Reisenauer 2018. I am using the GEOQuery package instead of the ArrayExpress package to download the raw data.

Example: GSE23738 (Rice et al 2011)

My problem: Klaus and Reisenauer recommend filtering out lowly expressed genes and do so manually by inspecting a histogram of median intensities post-background correction/normalisation (rma function). I struggle with deciding on a threshold value, since the oligo::rma function of the raw intensities results in a large peak of very low median intensities (~ 2-3). This issue occurred with GSE23738, but also several other different arrays (e.g. GSE120616, GSE102850, GSE11238).

Thank you very much for your help!!

Script used for all above mentioned arrays, however linear model was only applied to arrays with replication.


# GSE23738 - Rice et al 2011 (
untar("GSE23738/GSE23738_RAW.tar", exdir = "GSE23738/CEL")

rice_raw <- read.celfiles(list.files("GSE23738/CEL", full = TRUE, pattern = "CEL.gz")) 
pData(rice_raw)$filename <- sampleNames(rice_raw) 
replicate <- factor(c(1,2,3,1,2,3,1,2,3,1,2, 1,2,3,1,2,3), levels = 1:3, labels = c("1", "2", "3"))
time <- factor(c(1,1,1,2,2,2,3,3,3,4,4,5,5,5,6,6,6), levels = 1:6, labels = c("0 dpi", "1 dpi", "2 dpi", "3 dpi", "4 dpi", "5 dpi"))
pData(rice_raw)$replicate <- replicate 
pData(rice_raw)$time <- time

arrayQualityMetrics(expressionset = rice_raw, force = TRUE, do.logtransform = TRUE, intgroup = "time") 
rice_eset <- rma(rice_raw)
arrayQualityMetrics(expressionset = rice_eset, force = TRUE, do.logtransform = FALSE, intgroup = "time") # 1 outlier

# Histogram of median intensities to decide on intensity threshold
rice_medians <- rowMedians(exprs(rice_eset))
median_histogram <- hist(rice_medians, 100, col = "cornsilk", freq = FALSE, main = "Histogram of median intensities", border = "antiquewhite4", xlab = "Median intensities") 

# Continue without filtering
rice_annotations <- AnnotationDbi::select(mouse4302.db, keys = (featureNames(rice_eset)), columns = c("SYMBOL", "GENENAME", "ENTREZID"), keytype = "PROBEID")
rice_annotations <- subset(rice_annotations, ! 
multiple_annotation_probes <- filter(summarise(group_by(rice_annotations, PROBEID), no_of_matches = n_distinct(SYMBOL)), no_of_matches >1) 
probes_to_exclude <- (featureNames(rice_eset) %in% multiple_annotation_probes$PROBEID) 
rice_final <- subset(rice_eset, !probes_to_exclude)
fData(rice_final)$PROBEID <- rownames(fData(rice_final)) 
fData(rice_final) <- left_join(fData(rice_final), rice_annotations)
rownames(fData(rice_final)) <- fData(rice_final)$PROBEID 

rice_design <- model.matrix(~0 + time + replicate)
colnames(rice_design)[1:6] <- c("dpi0", "dpi1", "dpi2", "dpi3", "dpi4", "dpi5")
rownames(rice_design) <- replicate

# Comparison 1 dpi vs 0 dpi
contrast_matrix <- makeContrasts(dpi1-dpi0, levels = rice_design)
rice_analysis <- eBayes(, design = rice_design), contrast_matrix), trend = TRUE)

arrayQualityMetrics Density Plot (raw intensities) dens

arrayQualityMetrics Density Plot (eset intensities) dens-eset

Histogram of median intensities Rplot

plotSA output Rplot01

limma filtering microarray • 607 views
ADD COMMENTlink modified 13 months ago by Biostar ♦♦ 20 • written 18 months ago by Mari0

Please see How to add images to a Biostars post and then make the necessary changes. You have to provide the full link to the image including the suffix. I also suggest removing all code that is not directly related to the problem including the session info as it inflates the size of your post.

ADD REPLYlink written 18 months ago by ATpoint44k

Hey, regarding the threshold, there will never be just one specific threshold to choose over another. Filters can be set differently for each experiment. I would also encourage you to not permit that your mind 'festers' too much on this topic of setting a threshold. If you are aiming to do the filtering, then pick a threshold and see how many probes/genes are filtered out after the filtering has occurred.

Also, take a look in the Limma manual, where some filtering of probes is mentioned: (see Chapter 7)

ADD REPLYlink written 18 months ago by Kevin Blighe69k

Hi, I had indeed read the section of the Limma manual, but struggle with implementing it and hence hope to get help with it. I understand the idea behind the following section, but I don't know how to apply it in practice:

Some microarray platforms [...] include control probes from which the background intensity corresponding to unexpressed probes can be estimated. In such cases, it may be useful to keep probes that are expressed above background on at least k arrays, where k is the smallest number of replicates assigned to any of the treatment combinations.

I've only ever worked through the Klaus & Reisenauer 2018 example, where the choice of threshold was understandable. The large low-intensity peak that I have in my datasets is a bit off-putting, because I am not sure if it is sensible to assume it is all background and filter it out. I am hesitating, because I've come across the following in a couple of posts on filtering such as this post:

"You shouldn't need to filter more than about 25% of probe-sets for these arrays."

Thanks again for taking the time to help me understand this better!

ADD REPLYlink written 18 months ago by Mari0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2143 users visited in the last hour