Tutorial: Analysing Microarray Data In Bioconductor
gravatar for Obi Griffith
5.7 years ago by
Obi Griffith16k
Washington University, St Louis, USA
Obi Griffith16k wrote:

I was thinking about creating a tutorial on how to do a simple microarray analysis in Bioconductor. But, I realized this has already been done quite nicely at the Bioinformatics Knowledgeblog. Their first tutorial on the subject covers installation of necessary packages, downloading of cel files, describing the experiment, loading and normalizing data, quality controls, probe set filtering, finding differentially expressed probe sets, and finally annotating those probe sets to gene symbols. They have follow-up tutorials on visualizing microarray data with volcano plots and assessing the biological significance of results. All three seem quite excellent after a brief perusal.

I might have also included: filtering with 'genefilter', multiple testing correction with 'multtest', and visualizations with 'heatmap.2'. Let me know if you would like to see tutorials specifically on any of those topics.

UPDATE: Since this seems to be coming up with some regularity I decided to add some sample code for importing CEL files, normalizing, and annotating to Entrez Gene ID and Symbol, even though this overlaps largely with the tutorial linked above. It illustrates a few slightly different things and also gives a concrete example for the 'Affymetrix Human Gene 1.0 ST Array'.

#install the core bioconductor packages, if not already installed

# install additional bioconductor libraries, if not already installed

#Load the necessary libraries

#Set working directory for download

#Download the CEL file package for this dataset (by GSE - Geo series id)

#Unpack the CEL files
untar("GSE27447_RAW.tar", exdir="data")
cels = list.files("data/", pattern = "CEL")
sapply(paste("data", cels, sep="/"), gunzip)
cels = list.files("data/", pattern = "CEL")

raw.data=ReadAffy(verbose=TRUE, filenames=cels, cdfname="hugene10stv1") #From bioconductor

#perform RMA normalization (I would normally use GCRMA but it did not work with this chip)

#Get the important stuff out of the data - the expression estimates for each array

#Format values to 5 decimal places
rma=format(rma, digits=5)

#Map probe sets to gene symbols or other annotations
#To see all available mappings for this platform
ls("package:hugene10stprobeset.db") #Annotations at the exon probeset level
ls("package:hugene10sttranscriptcluster.db") #Annotations at the transcript-cluster level (more gene-centric view)

#Extract probe ids, entrez symbols, and entrez ids
Symbols = unlist(mget(probes, hugene10sttranscriptclusterSYMBOL, ifnotfound=NA))
Entrez_IDs = unlist(mget(probes, hugene10sttranscriptclusterENTREZID, ifnotfound=NA))

#Combine gene annotations with raw data

#Write RMA-normalized, mapped data to file
write.table(rma, file = "rma.txt", quote = FALSE, sep = "\t", row.names = FALSE, col.names = TRUE)

This produces a tab-delimited text file of the following format. Note that many probes will have "NA" for gene symbol and Entrez ID.

probes Symbols Entrez_IDs GSM678364_B2.CEL GSM678365_B4.CEL GSM678366_B5.CEL ...

7897441 H6PD 9563 6.5943 7.0552 7.5201 ...

7897449 SPSB1 80176 6.9727 7.0281 7.2285 ...

7897460 SLC25A33 84275 7.6659 7.4289 7.9707 ...

ADD COMMENTlink modified 7 months ago by Kevin Blighe21k • written 5.7 years ago by Obi Griffith16k

I wanted to perform WGCNA analysis on Affy cell files. So according to this tutorial, the "rma.txt" could be used as the input for WGCNA ?

ADD REPLYlink written 5.1 years ago by pixie@bioinfo1.3k

I've not used the WGCNA R package. But, if it starts with a gene expression matrix as input, then yes, it seems like you could use the RMA normalized gene expression values for your analysis.

ADD REPLYlink written 5.1 years ago by Obi Griffith16k

Thanks..yea, the first column has the list of identifiers followed by the expression profiles of all the samples.

ADD REPLYlink written 5.1 years ago by pixie@bioinfo1.3k

HI, my computer gets stuck every time i use the annotation package( the cbind command). Is there anyway to overcome that? I have also tried allotting heavy memory on our computing cluster and yet the problem is unsolved

ADD REPLYlink written 4.4 years ago by swe.srini20

This so helpful! I've been trying to figure out all these for months.. Thank you!

ADD REPLYlink written 5.7 years ago by Ane180

No problem. Although, thanks should go to the good people at the Bioinformatics Knowledgeblog who created these tutorials. ps. For future, comments/thanks like this would be better as a "comment" to the main post rather than as an "answer".

ADD REPLYlink written 5.7 years ago by Obi Griffith16k

Dear Obi,

I have normalized my affymetrix microarray data and did gene filter to remove probes that do not have entrez id's.

Now I want to annotate my data for gene symbols. I've followed your steps to retrieve gene symbols. After writing the data in text file I found a column mentioning as <S4 object of class "ExpressionSet"> instead of expression values of the sample. can you please help me how to get expression values


ADD REPLYlink written 2.9 years ago by rohit30

The limma user manual also goes into linear models and is very useful for analysis of illumina microarray data

ADD REPLYlink written 5.7 years ago by Ying W3.8k

Very useful. Thanks.

ADD REPLYlink written 5.1 years ago by Suren90

Hi, this was very good tutorial. You said you might do a heat map. 2, which would be great. I am analysing data on this kind of chip and need to do heat map.2. I would be very grateful to see how you would incorporate heatmap.2 into this tutorial. As I am having difficulty with it. Thanks for your time

ADD REPLYlink written 4.8 years ago by martinkara720

Very good tutorial, thank you 

ADD REPLYlink written 4.0 years ago by sana.mystudies60

Very Good tutorial. Thank you

ADD REPLYlink written 4.0 years ago by sana.mystudies60

very good tutorial. thanks for sharing

ADD REPLYlink written 3.4 years ago by ju.luan20

Hi Obi,

i followed the steps you mentioned today and i have a text file now, how can i visualize the data to interpretation please?

ADD REPLYlink written 3.1 years ago by F3.0k

Thanks for such a nice tutorial

ADD REPLYlink modified 2.8 years ago • written 2.9 years ago by Bioinformatist Newbie220

I found also this It may be helpful http://learn.genetics.utah.edu/content/labs/microarray/

ADD REPLYlink written 2.4 years ago by Medhat7.6k

to avoid converting the entire data frame to a "character" matrix, I would use cbind.data.frame() instead of cbind():

#Combine gene annotations with raw data
ADD REPLYlink written 2.2 years ago by informatics bot530

Can you please tell me how to find the best probeset in microarray analysis?

ADD REPLYlink written 2.2 years ago by Bioinfo200

Dear all,

I tried to use below code to normalize my data before using combat to deal with batch effects:

norm <- ReadAffy()
norm.rma <- rma(norm)

Unluckily, my data include 3 data sets, 2 Affymetrix Human Genome U133 Plus 2.0 Array and 1 Affymetrix Human Gene 1.0 ST Array. I received this error message "Error in affyio::read_abatch(filenames, rm.mask, rm.outliers, rm.extra,: Cel file D:////batchdemo/GSMxxxx.CEL.gz does not seem to have the correct dimensions".

I tried to use dchip instead but again, I failed. I even couldn't input the data to dchip.

Any advice and suggestions will be greatly appreciated.

Thank you.

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by BioMed40

@BioMed: Please create a new post with this question, if you feel this is not directly related to Obi's tutorial. It is not appropriate to use Submit Answers on this post to ask a new question.

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by genomax49k

Please delete the above comment. I created a new post. Thank you.

ADD REPLYlink written 2.1 years ago by BioMed40

this is very useful,but I have a problem in Describing the experiment step,when I am using command.

$ ls data/*.CEL > data/phenodata.txt

this returns an error

$ ls data/*.CEL > data/phenodata.txt Error: unexpected '$' in "$" what should I do now?

ADD REPLYlink written 17 months ago by Edalat30

$ is your terminal's command prompt end. Don't use $ and try.

ADD REPLYlink written 17 months ago by venu5.3k

thank you for a reply. actually I do not work with linux. do you know same of this tutorial for windows?

ADD REPLYlink written 17 months ago by Edalat30

hi, i tried to run this command data.rma.norm=rma(raw.data) but Unfortunately I faced this error, would you please help me. my platform is GPL96

Error in getCdfInfo(object) : Could not obtain CDF environment, problems encountered: Specified environment does not contain HGU133A_HS_ENTREZG,HGU133B_HS_ENTREZG Library - package hgu133ahsentrezg,hgu133bhsentrezgcdf not installed Bioconductor - hgu133ahsentrezg,hgu133bhsentrezgcdf not available

ADD REPLYlink written 17 months ago by arezoo.yazdani0

for RMA normalization


# To read all CEL files in the working directory:

ADD REPLYlink modified 17 months ago by WouterDeCoster29k • written 17 months ago by F3.0k

Hello Dear Obi Griffith, I am beginner using bioconductor, I like so much your workflow is very explanatory, however in the end based on the data that are generated the file "rma.txt" how do I analyze using the limma pack? Would you have some workflow as good as yours that I can follow?

Note: I am working with two normal states and tumor, how to compare the exprssão between them?

ADD REPLYlink written 7 months ago by Leite240
gravatar for himanshup332
2.3 years ago by
himanshup33210 wrote:

hi, i want to use limma for obtaining list of highly expressed genes in a Dataset(GEO**). will someone help me that how to use R studio for this purpose, the type of input file ,the command step by step, etc till the result.

ADD COMMENTlink written 2.3 years ago by himanshup33210

Search Google 'limma user guide'. You shouldn't ask this as an answer to some other post. You can create a new thread but not for the same question as this is very easy to find with google.

ADD REPLYlink written 2.3 years ago by venu5.3k
gravatar for daniel.naumovas
7 months ago by
daniel.naumovas0 wrote:

Hello, I would like to ask is it possible to use similar functions to work with Agilent data? If not, then can someone tell me where can i find something similar to work with Agilent data.

Thank You

ADD COMMENTlink written 7 months ago by daniel.naumovas0

Hi Daniel, yes, you can analyse Agilent microarrays. The architecture of the Agilent arrays is fundamentally different than others, though, so the processing is different too. Here is some sample code for you, which should assist:


#Read in the data into a dataframe
#readTargets will by default look for the 'FileName' column in the spcified file
targetinfo <- readTargets("Targets.txt", sep="\t")

#Converts the data to a RGList (two-colour [red-green] array), with values for R, Rg, G, Gb
project <- read.maimages(targetinfo, source="agilent")

#Perform background correction on the fluorescent intensities
#'normexp' is beneficial in that it doesn't result in negative values, meaning no data is lost
project.bgcorrect <- backgroundCorrect(project, method="normexp", offset=16)

#Normalize the data with the 'loess' method
#LOESS performs local regression on subsets of the data, resulting in the generation of a 'regression curve'
project.bgcorrect.norm <- normalizeWithinArrays(project.bgcorrect, method="loess")

#For replicate probes in each sample, replace values with the average
#ID is used to identify the replicates
project.bgcorrect.norm.avg <- avereps(project.bgcorrect.norm, ID=project.bgcorrect.norm$genes$ProbeName)

#Write out the expression values (using CEL file name as headers)
wObject <- project.bgcorrect.norm.avg$M
rownames(wObject) <- project.bgcorrect.norm.avg$genes$GeneName
temp <- temp[order(rownames(wObject), decreasing=FALSE),]
write.table(wObject, "Results/MasterExpression.tsv", sep="\t", quote=FALSE, append=FALSE)

#Summarise over gene names (averaging counts over transcripts)
temp <- data.frame(cbind(data.frame(project.bgcorrect.norm.avg$genes$GeneName), data.frame(project.bgcorrect.norm.avg$M)))
project.bgcorrect.norm.avg.summarised <- aggregate(temp[,2:ncol(temp)], temp[1], mean)
rownames(project.bgcorrect.norm.avg.summarised) <- project.bgcorrect.norm.avg.summarised[,1]
project.bgcorrect.norm.avg.summarised <- project.bgcorrect.norm.avg.summarised[,2:ncol(project.bgcorrect.norm.avg.summarised)]
write.table(project.bgcorrect.norm.avg.summarised, "Results/MasterExpressionAverages.tsv", sep="\t", quote=FALSE, append=FALSE)
ADD REPLYlink modified 7 months ago • written 7 months ago by Kevin Blighe21k

Oh my, such a quick response! Thank you very much!

ADD REPLYlink written 7 months ago by daniel.naumovas0

You're welcome! Hope that it helps.

ADD REPLYlink written 7 months ago by Kevin Blighe21k

Kevin i need a little bit more help :D, I am newbie at bioinformatics and especialy at working with R. When I use function:

targetinfo <- readTargets("Targets.txt", sep="\t")

I get this error:

" Error in read.table(file, header = TRUE, stringsAsFactors = FALSE, sep = sep,  : 
 more columns than column names "

I am working with GSE75915 data.

ADD REPLYlink modified 7 months ago by WouterDeCoster29k • written 7 months ago by daniel.naumovas0

daniel.naumovas : Have you considered using GEO2R which is a built-in analysis application available from NCBI GEO?

ADD REPLYlink written 7 months ago by genomax49k

I was going to suggest something like that because the files that the group have made available are not recognised as they don't have the expected column names for the 2-colour Agilent array. It looks like they have been processed to a certain point.

I can't make my own data available as it's protected by agreements.

ADD REPLYlink written 7 months ago by Kevin Blighe21k

I have tried GEO2R, but you cant do normalization with it :/ Need normalized data to work with WGCNA

ADD REPLYlink written 7 months ago by daniel.naumovas0

Hello again, the data is already normalised.

I download the normalised [edit:] intensities from here: ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE75nnn/GSE75915/matrix/

GSE75915 <- read.csv("GSE75915_series_matrix.txt", skip=76, header=TRUE, sep="\t", stringsAsFactors=FALSE)
GSE75915 <- GSE75915[grep("A_", GSE75915[,1]),] #Indirectly remove control probes
rownames(GSE75915) <- GSE75915[,1]
GSE75915 <- GSE75915[,-1]
[1] 34137    18
boxplot(GSE75915, pch=".")
hist(data.matrix(GSE75915), breaks=150)

boxplot hist
image uploader

If you want to annotate these probe IDs, I suggest that you do it manually by downloading the human annotation from Agilent's website, here. Alternatively, take a look at the annotate package in R.


ADD REPLYlink modified 7 months ago • written 7 months ago by Kevin Blighe21k

I added code markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

ADD REPLYlink written 7 months ago by WouterDeCoster29k

Hello, my targets file just contains this (tab-delimited):

FileName WT_KO Time

SampleFiles/File1.txt WT 4Wk

SampleFiles/File2.txt KO 4Dy

SampleFiles/File3.txt KO 7Dy

SampleFiles/File4.txt WT 4Wk

et cetera

The files (txt or CEL) are in a directory called SampleFiles. I think that the column name has to be 'FileName', whist the others can be anything (these others are used for limma).

Does that help?

ADD REPLYlink modified 7 months ago • written 7 months ago by Kevin Blighe21k

Hmm I cant think which column should be named FileName. Maybe your data is available in GEO db so i could use it as example?

ADD REPLYlink written 7 months ago by daniel.naumovas0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 923 users visited in the last hour