I was thinking about creating a tutorial on how to do a simple microarray analysis in Bioconductor. But, I realized this has already been done quite nicely at the Bioinformatics Knowledgeblog. Their first tutorial on the subject covers installation of necessary packages, downloading of cel files, describing the experiment, loading and normalizing data, quality controls, probe set filtering, finding differentially expressed probe sets, and finally annotating those probe sets to gene symbols. They have follow-up tutorials on visualizing microarray data with volcano plots and assessing the biological significance of results. All three seem quite excellent after a brief perusal.
I might have also included: filtering with 'genefilter', multiple testing correction with 'multtest', and visualizations with 'heatmap.2'. Let me know if you would like to see tutorials specifically on any of those topics.
UPDATE: Since this seems to be coming up with some regularity I decided to add some sample code for importing CEL files, normalizing, and annotating to Entrez Gene ID and Symbol, even though this overlaps largely with the tutorial linked above. It illustrates a few slightly different things and also gives a concrete example for the 'Affymetrix Human Gene 1.0 ST Array'.
#install the core bioconductor packages, if not already installed
source("http://bioconductor.org/biocLite.R")
biocLite()
# install additional bioconductor libraries, if not already installed
biocLite("GEOquery")
biocLite("affy")
biocLite("gcrma")
biocLite("hugene10stv1cdf")
biocLite("hugene10stv1probe")
biocLite("hugene10stprobeset.db")
biocLite("hugene10sttranscriptcluster.db")
#Load the necessary libraries
library(GEOquery)
library(affy)
library(gcrma)
library(hugene10stv1cdf)
library(hugene10stv1probe)
library(hugene10stprobeset.db)
library(hugene10sttranscriptcluster.db)
#Set working directory for download
setwd("/Users/ogriffit/Dropbox/BioStars")
#Download the CEL file package for this dataset (by GSE - Geo series id)
getGEOSuppFiles("GSE27447")
#Unpack the CEL files
setwd("/Users/ogriffit/Dropbox/BioStars/GSE27447")
untar("GSE27447_RAW.tar", exdir="data")
cels = list.files("data/", pattern = "CEL")
sapply(paste("data", cels, sep="/"), gunzip)
cels = list.files("data/", pattern = "CEL")
setwd("/Users/ogriffit/Dropbox/BioStars/GSE27447/data")
raw.data=ReadAffy(verbose=TRUE, filenames=cels, cdfname="hugene10stv1") #From bioconductor
#perform RMA normalization (I would normally use GCRMA but it did not work with this chip)
data.rma.norm=rma(raw.data)
#Get the important stuff out of the data - the expression estimates for each array
rma=exprs(data.rma.norm)
#Format values to 5 decimal places
rma=format(rma, digits=5)
#Map probe sets to gene symbols or other annotations
#To see all available mappings for this platform
ls("package:hugene10stprobeset.db") #Annotations at the exon probeset level
ls("package:hugene10sttranscriptcluster.db") #Annotations at the transcript-cluster level (more gene-centric view)
#Extract probe ids, entrez symbols, and entrez ids
probes=row.names(rma)
Symbols = unlist(mget(probes, hugene10sttranscriptclusterSYMBOL, ifnotfound=NA))
Entrez_IDs = unlist(mget(probes, hugene10sttranscriptclusterENTREZID, ifnotfound=NA))
#Combine gene annotations with raw data
rma=cbind(probes,Symbols,Entrez_IDs,rma)
#Write RMA-normalized, mapped data to file
write.table(rma, file = "rma.txt", quote = FALSE, sep = "\t", row.names = FALSE, col.names = TRUE)
This produces a tab-delimited text file of the following format. Note that many probes will have "NA" for gene symbol and Entrez ID.
probes Symbols Entrez_IDs GSM678364_B2.CEL GSM678365_B4.CEL GSM678366_B5.CEL ...
7897441 H6PD 9563 6.5943 7.0552 7.5201 ...
7897449 SPSB1 80176 6.9727 7.0281 7.2285 ...
7897460 SLC25A33 84275 7.6659 7.4289 7.9707 ...
I wanted to perform WGCNA analysis on Affy cell files. So according to this tutorial, the "rma.txt" could be used as the input for WGCNA ?
I've not used the WGCNA R package. But, if it starts with a gene expression matrix as input, then yes, it seems like you could use the RMA normalized gene expression values for your analysis.
Thanks..yea, the first column has the list of identifiers followed by the expression profiles of all the samples.
HI, my computer gets stuck every time i use the annotation package( the cbind command). Is there anyway to overcome that? I have also tried allotting heavy memory on our computing cluster and yet the problem is unsolved
This so helpful! I've been trying to figure out all these for months.. Thank you!
No problem. Although, thanks should go to the good people at the Bioinformatics Knowledgeblog who created these tutorials. ps. For future, comments/thanks like this would be better as a "comment" to the main post rather than as an "answer".
Dear Obi,
I have normalized my affymetrix microarray data and did gene filter to remove probes that do not have entrez id's.
Now I want to annotate my data for gene symbols. I've followed your steps to retrieve gene symbols. After writing the data in text file I found a column mentioning as <S4 object of class "ExpressionSet"> instead of expression values of the sample. can you please help me how to get expression values
The limma user manual also goes into linear models and is very useful for analysis of illumina microarray data
Very useful. Thanks.
Hi, this was very good tutorial. You said you might do a heat map. 2, which would be great. I am analysing data on this kind of chip and need to do heat map.2. I would be very grateful to see how you would incorporate heatmap.2 into this tutorial. As I am having difficulty with it. Thanks for your time
Very good tutorial, thank you
very good tutorial. thanks for sharing
Hi Obi,
i followed the steps you mentioned today and i have a text file now, how can i visualize the data to interpretation please?
Thanks for such a nice tutorial
I found also this It may be helpful http://learn.genetics.utah.edu/content/labs/microarray/
to avoid converting the entire data frame to a "character" matrix, I would use cbind.data.frame() instead of cbind():
Can you please tell me how to find the best probeset in microarray analysis?
Dear all,
I tried to use below code to normalize my data before using combat to deal with batch effects:
library(affy)
norm <- ReadAffy()
norm.rma <- rma(norm)
Unluckily, my data include 3 data sets, 2 Affymetrix Human Genome U133 Plus 2.0 Array and 1 Affymetrix Human Gene 1.0 ST Array. I received this error message "Error in affyio::read_abatch(filenames, rm.mask, rm.outliers, rm.extra,: Cel file D:////batchdemo/GSMxxxx.CEL.gz does not seem to have the correct dimensions".
I tried to use dchip instead but again, I failed. I even couldn't input the data to dchip.
Any advice and suggestions will be greatly appreciated.
Thank you.
@BioMed: Please create a new post with this question, if you feel this is not directly related to Obi's tutorial. It is not appropriate to use
Submit Answers
on this post to ask a new question.Please delete the above comment. I created a new post. Thank you.
this is very useful,but I have a problem in Describing the experiment step,when I am using command.
$ ls data/*.CEL > data/phenodata.txt
this returns an error
$ ls data/*.CEL > data/phenodata.txt Error: unexpected '$' in "$" what should I do now?
$
is your terminal's command prompt end. Don't use$
and try.thank you for a reply. actually I do not work with linux. do you know same of this tutorial for windows?
hi, i tried to run this command data.rma.norm=rma(raw.data) but Unfortunately I faced this error, would you please help me. my platform is GPL96
Error in getCdfInfo(object) : Could not obtain CDF environment, problems encountered: Specified environment does not contain HGU133A_HS_ENTREZG,HGU133B_HS_ENTREZG Library - package hgu133ahsentrezg,hgu133bhsentrezgcdf not installed Bioconductor - hgu133ahsentrezg,hgu133bhsentrezgcdf not available
for RMA normalization
Hello Dear Obi Griffith, I am beginner using bioconductor, I like so much your workflow is very explanatory, however in the end based on the data that are generated the file "rma.txt" how do I analyze using the limma pack? Would you have some workflow as good as yours that I can follow?
Note: I am working with two normal states and tumor, how to compare the exprssão between them?
Hi! Does somebody know if the tutorial from theBioinformatics KnowledgeBlog is still available somewhere? I cannot open it anymore and I am struggling with the exact same task.. Thanks a lot!
Hello sir, Could you please suggest to me any R script for microarray miRNA analysis of the Exiqon miRCURY LNA microRNA array dataset?
Thank you.
I followed this but I am getting the following error with
deseq2
: