Tutorial:Analysing Microarray Data In Bioconductor
2
140
Entering edit mode
8.6 years ago

I was thinking about creating a tutorial on how to do a simple microarray analysis in Bioconductor. But, I realized this has already been done quite nicely at the Bioinformatics Knowledgeblog. Their first tutorial on the subject covers installation of necessary packages, downloading of cel files, describing the experiment, loading and normalizing data, quality controls, probe set filtering, finding differentially expressed probe sets, and finally annotating those probe sets to gene symbols. They have follow-up tutorials on visualizing microarray data with volcano plots and assessing the biological significance of results. All three seem quite excellent after a brief perusal.

I might have also included: filtering with 'genefilter', multiple testing correction with 'multtest', and visualizations with 'heatmap.2'. Let me know if you would like to see tutorials specifically on any of those topics.

UPDATE: Since this seems to be coming up with some regularity I decided to add some sample code for importing CEL files, normalizing, and annotating to Entrez Gene ID and Symbol, even though this overlaps largely with the tutorial linked above. It illustrates a few slightly different things and also gives a concrete example for the 'Affymetrix Human Gene 1.0 ST Array'.

#install the core bioconductor packages, if not already installed
source("http://bioconductor.org/biocLite.R")
biocLite()

biocLite("GEOquery")
biocLite("affy")
biocLite("gcrma")
biocLite("hugene10stv1cdf")
biocLite("hugene10stv1probe")
biocLite("hugene10stprobeset.db")
biocLite("hugene10sttranscriptcluster.db")

library(GEOquery)
library(affy)
library(gcrma)
library(hugene10stv1cdf)
library(hugene10stv1probe)
library(hugene10stprobeset.db)
library(hugene10sttranscriptcluster.db)

setwd("/Users/ogriffit/Dropbox/BioStars")

#Download the CEL file package for this dataset (by GSE - Geo series id)
getGEOSuppFiles("GSE27447")

#Unpack the CEL files
setwd("/Users/ogriffit/Dropbox/BioStars/GSE27447")
untar("GSE27447_RAW.tar", exdir="data")
cels = list.files("data/", pattern = "CEL")
sapply(paste("data", cels, sep="/"), gunzip)
cels = list.files("data/", pattern = "CEL")

setwd("/Users/ogriffit/Dropbox/BioStars/GSE27447/data")

#perform RMA normalization (I would normally use GCRMA but it did not work with this chip)
data.rma.norm=rma(raw.data)

#Get the important stuff out of the data - the expression estimates for each array
rma=exprs(data.rma.norm)

#Format values to 5 decimal places
rma=format(rma, digits=5)

#Map probe sets to gene symbols or other annotations
#To see all available mappings for this platform
ls("package:hugene10stprobeset.db") #Annotations at the exon probeset level
ls("package:hugene10sttranscriptcluster.db") #Annotations at the transcript-cluster level (more gene-centric view)

#Extract probe ids, entrez symbols, and entrez ids
probes=row.names(rma)
Symbols = unlist(mget(probes, hugene10sttranscriptclusterSYMBOL, ifnotfound=NA))
Entrez_IDs = unlist(mget(probes, hugene10sttranscriptclusterENTREZID, ifnotfound=NA))

#Combine gene annotations with raw data
rma=cbind(probes,Symbols,Entrez_IDs,rma)

#Write RMA-normalized, mapped data to file
write.table(rma, file = "rma.txt", quote = FALSE, sep = "\t", row.names = FALSE, col.names = TRUE)


This produces a tab-delimited text file of the following format. Note that many probes will have "NA" for gene symbol and Entrez ID.

probes Symbols Entrez_IDs GSM678364_B2.CEL GSM678365_B4.CEL GSM678366_B5.CEL ...

7897441 H6PD 9563 6.5943 7.0552 7.5201 ...

7897449 SPSB1 80176 6.9727 7.0281 7.2285 ...

7897460 SLC25A33 84275 7.6659 7.4289 7.9707 ...

4
Entering edit mode

I wanted to perform WGCNA analysis on Affy cell files. So according to this tutorial, the "rma.txt" could be used as the input for WGCNA ?

4
Entering edit mode

I've not used the WGCNA R package. But, if it starts with a gene expression matrix as input, then yes, it seems like you could use the RMA normalized gene expression values for your analysis.

3
Entering edit mode

Thanks..yea, the first column has the list of identifiers followed by the expression profiles of all the samples.

2
Entering edit mode

HI, my computer gets stuck every time i use the annotation package( the cbind command). Is there anyway to overcome that? I have also tried allotting heavy memory on our computing cluster and yet the problem is unsolved

3
Entering edit mode

This so helpful! I've been trying to figure out all these for months.. Thank you!

1
Entering edit mode

No problem. Although, thanks should go to the good people at the Bioinformatics Knowledgeblog who created these tutorials. ps. For future, comments/thanks like this would be better as a "comment" to the main post rather than as an "answer".

0
Entering edit mode

Dear Obi,

I have normalized my affymetrix microarray data and did gene filter to remove probes that do not have entrez id's.

Now I want to annotate my data for gene symbols. I've followed your steps to retrieve gene symbols. After writing the data in text file I found a column mentioning as <S4 object of class "ExpressionSet"> instead of expression values of the sample. can you please help me how to get expression values

3
Entering edit mode

The limma user manual also goes into linear models and is very useful for analysis of illumina microarray data

3
Entering edit mode

Very useful. Thanks.

2
Entering edit mode

Hi, this was very good tutorial. You said you might do a heat map. 2, which would be great. I am analysing data on this kind of chip and need to do heat map.2. I would be very grateful to see how you would incorporate heatmap.2 into this tutorial. As I am having difficulty with it. Thanks for your time

2
Entering edit mode

Very good tutorial, thank you

2
Entering edit mode

Very Good tutorial. Thank you

2
Entering edit mode

very good tutorial. thanks for sharing

1
Entering edit mode

Hi Obi,

i followed the steps you mentioned today and i have a text file now, how can i visualize the data to interpretation please?

1
Entering edit mode

Thanks for such a nice tutorial

1
Entering edit mode

I found also this It may be helpful http://learn.genetics.utah.edu/content/labs/microarray/

1
Entering edit mode

to avoid converting the entire data frame to a "character" matrix, I would use cbind.data.frame() instead of cbind():

#Combine gene annotations with raw data
rma=cbind.data.frame(probes,Symbols,Entrez_IDs,rma)

1
Entering edit mode

Can you please tell me how to find the best probeset in microarray analysis?

1
Entering edit mode

Dear all,

I tried to use below code to normalize my data before using combat to deal with batch effects:

library(affy)
norm.rma <- rma(norm)

Unluckily, my data include 3 data sets, 2 Affymetrix Human Genome U133 Plus 2.0 Array and 1 Affymetrix Human Gene 1.0 ST Array. I received this error message "Error in affyio::read_abatch(filenames, rm.mask, rm.outliers, rm.extra,: Cel file D:////batchdemo/GSMxxxx.CEL.gz does not seem to have the correct dimensions".

I tried to use dchip instead but again, I failed. I even couldn't input the data to dchip.

Any advice and suggestions will be greatly appreciated.

Thank you.

1
Entering edit mode

@BioMed: Please create a new post with this question, if you feel this is not directly related to Obi's tutorial. It is not appropriate to use Submit Answers on this post to ask a new question.

0
Entering edit mode

Please delete the above comment. I created a new post. Thank you.

0
Entering edit mode

this is very useful,but I have a problem in Describing the experiment step,when I am using command.

$ls data/*.CEL > data/phenodata.txt this returns an error$ ls data/*.CEL > data/phenodata.txt Error: unexpected '$' in "$" what should I do now?

0
Entering edit mode

$ is your terminal's command prompt end. Don't use $ and try.

0
Entering edit mode

thank you for a reply. actually I do not work with linux. do you know same of this tutorial for windows?

0
Entering edit mode

hi, i tried to run this command data.rma.norm=rma(raw.data) but Unfortunately I faced this error, would you please help me. my platform is GPL96

Error in getCdfInfo(object) : Could not obtain CDF environment, problems encountered: Specified environment does not contain HGU133A_HS_ENTREZG,HGU133B_HS_ENTREZG Library - package hgu133ahsentrezg,hgu133bhsentrezgcdf not installed Bioconductor - hgu133ahsentrezg,hgu133bhsentrezgcdf not available

0
Entering edit mode

for RMA normalization

library(affy)

# To read all CEL files in the working directory:

eset<-rma(Data)
norm.data<-exprs(eset)

0
Entering edit mode

Hello Dear Obi Griffith, I am beginner using bioconductor, I like so much your workflow is very explanatory, however in the end based on the data that are generated the file "rma.txt" how do I analyze using the limma pack? Would you have some workflow as good as yours that I can follow?

Note: I am working with two normal states and tumor, how to compare the exprssão between them?

0
Entering edit mode

Hi! Does somebody know if the tutorial from theBioinformatics KnowledgeBlog is still available somewhere? I cannot open it anymore and I am struggling with the exact same task.. Thanks a lot!

1
Entering edit mode
5.1 years ago
himanshup332 ▴ 10

hi, i want to use limma for obtaining list of highly expressed genes in a Dataset(GEO**). will someone help me that how to use R studio for this purpose, the type of input file ,the command step by step, etc till the result.

2
Entering edit mode

Search Google 'limma user guide'. You shouldn't ask this as an answer to some other post. You can create a new thread but not for the same question as this is very easy to find with google.

1
Entering edit mode
0
Entering edit mode
3.5 years ago

Hello, I would like to ask is it possible to use similar functions to work with Agilent data? If not, then can someone tell me where can i find something similar to work with Agilent data.

Thank You

2
Entering edit mode

Hi Daniel, yes, you can analyse Agilent microarrays. The architecture of the Agilent arrays is fundamentally different than others, though, so the processing is different too. Here is some sample code for you, which should assist:

source("http://bioconductor.org/biocLite.R")
options(scipen=999)
require("limma")

#Read in the data into a dataframe
#readTargets will by default look for the 'FileName' column in the spcified file

#Converts the data to a RGList (two-colour [red-green] array), with values for R, Rg, G, Gb

#Perform background correction on the fluorescent intensities
#'normexp' is beneficial in that it doesn't result in negative values, meaning no data is lost
project.bgcorrect <- backgroundCorrect(project, method="normexp", offset=16)

#Normalize the data with the 'loess' method
#LOESS performs local regression on subsets of the data, resulting in the generation of a 'regression curve'
project.bgcorrect.norm <- normalizeWithinArrays(project.bgcorrect, method="loess")

#For replicate probes in each sample, replace values with the average
#ID is used to identify the replicates
project.bgcorrect.norm.avg <- avereps(project.bgcorrect.norm, ID=project.bgcorrect.norm$genes$ProbeName)

#Write out the expression values (using CEL file name as headers)
wObject <- project.bgcorrect.norm.avg$M rownames(wObject) <- project.bgcorrect.norm.avg$genes$GeneName temp <- temp[order(rownames(wObject), decreasing=FALSE),] write.table(wObject, "Results/MasterExpression.tsv", sep="\t", quote=FALSE, append=FALSE) #Summarise over gene names (averaging counts over transcripts) temp <- data.frame(cbind(data.frame(project.bgcorrect.norm.avg$genes$GeneName), data.frame(project.bgcorrect.norm.avg$M)))
project.bgcorrect.norm.avg.summarised <- aggregate(temp[,2:ncol(temp)], temp[1], mean)
rownames(project.bgcorrect.norm.avg.summarised) <- project.bgcorrect.norm.avg.summarised[,1]
project.bgcorrect.norm.avg.summarised <- project.bgcorrect.norm.avg.summarised[,2:ncol(project.bgcorrect.norm.avg.summarised)]
write.table(project.bgcorrect.norm.avg.summarised, "Results/MasterExpressionAverages.tsv", sep="\t", quote=FALSE, append=FALSE)

0
Entering edit mode

Oh my, such a quick response! Thank you very much!

1
Entering edit mode

You're welcome! Hope that it helps.

0
Entering edit mode

Kevin i need a little bit more help :D, I am newbie at bioinformatics and especialy at working with R. When I use function:

targetinfo <- readTargets("Targets.txt", sep="\t")


I get this error:

" Error in read.table(file, header = TRUE, stringsAsFactors = FALSE, sep = sep,  :
more columns than column names "


I am working with GSE75915 data.

2
Entering edit mode

daniel.naumovas : Have you considered using GEO2R which is a built-in analysis application available from NCBI GEO?

1
Entering edit mode

I was going to suggest something like that because the files that the group have made available are not recognised as they don't have the expected column names for the 2-colour Agilent array. It looks like they have been processed to a certain point.

I can't make my own data available as it's protected by agreements.

0
Entering edit mode

I have tried GEO2R, but you cant do normalization with it :/ Need normalized data to work with WGCNA

2
Entering edit mode

Hello again, the data is already normalised.

GSE75915 <- read.csv("GSE75915_series_matrix.txt", skip=76, header=TRUE, sep="\t", stringsAsFactors=FALSE)
GSE75915 <- GSE75915[grep("A_", GSE75915[,1]),] #Indirectly remove control probes
rownames(GSE75915) <- GSE75915[,1]
GSE75915 <- GSE75915[,-1]
dim(GSE75915)
[1] 34137    18
boxplot(GSE75915, pch=".")
hist(data.matrix(GSE75915), breaks=150)


If you want to annotate these probe IDs, I suggest that you do it manually by downloading the human annotation from Agilent's website, here. Alternatively, take a look at the annotate package in R.

Kevin

1
Entering edit mode

I added code markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

0
Entering edit mode

Hello, my targets file just contains this (tab-delimited):

FileName WT_KO Time

SampleFiles/File1.txt WT 4Wk

SampleFiles/File2.txt KO 4Dy

SampleFiles/File3.txt KO 7Dy

SampleFiles/File4.txt WT 4Wk

et cetera

The files (txt or CEL) are in a directory called SampleFiles. I think that the column name has to be 'FileName', whist the others can be anything (these others are used for limma).

Does that help?

0
Entering edit mode

Hmm I cant think which column should be named FileName. Maybe your data is available in GEO db so i could use it as example?