Tutorial:Analysing Microarray Data In Bioconductor
0
154
Entering edit mode
10.4 years ago

I was thinking about creating a tutorial on how to do a simple microarray analysis in Bioconductor. But, I realized this has already been done quite nicely at the Bioinformatics Knowledgeblog. Their first tutorial on the subject covers installation of necessary packages, downloading of cel files, describing the experiment, loading and normalizing data, quality controls, probe set filtering, finding differentially expressed probe sets, and finally annotating those probe sets to gene symbols. They have follow-up tutorials on visualizing microarray data with volcano plots and assessing the biological significance of results. All three seem quite excellent after a brief perusal.

I might have also included: filtering with 'genefilter', multiple testing correction with 'multtest', and visualizations with 'heatmap.2'. Let me know if you would like to see tutorials specifically on any of those topics.

UPDATE: Since this seems to be coming up with some regularity I decided to add some sample code for importing CEL files, normalizing, and annotating to Entrez Gene ID and Symbol, even though this overlaps largely with the tutorial linked above. It illustrates a few slightly different things and also gives a concrete example for the 'Affymetrix Human Gene 1.0 ST Array'.

#install the core bioconductor packages, if not already installed
source("http://bioconductor.org/biocLite.R")
biocLite()

biocLite("GEOquery")
biocLite("affy")
biocLite("gcrma")
biocLite("hugene10stv1cdf")
biocLite("hugene10stv1probe")
biocLite("hugene10stprobeset.db")
biocLite("hugene10sttranscriptcluster.db")

library(GEOquery)
library(affy)
library(gcrma)
library(hugene10stv1cdf)
library(hugene10stv1probe)
library(hugene10stprobeset.db)
library(hugene10sttranscriptcluster.db)

setwd("/Users/ogriffit/Dropbox/BioStars")

#Download the CEL file package for this dataset (by GSE - Geo series id)
getGEOSuppFiles("GSE27447")

#Unpack the CEL files
setwd("/Users/ogriffit/Dropbox/BioStars/GSE27447")
untar("GSE27447_RAW.tar", exdir="data")
cels = list.files("data/", pattern = "CEL")
sapply(paste("data", cels, sep="/"), gunzip)
cels = list.files("data/", pattern = "CEL")

setwd("/Users/ogriffit/Dropbox/BioStars/GSE27447/data")

#perform RMA normalization (I would normally use GCRMA but it did not work with this chip)
data.rma.norm=rma(raw.data)

#Get the important stuff out of the data - the expression estimates for each array
rma=exprs(data.rma.norm)

#Format values to 5 decimal places
rma=format(rma, digits=5)

#Map probe sets to gene symbols or other annotations
#To see all available mappings for this platform
ls("package:hugene10stprobeset.db") #Annotations at the exon probeset level
ls("package:hugene10sttranscriptcluster.db") #Annotations at the transcript-cluster level (more gene-centric view)

#Extract probe ids, entrez symbols, and entrez ids
probes=row.names(rma)
Symbols = unlist(mget(probes, hugene10sttranscriptclusterSYMBOL, ifnotfound=NA))
Entrez_IDs = unlist(mget(probes, hugene10sttranscriptclusterENTREZID, ifnotfound=NA))

#Combine gene annotations with raw data
rma=cbind(probes,Symbols,Entrez_IDs,rma)

#Write RMA-normalized, mapped data to file
write.table(rma, file = "rma.txt", quote = FALSE, sep = "\t", row.names = FALSE, col.names = TRUE)


This produces a tab-delimited text file of the following format. Note that many probes will have "NA" for gene symbol and Entrez ID.

probes    Symbols        Entrez_IDs    GSM678364_B2.CEL    GSM678365_B4.CEL    GSM678366_B5.CEL    ...
7897441    H6PD        9563            6.5943                 7.0552                7.5201    ...
7897449    SPSB1        80176        6.9727                7.0281                7.2285    ...
7897460    SLC25A33    84275        7.6659                7.4289                7.9707    ...

microarray-analysis bioconductor • 67k views
4
Entering edit mode

I wanted to perform WGCNA analysis on Affy cell files. So according to this tutorial, the "rma.txt" could be used as the input for WGCNA ?

4
Entering edit mode

I've not used the WGCNA R package. But, if it starts with a gene expression matrix as input, then yes, it seems like you could use the RMA normalized gene expression values for your analysis.

3
Entering edit mode

Thanks..yea, the first column has the list of identifiers followed by the expression profiles of all the samples.

2
Entering edit mode

Hi, my computer gets stuck every time I use the annotation package (the cbind command). Is there any way to overcome that? I have also tried allotting heavy memory on our computing cluster and yet the problem is unsolved

3
Entering edit mode

This so helpful! I've been trying to figure out all these for months.. Thank you!

1
Entering edit mode

No problem. Although, thanks should go to the good people at the Bioinformatics Knowledgeblog who created these tutorials. P.S., for future reference, comments/thanks like this would be better as a "comment" to the main post rather than as an "answer".

0
Entering edit mode

Dear Obi,

I have normalized my affymetrix microarray data and did gene filter to remove probes that do not have entrez id's.

Now I want to annotate my data for gene symbols. I've followed your steps to retrieve gene symbols. After writing the data in text file I found a column mentioning as <S4 object of class "ExpressionSet"> instead of expression values of the sample. can you please help me how to get expression values

3
Entering edit mode

The limma user manual also goes into linear models and is very useful for analysis of illumina microarray data

3
Entering edit mode

Very useful. Thanks.

2
Entering edit mode

Hi, this was very good tutorial. You said you might do a heatmap.2, which would be great. I am analysing data on this kind of chip and need to do heatmap.2. I would be very grateful to see how you would incorporate heatmap.2 into this tutorial. As I am having difficulty with it.

2
Entering edit mode

Very good tutorial, thank you

2
Entering edit mode

very good tutorial. thanks for sharing

1
Entering edit mode

Hi Obi,

I followed the steps you mentioned today and I have a text file now, how can I visualize the data to interpretation please?

1
Entering edit mode

Thanks for such a nice tutorial

1
Entering edit mode

I found also this It may be helpful http://learn.genetics.utah.edu/content/labs/microarray/

1
Entering edit mode

Hi, I want to use limma for obtaining list of highly expressed genes in a Dataset(GEO**).

Will someone help me that how to use R studio for this purpose, the type of input file ,the command step by step, etc till the result.

2
Entering edit mode

Search Google 'limma user guide'. You shouldn't ask this as an answer to some other post. You can create a new thread but not for the same question as this is very easy to find with google.

1
Entering edit mode
1
Entering edit mode

To avoid converting the entire data frame to a "character" matrix, I would use cbind.data.frame() instead of cbind():

#Combine gene annotations with raw data
rma=cbind.data.frame(probes,Symbols,Entrez_IDs,rma)

1
Entering edit mode

Can you please tell me how to find the best probeset in microarray analysis?

1
Entering edit mode

I followed this but I am getting the following error with deseq2:

dds <- DESeq(dds1)
estimating size factors estimating dispersions gene-wise dispersion estimates mean-dispersion relationship
Error in estimateDispersionsFit(object, fitType = fitType, quiet = quiet) :
all gene-wise dispersion estimates are within 2 orders of magnitude   from the minimum value, and so the standard curve fitting techniques will not work.
One can instead use the gene-wise estimates as final estimates:
dds <- estimateDispersionsGeneEst(dds)
dispersions(dds) <- mcols(dds)$dispGeneEst ...then continue with testing using nbinomWaldTest or nbinomLRT  ADD REPLY 0 Entering edit mode This is very useful,but I have a problem in Describing the experiment step,when I am using command. $ ls data/*.CEL > data/phenodata.txt


This returns an error

$ls data/*.CEL > data/phenodata.txt Error: unexpected '$' in "$"  What should I do now? ADD REPLY 0 Entering edit mode $ is your terminal's command prompt end. Don't use $ and try. ADD REPLY 0 Entering edit mode Thank you for a reply. Actually I do not work with linux. Do you know same of this tutorial for windows? ADD REPLY 0 Entering edit mode Hi, I tried to run this command data.rma.norm=rma(raw.data)  but unfortunately I faced this error, would you please help me. my platform is GPL96 Error in getCdfInfo(object) : Could not obtain CDF environment, problems encountered: Specified environment does not contain HGU133A_HS_ENTREZG,HGU133B_HS_ENTREZG Library - package hgu133ahsentrezg,hgu133bhsentrezgcdf not installed Bioconductor - hgu133ahsentrezg,hgu133bhsentrezgcdf not available  ADD REPLY 0 Entering edit mode For RMA normalization library(affy) # To read all CEL files in the working directory: Data<-ReadAffy() eset<-rma(Data) norm.data<-exprs(eset)  ADD REPLY 0 Entering edit mode Hello, I would like to ask is it possible to use similar functions to work with Agilent data? If not, then can someone tell me where can I find something similar to work with Agilent data. Thank you ADD REPLY 2 Entering edit mode Hi Daniel, yes, you can analyse Agilent microarrays. The architecture of the Agilent arrays is fundamentally different than others, though, so the processing is different too. Here is some sample code for you, which should assist: source("http://bioconductor.org/biocLite.R") options(scipen=999) require("limma") #Read in the data into a dataframe #readTargets will by default look for the 'FileName' column in the spcified file targetinfo <- readTargets("Targets.txt", sep="\t") #Converts the data to a RGList (two-colour [red-green] array), with values for R, Rg, G, Gb project <- read.maimages(targetinfo, source="agilent") #Perform background correction on the fluorescent intensities #'normexp' is beneficial in that it doesn't result in negative values, meaning no data is lost project.bgcorrect <- backgroundCorrect(project, method="normexp", offset=16) #Normalize the data with the 'loess' method #LOESS performs local regression on subsets of the data, resulting in the generation of a 'regression curve' project.bgcorrect.norm <- normalizeWithinArrays(project.bgcorrect, method="loess") #For replicate probes in each sample, replace values with the average #ID is used to identify the replicates project.bgcorrect.norm.avg <- avereps(project.bgcorrect.norm, ID=project.bgcorrect.norm$genes$ProbeName) #Write out the expression values (using CEL file name as headers) wObject <- project.bgcorrect.norm.avg$M
rownames(wObject) <- project.bgcorrect.norm.avg$genes$GeneName
temp <- temp[order(rownames(wObject), decreasing=FALSE),]
write.table(wObject, "Results/MasterExpression.tsv", sep="\t", quote=FALSE, append=FALSE)

#Summarise over gene names (averaging counts over transcripts)
temp <- data.frame(cbind(data.frame(project.bgcorrect.norm.avg$genes$GeneName), data.frame(project.bgcorrect.norm.avg\$M)))
project.bgcorrect.norm.avg.summarised <- aggregate(temp[,2:ncol(temp)], temp[1], mean)
rownames(project.bgcorrect.norm.avg.summarised) <- project.bgcorrect.norm.avg.summarised[,1]
project.bgcorrect.norm.avg.summarised <- project.bgcorrect.norm.avg.summarised[,2:ncol(project.bgcorrect.norm.avg.summarised)]
write.table(project.bgcorrect.norm.avg.summarised, "Results/MasterExpressionAverages.tsv", sep="\t", quote=FALSE, append=FALSE)

0
Entering edit mode

Oh my, such a quick response! Thank you very much!

1
Entering edit mode

You're welcome! Hope that it helps.

0
Entering edit mode

Kevin I need a little bit more help :D, I am newbie at bioinformatics and especially at working with R.

When I use function:

targetinfo <- readTargets("Targets.txt", sep="\t")


I get this error:

Error in read.table(file, header = TRUE, stringsAsFactors = FALSE, sep = sep,  :
more columns than column names


I am working with GSE75915 data.

2
Entering edit mode

daniel.naumovas: Have you considered using GEO2R which is a built-in analysis application available from NCBI GEO?

1
Entering edit mode

I was going to suggest something like that because the files that the group have made available are not recognised as they don't have the expected column names for the 2-colour Agilent array. It looks like they have been processed to a certain point.

I can't make my own data available as it's protected by agreements.

0
Entering edit mode

I have tried GEO2R, but you cant do normalization with it :/ Need normalized data to work with WGCNA

2
Entering edit mode

Hello again, the data is already normalised.

GSE75915 <- read.csv("GSE75915_series_matrix.txt", skip=76, header=TRUE, sep="\t", stringsAsFactors=FALSE)
GSE75915 <- GSE75915[grep("A_", GSE75915[,1]),] #Indirectly remove control probes
rownames(GSE75915) <- GSE75915[,1]
GSE75915 <- GSE75915[,-1]
dim(GSE75915)
[1] 34137    18
boxplot(GSE75915, pch=".")
hist(data.matrix(GSE75915), breaks=150)


If you want to annotate these probe IDs, I suggest that you do it manually by downloading the human annotation from Agilent's website, here. Alternatively, take a look at the annotate package in R.

Kevin

1
Entering edit mode

I added code markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

0
Entering edit mode

Hello, my targets file just contains this (tab-delimited):

FileName               WT_KO  Time
SampleFiles/File1.txt  WT     4Wk
SampleFiles/File2.txt  KO     4Dy
SampleFiles/File3.txt  KO     7Dy
SampleFiles/File4.txt  WT     4Wk
..
..


The files (txt or CEL) are in a directory called SampleFiles. I think that the column name has to be FileName, whist the others can be anything (these others are used for limma).

Does that help?

0
Entering edit mode

Hmm I cant think which column should be named FileName. Maybe your data is available in GEO db so I could use it as example?

0
Entering edit mode

Hello Dear Obi Griffith,

I am beginner using bioconductor, I like so much your workflow is very explanatory, however in the end based on the data that are generated the file rma.txt how do I analyze using the limma pack? Would you have some workflow as good as yours that I can follow?

Note: I am working with two normal states and tumor, how to compare the expression between them?

0
Entering edit mode

Hi! Does somebody know if the tutorial from the Bioinformatics KnowledgeBlog is still available somewhere? I cannot open it anymore and I am struggling with the exact same task.. Thanks a lot!

0
Entering edit mode

Hello sir,

Could you please suggest to me any R script for microarray miRNA analysis of the Exiqon miRCURY LNA microRNA array dataset?

Thank you.