Question: Using R To Perform Statistical Tests On Microarray Data And Cluster The Results
gravatar for Walter Jessen
9.5 years ago by
Walter Jessen90 wrote:

In the past, I've used GeneSpring to analyze microarray data. Now I'm teaching myself how to do it all (normalization, statistical testing and clustering) with R. I've worked out everything up to the very end and have several specific questions. Also, please let me know if there's an easier or better way to do something.

I started off with a simple experiment: 25 CEL files - 9 normal samples, 16 Parkinson's disease samples.

My goal is to RMA the data, perform a t-test and cluster the results at a given statistical cutoff.

Here's the code I've been using:


I can't figure out how to cluster the samples using euclidian distance. I can't find any documentation on the different types of metrics available or how to call them (scluster=standard.euclidian doesn't work). Also, I'd like to reference the expression changes to the normal samples such that the normals all show expression around 0 and the PD samples reflect changes up or down from normal. I'm not sure how to do this in R.

For a more complex data set, I couldn't figure out how to use the genefilter package to do an anova (hence the use of the stats package). For this example, I'm using an experiment with 48 CEL files (3 samples at each dose and day) divided into 4 doses (control, doses 1-3) collected on 4 days.

My goal is to RMA the data, perform a one-way anova to identify genes differentially expressed between days and/or doses, and cluster the results at a given statistical cutoff.

Here's the code I've been using:

lm.coef<-function(y) lm(y ~ factor(Day) * factor(Dose), data=pData(pd))$coefficients

Everything works up to this point. However, I can't figure out how to filter the anova results (say, to only look at genes <0.001) and cluster them in a heat map.

Any examples or advice would be greatly appreciated!

gene R affymetrix heatmap • 7.3k views
ADD COMMENTlink modified 4.6 years ago by Biostar ♦♦ 20 • written 9.5 years ago by Walter Jessen90
gravatar for David Quigley
9.5 years ago by
David Quigley11k
San Francisco
David Quigley11k wrote:

The standard dist() function in R uses Euclidian distance by default; to cluster the columns of a matrix, you could use

 d = dist( t(M) ) 
 h = hclust(d)

For your second question, it sounds like what you want to do is standardize your data set against one set ("normal" samples) on a gene-by-gene basis. I suppose you could calculate the means of just those samples and then mean-center all samples against that mean. To do this with rowMeans, assuming the first 9 samples are your normals, do something like

normal.means = rowMeans(M[,1:9])
M.normals.standardized = M - normal.means

For log2 data, this would convert your results to log change from mean normal; if your normals are homogenous, that should bend the plot the way you want. You would have to explain pretty carefully what you did.

For your filtering question, you can grab the p-values out of a linear model with the summary() function. Example:

x = c(1,2,3,4,5)
y = c(1,2,5,7,5)
summary(lm(x~y))$coefficients[2,4]  # 0.07, P value of fit for Y

You should be able to adapt that to filter a more complex model. Remember to think about multiple testing issues.

ADD COMMENTlink written 9.5 years ago by David Quigley11k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1940 users visited in the last hour