Question: Preparing Microarray Data to analyze
1
gravatar for shinobee
21 months ago by
shinobee40
shinobee40 wrote:

I couldn't find a clear solution how to prepare microarray data to analyze it. I have 2 examples:

1.  install.packages("plsgenomics")
library("plsgenomics")
data(Colon)
data<-t(Colon$X)
rownames(data)<-Colon$gene.names
grp<-Colon$Y
grp = as.factor(grp)

If we check boxplot we see IQR's of each 62 samples do not match.

boxplot(data)

To solve this problem I tried following code, is it correct?

data = log(data)
data = scale(data)

If we check the side by side boxplot again we can observe IQR and medians match. We can also check mean and sd of each samples:

apply(data, 2,mean)
apply(data, 2, sd)

All means are around 0 and sd's are around 1.

2. source("http://bioconductor.org/biocLite.R")
biocLite("golubEsets")
library('golubEsets')  

two_classes <- TRUE    
# The training data set.
data('Golub_Train')
x <- t(exprs(Golub_Train))
y <- as.vector(pData(Golub_Train)$ALL.AML)

if(!two_classes) {
  y[pData(Golub_Merge)$T.B.cell == "B-cell"] <- "ALL-B"
  y[pData(Golub_Merge)$T.B.cell == "T-cell"] <- "ALL-T"
}

golub_train <- list(x = x, y = factor(y))

# The test data set.
data('Golub_Test')
x <- t(exprs(Golub_Test))
y <- as.vector(pData(Golub_Test)$ALL.AML)

if(!two_classes) {
  y[pData(Golub_Test)$T.B.cell == "B-cell"] <- "ALL-B"
  y[pData(Golub_Test)$T.B.cell == "T-cell"] <- "ALL-T"
}

golub_test <- list(x = x, y = factor(y))


# The combined training and test data sets.
data('Golub_Merge')
x <- t(exprs(Golub_Merge))
y <- as.vector(pData(Golub_Merge)$ALL.AML)

if(!two_classes) {
  y[pData(Golub_Merge)$T.B.cell == "B-cell"] <- "ALL-B"
  y[pData(Golub_Merge)$T.B.cell == "T-cell"] <- "ALL-T"
}

golub <- list(x = x, y = factor(y))
data<-t(golub$x)
grp<-golub$y

I want to use a variance filter first and select top 2000 genes.

RowVar <- function(x) {
rowSums((x - rowMeans(x))^2)/(dim(x)[2] - 1)
}
rvar = RowVar(data)
qt <- quantile(rvar,probs = 1-2000/dim(data)[1],na.rm = T)
data <- data[apply( data , 1 , function(x) var(x) > qt[1] ),]

Again let's check side by side boxplot we observe IQR and medians mismatch again. We can't take the log directly as there are negative values. I use following idea:

data = data - min(data) + 0.01
data = log(data)
data = scale(data)

We can't see the match of IQR and match of medians like we did in the first example, but results are much better than raw expressions.

Do you think they are correct or have any better-preprocessing suggestions?

R • 685 views
ADD COMMENTlink written 21 months ago by shinobee40

You could check limma package.

ADD REPLYlink written 21 months ago by boczniak767630
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 825 users visited in the last hour