Question: Preparing Microarray Data to analyze
1
gravatar for shinobee
2.2 years ago by
shinobee40
shinobee40 wrote:

I couldn't find a clear solution how to prepare microarray data to analyze it. I have 2 examples:

1.  install.packages("plsgenomics")
library("plsgenomics")
data(Colon)
data<-t(Colon$X)
rownames(data)<-Colon$gene.names
grp<-Colon$Y
grp = as.factor(grp)

If we check boxplot we see IQR's of each 62 samples do not match.

boxplot(data)

To solve this problem I tried following code, is it correct?

data = log(data)
data = scale(data)

If we check the side by side boxplot again we can observe IQR and medians match. We can also check mean and sd of each samples:

apply(data, 2,mean)
apply(data, 2, sd)

All means are around 0 and sd's are around 1.

2. source("http://bioconductor.org/biocLite.R")
biocLite("golubEsets")
library('golubEsets')  

two_classes <- TRUE    
# The training data set.
data('Golub_Train')
x <- t(exprs(Golub_Train))
y <- as.vector(pData(Golub_Train)$ALL.AML)

if(!two_classes) {
  y[pData(Golub_Merge)$T.B.cell == "B-cell"] <- "ALL-B"
  y[pData(Golub_Merge)$T.B.cell == "T-cell"] <- "ALL-T"
}

golub_train <- list(x = x, y = factor(y))

# The test data set.
data('Golub_Test')
x <- t(exprs(Golub_Test))
y <- as.vector(pData(Golub_Test)$ALL.AML)

if(!two_classes) {
  y[pData(Golub_Test)$T.B.cell == "B-cell"] <- "ALL-B"
  y[pData(Golub_Test)$T.B.cell == "T-cell"] <- "ALL-T"
}

golub_test <- list(x = x, y = factor(y))


# The combined training and test data sets.
data('Golub_Merge')
x <- t(exprs(Golub_Merge))
y <- as.vector(pData(Golub_Merge)$ALL.AML)

if(!two_classes) {
  y[pData(Golub_Merge)$T.B.cell == "B-cell"] <- "ALL-B"
  y[pData(Golub_Merge)$T.B.cell == "T-cell"] <- "ALL-T"
}

golub <- list(x = x, y = factor(y))
data<-t(golub$x)
grp<-golub$y

I want to use a variance filter first and select top 2000 genes.

RowVar <- function(x) {
rowSums((x - rowMeans(x))^2)/(dim(x)[2] - 1)
}
rvar = RowVar(data)
qt <- quantile(rvar,probs = 1-2000/dim(data)[1],na.rm = T)
data <- data[apply( data , 1 , function(x) var(x) > qt[1] ),]

Again let's check side by side boxplot we observe IQR and medians mismatch again. We can't take the log directly as there are negative values. I use following idea:

data = data - min(data) + 0.01
data = log(data)
data = scale(data)

We can't see the match of IQR and match of medians like we did in the first example, but results are much better than raw expressions.

Do you think they are correct or have any better-preprocessing suggestions?

R • 810 views
ADD COMMENTlink written 2.2 years ago by shinobee40

You could check limma package.

ADD REPLYlink written 2.2 years ago by boczniak767640
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 761 users visited in the last hour