Preparing Microarray Data to analyze
0
1
Entering edit mode
7.3 years ago
shinobee ▴ 60

I couldn't find a clear solution how to prepare microarray data to analyze it. I have 2 examples:

1.  install.packages("plsgenomics")
library("plsgenomics")
data(Colon)
data<-t(Colon$X)
rownames(data)<-Colon$gene.names
grp<-Colon$Y
grp = as.factor(grp)

If we check boxplot we see IQR's of each 62 samples do not match.

boxplot(data)

To solve this problem I tried following code, is it correct?

data = log(data)
data = scale(data)

If we check the side by side boxplot again we can observe IQR and medians match. We can also check mean and sd of each samples:

apply(data, 2,mean)
apply(data, 2, sd)

All means are around 0 and sd's are around 1.

2. source("http://bioconductor.org/biocLite.R")
biocLite("golubEsets")
library('golubEsets')  

two_classes <- TRUE    
# The training data set.
data('Golub_Train')
x <- t(exprs(Golub_Train))
y <- as.vector(pData(Golub_Train)$ALL.AML)

if(!two_classes) {
  y[pData(Golub_Merge)$T.B.cell == "B-cell"] <- "ALL-B"
  y[pData(Golub_Merge)$T.B.cell == "T-cell"] <- "ALL-T"
}

golub_train <- list(x = x, y = factor(y))

# The test data set.
data('Golub_Test')
x <- t(exprs(Golub_Test))
y <- as.vector(pData(Golub_Test)$ALL.AML)

if(!two_classes) {
  y[pData(Golub_Test)$T.B.cell == "B-cell"] <- "ALL-B"
  y[pData(Golub_Test)$T.B.cell == "T-cell"] <- "ALL-T"
}

golub_test <- list(x = x, y = factor(y))


# The combined training and test data sets.
data('Golub_Merge')
x <- t(exprs(Golub_Merge))
y <- as.vector(pData(Golub_Merge)$ALL.AML)

if(!two_classes) {
  y[pData(Golub_Merge)$T.B.cell == "B-cell"] <- "ALL-B"
  y[pData(Golub_Merge)$T.B.cell == "T-cell"] <- "ALL-T"
}

golub <- list(x = x, y = factor(y))
data<-t(golub$x)
grp<-golub$y

I want to use a variance filter first and select top 2000 genes.

RowVar <- function(x) {
rowSums((x - rowMeans(x))^2)/(dim(x)[2] - 1)
}
rvar = RowVar(data)
qt <- quantile(rvar,probs = 1-2000/dim(data)[1],na.rm = T)
data <- data[apply( data , 1 , function(x) var(x) > qt[1] ),]

Again let's check side by side boxplot we observe IQR and medians mismatch again. We can't take the log directly as there are negative values. I use following idea:

data = data - min(data) + 0.01
data = log(data)
data = scale(data)

We can't see the match of IQR and match of medians like we did in the first example, but results are much better than raw expressions.

Do you think they are correct or have any better-preprocessing suggestions?

R • 1.8k views
ADD COMMENT
0
Entering edit mode

You could check limma package.

ADD REPLY

Login before adding your answer.

Traffic: 1765 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6