GEO data preprocessing with fRMA, limma and sva packages
1
1
Entering edit mode
6.6 years ago
landscape95 ▴ 190

Hello, everyone, I am trying to reproduce the lncRNA get from the paper "Portraying breast cancer with lncRNAs" by Olivier et. al. Here I came to the reproducing step:

enter image description here

I don't know whether my code is doing right, can someone give me your comment?

# load RAW files
RAW_files = list.files(path = "./RAWsets", pattern = "_RAW.tar")
# RNA.list = list of RAW file after processed by Frozen Robust Multiarray Analysis
RNA.list = list()
# iterate through the .CEL files
for (RAWstr in RAW_files) {
  print(RAWstr)
  # extract those raw .tar files
  RAWdirectory = paste("./RAWsets", substr(RAWstr, 1, nchar(RAWstr) - 8), sep = "/")
  tardir = paste("./RAWsets", RAWstr, sep = "/")
  untar(tardir, exdir = RAWdirectory, verbose = F)
  # extract those extracted .CEL files
  cels = list.files(RAWdirectory, pattern = "CEL.gz", full.names = T)
  sapply(cels, gunzip)
  # read all cel files to data.raw variable
  data.raw = ReadAffy(filenames = substr(cels, 1, nchar(cels)-3), verbose = F, cdfname = "hgu133plus2cdf")
  # frozen robust multiarray analysis
  data.matrix = frma(data.raw)
  # add to RNA.list as described above
  tmp = list(data.matrix)
  names(tmp) = substr(RAWstr, 1, nchar(RAWstr) - 8)
  RNA.list = append(RNA.list, tmp)
}

### Combine into 1 matrix of all data sets
RNAdata = as.matrix(exprs(RNA.list[[1]]))
for (i in 2:length(RNA.list)) {
  RNAdata = CombineMatrix(RNAdata, as.matrix(exprs(RNA.list[[i]])))
}
# normalize between arrays on a data of fRMA before
RNAdata_scaled = normalizeBetweenArrays(RNAdata)
# because samples names are GSMxxxxxx.CEL => erase .CEL extension
colnames(RNAdata_scaled) = substr(colnames(RNAdata_scaled), 1, 9)

# load data from table S1, S2, S3, S7
load("tables_set.RData")
# mutual index between total data and tableS1 which have clinical information
tableS1 = tableS1[match(colnames(RNAdata_scaled), tableS1[, 1]), ]
# compute batch data using tumor vs normal tissue as covariate
batch = as.numeric(lapply(tableS1[, 3], function(str) if (str == "normal") return(1) else return(2)))
# applied Combat algorithm in sva library with default parameters to adjust data for batch effects
RNAdata_scaled = ComBat(dat = RNAdata_scaled, batch = batch)

Your help is really appreciated!

GEO • 2.9k views
ADD COMMENT
2
Entering edit mode
6.6 years ago
lessismore ★ 1.3k

Hey, it seems that what you are setting as batch is in reality the covariate value (which is defined as the condition or "variable of interest" in the linear model). This is the code for applying ComBat if you specify the batches and the conditions that you have in your dataset

sample <- Pheno$Sample.Name
batch <- Pheno$Batch
condition <- Pheno$Covariate.1
pdata <- data.frame(sample, batch, condition)
modmatrix <- model.matrix(~as.factor(condition), data=pdata)
combat_data_matrix <- ComBat(dat=yourmatrix, batch=batch, mod=modmatrix)
ADD COMMENT
0
Entering edit mode

Thank you, the thing is I don't know how to set the batch values as the paper's description: "using tumor vs normal as covariate". Could you give me some idea?

ADD REPLY
1
Entering edit mode

Ex.

If you have 4 experiments (represented by 4 GEO accession number), each with 2 replicates and with 2 conditions (tumor and normal), your pdata file will look like this:

sample    batch  condition(or covariate)
sample1   GEOn1  normal
sample2   GEOn1  tumor
sample3   GEOn1  normal
sample4   GEOn1  tumor
sample5   GEOn2  normal
sample6   GEOn2  tumor
sample7   GEOn2  normal
sample8   GEOn2  tumor
sample9    GEOn3  normal
sample10   GEOn3  tumor
sample11   GEOn3  normal
sample12   GEOn3  tumor
sample13   GEOn4  normal
sample14   GEOn4  tumor
sample15   GEOn4  normal
sample16   GEOn4  tumor
ADD REPLY
0
Entering edit mode

Thank you very much, I fix the code to this, could you have a comment? Pheno contains 3 columns, for example: "GSM272133 1 normal" is the sample GSM272133 comes from my first GEO data set GSE10780 (I mark it as 1) and it is a normal sample. "GSM272972 2 cancer" is the sample GSM272972 comes from my second GEO data set GSE10810 (I mark it as 2) and it is a tumor sample. Then I apply this code, Is this ok?

colnames(Pheno) = c("sample", "batch", "covariate")
sample = Pheno[, 1]
batch = Pheno[, 2]
condition = Pheno[, 3]
pdata = data.frame(sample, batch, condition)
modmatrix = model.matrix(~as.factor(condition), data=pdata)

# Apply Combat algorithm in sva library with default parameters to adjust data for batch effects
RNAdata_scaled = ComBat(dat=RNAdata, batch=batch, mod=modmatrix)
ADD REPLY
0
Entering edit mode

Looks good, then look with a PCA how your samples are clustering now and compare with the clustering pre-batch-effect correction

autoplot(prcomp(t(RNAdata_scaled)), data=Pheno, colour="covariate")
ADD REPLY
0
Entering edit mode

Thank you. I have another question on "using the limma and fRMA libraries to obtain log2-normalized expression signals for each probe set". Did I do it right in the code, would you please have a look in my code? I am not sure about 2 lines:

data.matrix = frma(data.raw)

and

 RNAdata_scaled = normalizeBetweenArrays(RNAdata)

Thank you!

ADD REPLY
0
Entering edit mode

Never used the frozen RMA. With simple RMA:

#Setwd in the folder of your CEL files
library (Affy)
Data <- ReadAffy(cdfname = "yourcdfname")

#' Normalization
eset <- rma(Data)
norm.data <- exprs(eset))
ADD REPLY

Login before adding your answer.

Traffic: 2004 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6