Question

How to remove unknown batch-batch effects from GEO datasets

0

Entering edit mode

16 months ago

2317711200 • 0

enter image description here

Regarding the batch effect in GEO datasets, I attempted to explore the batches using the sva package in the absence of known factors contributing to the batch effect.

However, after removing the batch effect using limma::removeBatchEffect, the results were significantly different from expected. My objective is to quickly obtain the exprsData after removing the batch effect, and perform further analysis beyond just differential expression analysis.

As my professional knowledge is limited, I would like to seek advice from the forum teachers on the following:

What method should I use to quickly obtain the exprsData after removing the batch effect, while taking into consideration pairing and grouping factors during the removal process?
Can I use the RUVg package to explore the batch effect in chip data?

Thank you in advance for your help.

batch-effect GEO • 1.7k views

ADD COMMENT • link updated 16 months ago by Ram 44k • written 16 months ago by 2317711200 • 0

0

Entering edit mode

As a beginner, I hope to receive professional answers. If necessary, I can provide rewards and R code.

ADD REPLY • link 16 months ago by 2317711200 • 0

1

Entering edit mode

The righthand plot might be totally fine once you remove that one outlier on the left, try that first. The batch effect manifesting along PC1 is quite obvious, simply defining the points on the left (below x=-0.025 for example) as batch 1 and the rest as batch 2 might already do the trick. What is the data and what do you want to analyze?

ADD REPLY • link 16 months ago by ATpoint 84k

0

Entering edit mode

It is a great honor to receive your response.

The data in question is sourced from GSE53625. While using arrayQualityMetrics to assess the quality of the data, I noticed batch effects. However, I was unable to identify the known batch effect factors for this dataset.

I would like to remove the batch effects from the data to be used in survival analysis, WGCNA, immune infiltration analysis, and other applications. Therefore, I hope to eliminate the batch effects and bring the data closer to its true distribution. enter image description here

ADD REPLY • link 16 months ago by 2317711200 • 0

0

Entering edit mode

R code:

library(tidyverse)

####数据获取####
#引文
#citation(package = "GEOquery")
#数据提取
GSE53625_matrix<-GEOquery::getGEO(GEO = "GSE53625", destdir = "GEO/GSE53625", getGPL = F)
GSE53625_pData<-Biobase::pData(GSE53625_matrix$GSE53625_series_matrix.txt.gz)
GSE53625_exprs<-Biobase::exprs(GSE53625_matrix$GSE53625_series_matrix.txt.gz)

#整理pData
colnames(GSE53625_pData)
table(GSE53625_pData[,"characteristics_ch1.16"])

GSE53625_pData_fix<-
  dplyr::select(GSE53625_pData,
                Adjuvant_therapy="adjuvant therapy:ch1",Age="age:ch1"   ,               
                Alcohol_use="alcohol use:ch1"         , Anastomotic_leak="anastomotic leak:ch1"   ,  
                Arrhythmia="arrhythmia:ch1"          ,  Status="death at fu:ch1"      ,    
                N_stage="n stage:ch1"       ,       
                Patient_id="patient id:ch1"          , Pneumonia= "pneumonia:ch1"       ,     
                Gender="Sex:ch1"                 ,  Survival_months="survival time(months):ch1",
                T_stage="t stage:ch1"            ,  Tissue_type= "tissue:ch1"         ,      
                TNM_stage="tnm stage:ch1"           , Tobacco_use= "tobacco use:ch1"     ,     
                Grade="tumor grade:ch1"         , Tumor_loation= "tumor loation:ch1" )

GSE53625_pData_fix$Tissue_type<-substr(GSE53625_pData_fix$Tissue_type,1,6)
for (i in 1:ncol(GSE53625_pData_fix)) {
  if (!names(GSE53625_pData_fix)[i]%in%c("Survival_months","Age")) {
    GSE53625_pData_fix[,i]<-as.factor(GSE53625_pData_fix[,i])
    print(names(GSE53625_pData_fix)[i])
    print(table(GSE53625_pData_fix[,i]))
  }

}


####数据标准化####

#查看数据是否已经标准化，不确定可以询问chatGPT：GSE53625的数据是否已经经过标准化处理？
data_processing<-grep(pattern = "process",colnames(GSE53625_pData),value = T)
GSE53625_pData[data_processing][1,]#抽取数据如果不加“，”  返回会是数据框
#查看supplementary_file
GSE53625_pData$supplementary_file[1:8]


#box plot
ggplot(as.data.frame(
  GSE53625_exprs[,1:10])%>%gather(key=samples,value=expression),
  aes(samples,expression))+
  geom_boxplot(color="#457888",alpha=0.6)+
  theme_classic()

#AQM
library(arrayQualityMetrics)
identical(base::colnames(GSE53625_exprs),base::rownames(GSE53625_pData_fix))

set1<-Biobase::ExpressionSet(
  assayData = GSE53625_exprs, #exprs
  phenoData = Biobase::AnnotatedDataFrame(
    GSE53625_pData_fix))#pdata

GSE53625_AQM_arrayTable<-arrayQualityMetrics(set1,
                                             outdir = "GEO/GSE53625/GSE53625_QC_gcrma", 
                                             force = TRUE,
                                             intgroup = "Tissue_type",
                                             do.logtransform = F)$arrayTable

#查看质量不佳样本
colnames(GSE53625_AQM_arrayTable)
colnames(GSE53625_pData_fix)
outlier_detection_table<-
  dplyr::select(GSE53625_AQM_arrayTable,-c("array",
                                           colnames(GSE53625_pData_fix)[!colnames(GSE53625_pData_fix)%in% 
                                                                          c("Tissue_type")]))#只留sample_id


outlier_detection_table<- dplyr::select(outlier_detection_table,"sampleNames","Tissue_type",everything())
colnames(outlier_detection_table)
str(outlier_detection_table)
for (i in 3:ncol(outlier_detection_table)) { #起始列可能要修改
  outlier_detection_table[,i]<-ifelse(outlier_detection_table[,i]=="",0,1)
};rm(i)
str(outlier_detection_table)
outlier_detection_table$sum<-apply(dplyr::select(outlier_detection_table,
                                                 -c("sampleNames","Tissue_type")),
                                   MARGIN = 1,sum)
#查看
outlier_detection_table[outlier_detection_table$sum>=1,]#必须有逗号！
dev.off();dev.off()

BiocGenerics::plotPCA(EDASeq::newSeqExpressionSet(
  counts = as.matrix(GSE53625_exprs_Rbatch), #表达矩阵
  phenoData = GSE53625_pData_fix),#pdata
  col=RColorBrewer::brewer.pal(4, "Set2")
  [as.factor(GSE53625_pData_fix[,"Tissue_type"])],#颜色因子向量
  cex=1, pch = 19, labels = F, main = "set1")


library(sva)
mod <- model.matrix(~ as.factor(GSE53625_pData_fix[,"Tissue_type"]) 
                    )#这个相当于design矩阵
mod0 <- model.matrix(~ 1, data = GSE53625_pData_fix)

n.sv <- num.sv(dat = GSE53625_exprs, mod = mod, method="leek")#来看主要有几个批次效应
n.sv
svobj <- sva(as.matrix(GSE53625_exprs), mod, mod0, n.sv = n.sv)#计算替代变量
modSv <- as.data.frame(cbind(mod, svobj$sv))
str(modSv )


GSE53625_exprs_Rbatch<-limma::removeBatchEffect(
  x=GSE53625_exprs,
  batch =modSv$V3,
  batch2 = modSv$V4
)

ADD REPLY • link updated 16 months ago by Ram 44k • written 16 months ago by 2317711200 • 0

score 1 · Answer 1 · 2023-03-30

removeBatchEffect() wouldn't produce the outlier effect that you show on the right unless there was something seriously wrong with the batch effect variable that you input to it.

The problem in this case is that you have taken continous surrogate variables from sva, and then input them into removeBatchEffect() as if they were factors. This mistake will result in really crazy results. Surrogate variables have to be treated as covariates, not as factors. See the removeBatchEffect help page.

You have also failed to input the design matrix of experimental factors that need to be preserved into removeBatchEffect().

In any case, as ATpoint remarks, using sva() is perhaps overkill here anyway. The major batch effect is quite obvious.