Question: Remove Batch Effect From RNAseq with SVAseq and Combat
13
gravatar for cankutcubuk
3.4 years ago by
cankutcubuk170
Spain
cankutcubuk170 wrote:

Hi,

I need your help and comments to remove batch effect from RNA-Seq data. 

In total I have almost 1000 RNA-Seq data of different cancer types. Each cancer type has control and tumor groups. There is 1 batch covariate. The idea is to remove the batch effect from all data at once.

First question: I want to remove the batch effect at once for all samples, does this make sense? Or should I remove the batch effect from each cancer type separately?

Second Question: How should I construct my model matrice properly? 

Until now I used Combat and SVAseq. And for SVAseq, I tried different model matrices. Below I shared the results with you.

my_batch” = “3_Lab” have 3 levels as “Laboratory-A”, “Laboratory-B” and “ Laboratory-C” 
“tumor_normal” has 2 levels as “tumor” and “normal”
“cancer_type” has 9 levels as “Cancer1”, “Cancer2”,...,”Cancer9”
“All_cancer” is data matrix of the read counts of all samples

# For Combat

my_DGEList <- DGEList(counts=All_cancer)
my_mod = model.matrix(~as.factor(tumor_normal))
All_cancer_norm <- calcNormFactors(my_DGEList, method="upperquartile")
my_data = cpm(All_cancer_norm, log=TRUE, prior.count=2)
combat <- ComBat(dat=my_data, batch=my_batch, mod=my_mod)

# For SVAseq

my_data2 <-log(as.matrix(All_cancer) +1)
mod1 = model.matrix(~as.factor(tumor_normal))
mod0 = cbind(mod1[,1])
my_n_sv = num.sv(my_data2,mod1,method="leek")
my_svseq = svaseq(my_data2,mod1,mod0,n.sv=my_n_sv)
my_sv<-my_svseq$sv

.....remove these surrogate variables(my_sv) from data. The same approach for SVAseq was repated with different model matrices.
The function to remove surrogate variables which I used can be found here : https://www.biostars.org/p/121489/

 

##### Results #####

### Uncorrected (Data with batch effect)

As you can see left plot colored by batch covariate and there are clusters based on the batch.

http://i57.tinypic.com/dfuk37.png

uncorrected

### Corrected Data

## Combat

http://i60.tinypic.com/rrmnp1.png

combat_separated

## svaseq no:1

mod1 = model.matrix(~as.factor(tumor_normal)) 

my_n_sv=5

http://i57.tinypic.com/fupc35.png

svaseq1

## svaseq no:2  

mod1 = model.matrix(~as.factor(3_Lab))

my_n_sv=4

http://i58.tinypic.com/sqh93n.png

svaseq2

## svaseq no:3

mod1 = model.matrix(~as.factor(tumor_normal) + as.factor(3_Lab))

my_n_sv=4

http://i61.tinypic.com/2zjj3er.png

svaseq3

## svaseq no:4

mod1 = model.matrix(~as.factor(3_Lab)+as.factor(cancer_type))  

gives error:

Error in solve.default(t(mod) %*% mod) :  
 system  is computationally singular: reciprocal condition number = 1.36348e-18

## svaseq no:5

mod1 = model.matrix(~as.factor(tumor_normal)+as.factor(cancer_type))

my_n_sv=1

http://i62.tinypic.com/14b1d06.png

svaseq5

## svaseq no:6

mod1 = model.matrix(~as.factor(tumor_normal)+as.factor(3_Lab)+as.factor(cancer_type)))

gives error:

Error in solve.default(t(mod) %*% mod) : 
system is computationally singular: reciprocal condition number = 3.45223e-18

## svaseq no:7

mod1 = model.matrix(~as.factor(cancer_type))

my_n_sv=1

http://i61.tinypic.com/2wom4gg.png

svaseq7

I am looking forward to read your precious comments.

Cheers

Cankut

 

ADD COMMENTlink modified 2.0 years ago by amoltej80 • written 3.4 years ago by cankutcubuk170

I am wondering why you are log transforming the data that you input to svaseq? Isn't it already transforming it internally? Would it effect the results?

ADD REPLYlink written 2.6 years ago by yasinkaymaz0
1

Hi Yasin, Sorry for my late reply. You are right. svaseq does it internally. I missed this part. You can skip the log transformation before using svaseq. Taken from svaseq: The function takes log(dat + constant) before performing sva. By default constant = 1, all values of dat + constant should be positive.

Cheers Cankut

ADD REPLYlink written 2.4 years ago by cankutcubuk170

@cankutcubuk and @yasinkaymaz Since the data has internally been transformed to log format, does this mean we can not analyze the data using DESeq2 or edgeR or Limma? Limma uses voom to transform the data and has a weight matrix. Other two package require count data. If I want to use these three packages to analyze the data, what should I do? If the three packages are not available, then what step should I do to process the data? Thanks.

ADD REPLYlink written 2.4 years ago by zhenyisong120
2
gravatar for raunakms
3.4 years ago by
raunakms960
Vancouver, BC, Canada
raunakms960 wrote:

I would really suggest that instead of removing the batch effect from all the samples at once, try to remove batch effect from each cancer type separately and merge the results later. Remember that there are two thing going on simultaneously (biological variability) + (batch effect). The key thing to remember is that you do not want to mess-up too much with the biological variability inherent in the samples while correcting for the batch effects. You can check this by using a simple PCA based clustering of you samples before or after correcting for batch effect. Check how the batch labels and phenotype labels (normal-tumor) are clustered and judge for yourself if the batch correcting tool really removed the batch effect or messed up with the biological variation.

Another thing to keep in mind is that such batch effect correction may or may not be useful. It may work for some cancer-types and may-not work for other types.There are times when correcting batch effect correction was not useful for my data. There is no definite answer accepting or rejecting a batch effect correction.

ADD COMMENTlink modified 3.4 years ago • written 3.4 years ago by raunakms960

Hi Raunakms, thanks for your reply.

Cheers

ADD REPLYlink written 3.1 years ago by cankutcubuk170
0
gravatar for amoltej
2.0 years ago by
amoltej80
Australia
amoltej80 wrote:

Hi, did you find right answer for the question? which way is right? and what if n.sv. number is really big? I have 88 samples. when I am using "leek" method to calculate n.sv = 85 but with "be" method n.sv=1. I am not understanding why there is so much difference? And as Raunakms commented, it should be handled treatment wise to remove batch effect and not as a whole data set, it seems logical. but did you find it useful when doing it? Thanx

ADD COMMENTlink written 2.0 years ago by amoltej80
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 573 users visited in the last hour