Hi,

I need your help and comments to remove batch effect from RNA-Seq data.

In total I have almost 1000 RNA-Seq data of different cancer types. Each cancer type has control and tumor groups. There is 1 batch covariate. The idea is to remove the batch effect from all data at once.

**First question:** I want to remove the batch effect at once for all samples, does this make sense? Or should I remove the batch effect from each cancer type separately?

**Second Question: **How should I construct my model matrice properly?

Until now I used Combat and SVAseq. And for SVAseq, I tried different model matrices. Below I shared the results with you.

“my_batch” = “3_Lab” have 3 levels as “Laboratory-A”, “Laboratory-B” and “ Laboratory-C” “tumor_normal” has 2 levels as “tumor” and “normal” “cancer_type” has 9 levels as “Cancer1”, “Cancer2”,...,”Cancer9” “All_cancer” is data matrix of the read counts of all samples

*# For Combat*

my_DGEList <- DGEList(counts=All_cancer) my_mod = model.matrix(~as.factor(tumor_normal)) All_cancer_norm <- calcNormFactors(my_DGEList, method="upperquartile") my_data = cpm(All_cancer_norm, log=TRUE, prior.count=2) combat <- ComBat(dat=my_data, batch=my_batch, mod=my_mod)

**# For SVAseq**

my_data2 <-log(as.matrix(All_cancer) +1) mod1 = model.matrix(~as.factor(tumor_normal)) mod0 = cbind(mod1[,1]) my_n_sv = num.sv(my_data2,mod1,method="leek") my_svseq = svaseq(my_data2,mod1,mod0,n.sv=my_n_sv) my_sv<-my_svseq$sv .....remove these surrogate variables(my_sv) from data. The same approach for SVAseq was repated with different model matrices. The function to remove surrogate variables which I used can be found here : https://www.biostars.org/p/121489/

**##### Results #####**

**### Uncorrected (Data with batch effect)**

As you can see left plot colored by batch covariate and there are clusters based on the batch.

http://i57.tinypic.com/dfuk37.png

**### Corrected Data**

**## Combat**

http://i60.tinypic.com/rrmnp1.png

**## svaseq no:1**

mod1 = model.matrix(~as.factor(tumor_normal))

my_n_sv=5

http://i57.tinypic.com/fupc35.png

**## svaseq no:2 **

mod1 = model.matrix(~as.factor(3_Lab))

my_n_sv=4

http://i58.tinypic.com/sqh93n.png

**## svaseq no:3**

mod1 = model.matrix(~as.factor(tumor_normal) + as.factor(3_Lab))

my_n_sv=4

http://i61.tinypic.com/2zjj3er.png

**## svaseq no:4**

mod1 = model.matrix(~as.factor(3_Lab)+as.factor(cancer_type))

gives error:

```
Error in solve.default(t(mod) %*% mod) :
system is computationally singular: reciprocal condition number = 1.36348e-18
```

**## svaseq no:5**

mod1 = model.matrix(~as.factor(tumor_normal)+as.factor(cancer_type))

my_n_sv=1

http://i62.tinypic.com/14b1d06.png

**## svaseq no:6**

mod1 = model.matrix(~as.factor(tumor_normal)+as.factor(3_Lab)+as.factor(cancer_type)))

gives error:

```
Error in solve.default(t(mod) %*% mod) :
system is computationally singular: reciprocal condition number = 3.45223e-18
```

**## svaseq no:7**

mod1 = model.matrix(~as.factor(cancer_type))

my_n_sv=1

http://i61.tinypic.com/2wom4gg.png

I am looking forward to read your precious comments.

Cheers

Cankut

I am wondering why you are log transforming the data that you input to svaseq? Isn't it already transforming it internally? Would it effect the results?

Hi Yasin, Sorry for my late reply. You are right. svaseq does it internally. I missed this part. You can skip the log transformation before using svaseq. Taken from svaseq: The function takes log(dat + constant) before performing sva. By default constant = 1, all values of dat + constant should be positive.

Cheers Cankut

@cankutcubuk and @yasinkaymaz Since the data has internally been transformed to log format, does this mean we can not analyze the data using DESeq2 or edgeR or Limma? Limma uses voom to transform the data and has a weight matrix. Other two package require count data. If I want to use these three packages to analyze the data, what should I do? If the three packages are not available, then what step should I do to process the data? Thanks.