Remove Batch Effect From RNAseq with SVAseq and Combat
2
15
Entering edit mode
7.3 years ago
cankutcubuk ▴ 190

Hi,

I need your help and comments to remove batch effect from RNA-Seq data.

In total I have almost 1000 RNA-Seq data of different cancer types. Each cancer type has control and tumor groups. There is 1 batch covariate. The idea is to remove the batch effect from all data at once.

First question: I want to remove the batch effect at once for all samples, does this make sense? Or should I remove the batch effect from each cancer type separately?

Second Question: How should I construct my model matrice properly?

Until now I used Combat and SVAseq. And for SVAseq, I tried different model matrices. Below I shared the results with you.

• my_batch = "3_Lab" have 3 levels as "Laboratory-A", "Laboratory-B" and " Laboratory-C"
• tumor_normal has 2 levels as "tumor" and "normal"
• cancer_type has 9 levels as "Cancer1", "Cancer2",...,"Cancer9"
• All_cancer is data matrix of the read counts of all samples

# For Combat

my_DGEList <- DGEList(counts=All_cancer)
my_mod = model.matrix(~as.factor(tumor_normal))
All_cancer_norm <- calcNormFactors(my_DGEList, method="upperquartile")
my_data = cpm(All_cancer_norm, log=TRUE, prior.count=2)
combat <- ComBat(dat=my_data, batch=my_batch, mod=my_mod)


# For SVAseq

my_data2 <-log(as.matrix(All_cancer) +1)
mod1 = model.matrix(~as.factor(tumor_normal))
mod0 = cbind(mod1[,1])
my_n_sv = num.sv(my_data2,mod1,method="leek")
my_svseq = svaseq(my_data2,mod1,mod0,n.sv=my_n_sv)
my_sv<-my_svseq\$sv


.....remove these surrogate variables(my_sv) from data. The same approach for SVAseq was repated with different model matrices.

The function to remove surrogate variables which I used can be found here: Remove Surrogate Variables and Batch

##### Results #####

### Uncorrected (Data with batch effect)

As you can see left plot colored by batch covariate and there are clusters based on the batch.

http://i57.tinypic.com/dfuk37.png

### Corrected Data

## Combat

http://i60.tinypic.com/rrmnp1.png

## svaseq no: 1

mod1 = model.matrix(~as.factor(tumor_normal))
my_n_sv=5


http://i57.tinypic.com/fupc35.png

## svaseq no: 2

mod1 = model.matrix(~as.factor(3_Lab))
my_n_sv=4


http://i58.tinypic.com/sqh93n.png

## svaseq no: 3

mod1 = model.matrix(~as.factor(tumor_normal) + as.factor(3_Lab))
my_n_sv=4


http://i61.tinypic.com/2zjj3er.png

## svaseq no: 4

mod1 = model.matrix(~as.factor(3_Lab)+as.factor(cancer_type))


gives error:

Error in solve.default(t(mod) %*% mod) :
system  is computationally singular: reciprocal condition number = 1.36348e-18


## svaseq no: 5

mod1 = model.matrix(~as.factor(tumor_normal)+as.factor(cancer_type))
my_n_sv=1


http://i62.tinypic.com/14b1d06.png

## svaseq no: 6

mod1 = model.matrix(~as.factor(tumor_normal)+as.factor(3_Lab)+as.factor(cancer_type)))


gives error:

Error in solve.default(t(mod) %*% mod) :
system is computationally singular: reciprocal condition number = 3.45223e-18


## svaseq no: 7

mod1 = model.matrix(~as.factor(cancer_type))
my_n_sv=1


http://i61.tinypic.com/2wom4gg.png

Cheers
Cankut

RNA-Seq Combat SVAseq Batch-Effect model.matrix • 22k views
0
Entering edit mode

I am wondering why you are log transforming the data that you input to svaseq? Isn't it already transforming it internally? Would it effect the results?

1
Entering edit mode

Hi Yasin, Sorry for my late reply. You are right. svaseq does it internally. I missed this part. You can skip the log transformation before using svaseq. Taken from svaseq: The function takes log(dat + constant) before performing sva. By default constant = 1, all values of dat + constant should be positive.

Cheers Cankut

0
Entering edit mode

@cankutcubuk and @yasinkaymaz Since the data has internally been transformed to log format, does this mean we can not analyze the data using DESeq2 or edgeR or Limma? Limma uses voom to transform the data and has a weight matrix. Other two package require count data. If I want to use these three packages to analyze the data, what should I do? If the three packages are not available, then what step should I do to process the data? Thanks.

3
Entering edit mode
7.3 years ago
raunakms ★ 1.1k

I would really suggest that instead of removing the batch effect from all the samples at once, try to remove batch effect from each cancer type separately and merge the results later. Remember that there are two thing going on simultaneously (biological variability) + (batch effect). The key thing to remember is that you do not want to mess-up too much with the biological variability inherent in the samples while correcting for the batch effects. You can check this by using a simple PCA based clustering of you samples before or after correcting for batch effect. Check how the batch labels and phenotype labels (normal-tumor) are clustered and judge for yourself if the batch correcting tool really removed the batch effect or messed up with the biological variation.

Another thing to keep in mind is that such batch effect correction may or may not be useful. It may work for some cancer-types and may-not work for other types.There are times when correcting batch effect correction was not useful for my data. There is no definite answer accepting or rejecting a batch effect correction.

0
Entering edit mode

Cheers

1
Entering edit mode
5.9 years ago
amoltej ▴ 90

Hi, did you find right answer for the question? which way is right? and what if n.sv. number is really big? I have 88 samples. when I am using "leek" method to calculate n.sv = 85 but with "be" method n.sv=1. I am not understanding why there is so much difference? And as Raunakms commented, it should be handled treatment wise to remove batch effect and not as a whole data set, it seems logical. but did you find it useful when doing it? Thanx

0
Entering edit mode

Hi,did you find out the reason why n.sv. number is really big when use "leek" method but small with "be" method? thanks