Question

Negative values after batch correction using removeBatchEffect from Limma

0

Entering edit mode

21 months ago

ev97 ▴ 20

I am trying to correct my RNA seq data for 3 categorical variables as well as preserve the biological information of the dataset. In order to do that, I have used the removeBatchEffect function from limma.

I used a log2(TPM counts + 1) matrix as my input but... as you can see in the following screenshot, the corrected counts that I get have negative values. Since it doesn't make sense to have genes with a negative expression, is there any way to solve this? I don't know if the output is log2(tpm) and I just have to do +1 or unlog the corrected data... In any case, I cannot remove the genes with negative values, I would like to keep all of them if it is possible.

corrected_cts

Does anybody know how to solve this?

Thanks very much in advance

Code:

set.seed(11344)

covariates_df <- data.frame(
  Sample = paste0("A", seq(1,960)),
  Age = sample(seq(18, 70), size = 960, replace = TRUE),
  Group = sample(seq(1, 3), size = 960, replace = TRUE),
  Sex = sample(seq(1, 2), size = 960, replace = TRUE),
  WBC = sample(seq(0.5, 35, by=0.3), size = 960, replace = TRUE),
  Place = sample(seq(1, 2), size = 960, replace = TRUE),
  Tubes = sample(seq(1, 2), size = 960, replace = TRUE),
  LibPrep = sample(seq(1, 16), size = 960, replace = TRUE)
)
rownames(covariates_df) <- covariates_df$Sample

categorical_variables <- c("Sample", "Group", "Sex", "Place", "Tubes", "LibPrep")
covariates_df[,categorical_variables] <- lapply(covariates_df[,categorical_variables],as.factor)

str(covariates_df)

'data.frame':   960 obs. of  8 variables:
 $ Sample : Factor w/ 960 levels "A1","A10","A100",..: 1 112 223 334 445 556 667 778 889 2 ...
 $ Age    : int  40 66 37 40 25 42 46 51 45 61 ...
 $ Group  : Factor w/ 3 levels "1","2","3": 2 2 2 3 2 2 3 3 3 3 ...
 $ Sex    : Factor w/ 2 levels "1","2": 1 2 2 1 1 2 1 1 2 1 ...
 $ WBC    : num  12.8 29.9 25.7 0.8 30.8 5.9 30.5 19.1 2.9 12.2 ...
 $ Place  : Factor w/ 2 levels "1","2": 1 1 2 1 2 1 2 2 1 2 ...
 $ Tubes  : Factor w/ 2 levels "1","2": 1 1 2 1 2 2 2 2 1 2 ...
 $ LibPrep: Factor w/ 16 levels "1","2","3","4",..: 14 1 13 4 10 5 1 8 7 5 ...


dim(log2TPM_1_counts_mat)
[1] 60000   960

str(log2TPM_1_counts_mat)
 num [1:60000, 1:960] 0.51 0 2.76 2.64 2.26 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:60000] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457" ...
  ..$ : chr [1:960] "A1" "A2" "A3" "A4" ...

### I wanted to upload the counts data but it is too huge and I didn't manage. I tried to create a small counts data, but I wasn't getting any negative values as I get in reality in order to show you my problem.

Design of the model:

design <- model.matrix(~Age + Group + Sex + WBC + Place + Tubes + LibPrep, data=covariates_df)
bio.design <- design[,1:6]
batch.design <- design[,-(1:6)]

Adjustment with removeBatchEffect:

corrected_cts <- limma::removeBatchEffect(log2TPM_1_counts_mat,design=bio.design,covariates=batch.design)

batch limma removeBatchEffect rna-seq batch-effect • 2.2k views

ADD COMMENT • link updated 9 days ago by Ram 43k • written 21 months ago by ev97 ▴ 20

score 2 · Accepted Answer · 2022-07-29

2

Entering edit mode

21 months ago

LChart 3.9k

For a downstream analysis like clustering or WGCNA, I would recommend against altering the data. In this case, gene expression values have been replaced with imputed gene expression values (i.e., what would the expression have been, had the sample originated from the reference tube); since the imputation is done on the linear scale it's not bad or un-biological to have a negative value. Note that the vast majority of the negative counts will come from genes or transcripts with low mean expression in the first place.

You could also consider quantile normalizing the corrected log1p values to the original log1p distribution. A quick-and-dirty way to do this is

quantile.normalize(x, target) {
  sort(target)[rank(x)]
}

expr.adj <- matrix(0, nrow=nrow(expr), ncol=ncol(expr))
for ( i in 1:ncol(expr.adj) ) {
  expr.adj[,i] <- quantile.normalize(expr.corrected[,i], expr[,i])
}

This preserves the ranks of the corrected data, but retains the original distribution of the un-corrected data. Note that this will "undo" some of the batch correction, particularly for genes with many negative values.

For my own analyses, I live with the negative values.

ADD COMMENT • link 21 months ago by LChart 3.9k

0

Entering edit mode

Thanks very much for your reply.

I think I will continue with the negative values, because I don't want to transform more the data than batch correction does... but thanks for giving other alternative and explain how to do it, I really appreciate it.

ADD REPLY • link 21 months ago by ev97 ▴ 20

0

Entering edit mode

What is "log1p"?

ADD REPLY • link 21 months ago by Ram 43k

0

Entering edit mode

Abbreviation for log(1+x); from https://numpy.org/doc/stable/reference/generated/numpy.log1p.html

ADD REPLY • link 21 months ago by LChart 3.9k