Negative values after batch correction using removeBatchEffect from Limma
1
0
Entering edit mode
9 weeks ago
ev97 • 0

I am trying to correct my RNA seq data for 3 categorical variables as well as preserve the biological information of the dataset. In order to do that, I have used the removeBatchEffect function from limma.

I used a log2(TPM counts + 1) matrix as my input but... as you can see in the following screenshot, the corrected counts that I get have negative values. Since it doesn't make sense to have genes with a negative expression, is there any way to solve this? I don't know if the output is log2(tpm) and I just have to do +1 or unlog the corrected data... In any case, I cannot remove the genes with negative values, I would like to keep all of them if it is possible. Does anybody know how to solve this?

Thanks very much in advance

Code:

set.seed(11344)

covariates_df <- data.frame(
Sample = paste0("A", seq(1,960)),
Age = sample(seq(18, 70), size = 960, replace = TRUE),
Group = sample(seq(1, 3), size = 960, replace = TRUE),
Sex = sample(seq(1, 2), size = 960, replace = TRUE),
WBC = sample(seq(0.5, 35, by=0.3), size = 960, replace = TRUE),
Place = sample(seq(1, 2), size = 960, replace = TRUE),
Tubes = sample(seq(1, 2), size = 960, replace = TRUE),
LibPrep = sample(seq(1, 16), size = 960, replace = TRUE)
)
rownames(covariates_df) <- covariates_df$Sample categorical_variables <- c("Sample", "Group", "Sex", "Place", "Tubes", "LibPrep") covariates_df[,categorical_variables] <- lapply(covariates_df[,categorical_variables],as.factor) >str(covariates_df) 'data.frame': 960 obs. of 8 variables:$ Sample : Factor w/ 960 levels "A1","A10","A100",..: 1 112 223 334 445 556 667 778 889 2 ...
$Age : int 40 66 37 40 25 42 46 51 45 61 ...$ Group  : Factor w/ 3 levels "1","2","3": 2 2 2 3 2 2 3 3 3 3 ...
$Sex : Factor w/ 2 levels "1","2": 1 2 2 1 1 2 1 1 2 1 ...$ WBC    : num  12.8 29.9 25.7 0.8 30.8 5.9 30.5 19.1 2.9 12.2 ...
$Place : Factor w/ 2 levels "1","2": 1 1 2 1 2 1 2 2 1 2 ...$ Tubes  : Factor w/ 2 levels "1","2": 1 1 2 1 2 2 2 2 1 2 ...
$LibPrep: Factor w/ 16 levels "1","2","3","4",..: 14 1 13 4 10 5 1 8 7 5 ... dim(log2TPM_1_counts_mat)  60000 960 > str(log2TPM_1_counts_mat) num [1:60000, 1:960] 0.51 0 2.76 2.64 2.26 ... - attr(*, "dimnames")=List of 2 ..$ : chr [1:60000] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457" ...
..\$ : chr [1:960] "A1" "A2" "A3" "A4" ...

### I wanted to upload the counts data but it is too huge and I didn't manage. I tried to create a small counts data, but I wasn't getting any negative values as I get in reality in order to show you my problem.


Design of the model:

design <- model.matrix(~Age + Group + Sex + WBC + Place + Tubes + LibPrep, data=covariates_df)
bio.design <- design[,1:6]
batch.design <- design[,-(1:6)]


Adjustment with removeBatchEffect:

corrected_cts <- limma::removeBatchEffect(log2TPM_1_counts_mat,design=bio.design,covariates=batch.design)

rnaseq batch batcheffect removeBatchEffect limma • 529 views
2
Entering edit mode
9 weeks ago
LChart 840

For a downstream analysis like clustering or WGCNA, I would recommend against altering the data. In this case, gene expression values have been replaced with imputed gene expression values (i.e., what would the expression have been, had the sample originated from the reference tube); since the imputation is done on the linear scale it's not bad or un-biological to have a negative value. Note that the vast majority of the negative counts will come from genes or transcripts with low mean expression in the first place.

You could also consider quantile normalizing the corrected log1p values to the original log1p distribution. A quick-and-dirty way to do this is

quantile.normalize(x, target) {
sort(target)[rank(x)]
}

expr.adj <- matrix(0, nrow=nrow(expr), ncol=ncol(expr))
for ( i in 1:ncol(expr.adj) ) {
expr.adj[,i] <- quantile.normalize(expr.corrected[,i], expr[,i])
}


This preserves the ranks of the corrected data, but retains the original distribution of the un-corrected data. Note that this will "undo" some of the batch correction, particularly for genes with many negative values.

For my own analyses, I live with the negative values.

0
Entering edit mode

I think I will continue with the negative values, because I don't want to transform more the data than batch correction does... but thanks for giving other alternative and explain how to do it, I really appreciate it.

0
Entering edit mode

What is "log1p"?

0
Entering edit mode