Tutorial:removeBatchEffect explained using base R linear models
0
5
Entering edit mode
7 weeks ago
nhaus ▴ 360

Hi all,

I always had a some difficulties to understand what removing the effect of a covariates actually means and what happens "under the hood". So I decided to make a small R example that shows what limma::removeBatchEffect does, by "implementing" the method just using base R linear models. It really helped me to understand so I though I would share it:

### Batch correction while preserving a biological effect

library(limma)

# Parameters
num_genes <- 100
num_samples <- 12

# Simulate some data: 100 genes and 12 samples
# Base expression levels
base_expression <- matrix(rnorm(num_genes * num_samples, mean = 10, sd = 3), nrow = num_genes, ncol = num_samples)
rownames(base_expression) <- paste("Gene", 1:num_genes)
colnames(base_expression) <- paste("Sample", 1:num_samples)

# Create batch factors
batch <- factor(rep(c("Batch1", "Batch2", "Batch3"), 4))

# Create a biological factor (treatment)
treatment <- factor(rep(c("Control", "Treatment"), 6))

# Apply consistent batch effects
batch_effects <- c(5, -4, 9) # Different consistent effects for each batch
batch_matrix <- matrix(batch_effects[as.integer(batch)], nrow = num_genes, ncol = num_samples, byrow = TRUE)

# Apply consistent treatment effects
treatment_effects <- c(0, 15) # No effect for control, positive effect for treatment
treatment_matrix <- matrix(treatment_effects[as.integer(treatment)], nrow = num_genes, ncol = num_samples, byrow = TRUE)

expression <- base_expression + batch_matrix + treatment_matrix

# Manual adjustment using lm() preserving biological condition
adjusted_expression_manual <- apply(expression, 1, function(gene_expression) {
full_model <- lm(gene_expression ~ batch + treatment)
reduced_model <- lm(gene_expression ~ treatment)
# isolate the component of the gene expression predictions that is due to the batch effect.
# Is the portion of gene expression that can be attributed solely to the batch, independent of the biological variable (treatment).
batch_effect <- full_model$fitted.values - reduced_model$fitted.values
adjusted_expression <- gene_expression - batch_effect # remove the batch effect
})

# Convert adjusted expression to a matrix for easier handling

# Adjust using limma's removeBatchEffect while preserving biological effects
design_matrix <- model.matrix(~treatment)
adjusted_expression_limma <- removeBatchEffect(expression, batch = batch, design = design_matrix)

# Comparing the two approaches
print("Difference between Manual and limma Approaches:")


### Batch correction if you do not want to preserve anything

# Parameters
num_genes <- 100
num_samples <- 12

# Simulate some data: 100 genes and 12 samples
# Base expression levels
base_expression <- matrix(rnorm(num_genes * num_samples, mean = 10, sd = 3), nrow = num_genes, ncol = num_samples)
rownames(base_expression) <- paste("Gene", 1:num_genes)
colnames(base_expression) <- paste("Sample", 1:num_samples)

# Create batch factors
batch <- factor(rep(c("Batch1", "Batch2", "Batch3"), 4))

# Create a biological factor (treatment)
treatment <- factor(rep(c("Control", "Treatment"), 6))

# Apply consistent batch effects
batch_effects <- c(5, -4, 9) # Different consistent effects for each batch
batch_matrix <- matrix(batch_effects[as.integer(batch)], nrow = num_genes, ncol = num_samples, byrow = TRUE)

# Apply consistent treatment effects
treatment_effects <- c(0, 15) # No effect for control, positive effect for treatment
treatment_matrix <- matrix(treatment_effects[as.integer(treatment)], nrow = num_genes, ncol = num_samples, byrow = TRUE)

expression <- base_expression + batch_matrix + treatment_matrix
# Manual adjustment using lm() preserving biological condition
adjusted_expression_manual <- apply(expression, 1, function(gene_expression) {
model <- lm(gene_expression ~ batch)
})

# Convert adjusted expression to a matrix for easier handling

# Adjust using limma's removeBatchEffect while preserving biological effects
design_matrix <- model.matrix(~treatment)
batch = batch,
design = matrix(1, ncol(expression), 1) # essentially says, there are no experimental conditions, all belong to same category
)

# Print the results

# Comparing the two approaches
print("Difference between Manual and limma Approaches:")
print(sum(comparison))

limma effects batch removebatcheffects • 297 views
2
Entering edit mode

(+1) "implementing" the method just using base R linear models.: I find implementing from basic starting point to be one of the best, if not the best, way to learn what a procedure does.