Question

Unsure whether or not I overcorrected for batch effect in dataset

1

Entering edit mode

5 months ago

Megan ▴ 60

Hi,

I am doing a timecourse study looking at the differences between sexs and time. I had noticed from my PCA plot that there was a lot of large group variation within my biological replicates.

enter image description here

It looks to me that some of the points are clustering together. I performed a batch effect to account for unknown sources of variation and created a new PCA plot to examine the clustering within my samples. However, I am not sure whether I overcorrected for any possible unknown variation within my data.

enter image description here

My code is attached below.

```{r setup, include=FALSE}



Brown_Fat <- gene_fpkm %>%
  select(gene_id,
    MBF2_2, MBF2_3, MBF6_1, MBF6_2, MBF6_3, MBF10_1, MBF10_2, MBF10_3, MBF14_1, MBF14_2,
    MBF14_3, MBF18_1, MBF18_2, MBF18_3, MBF22_1, MBF22_2, MBF22_3, FBF2_1, FBF2_2, FBF2_3, FBF6_1,
    FBF6_2, FBF6_3, FBF10_1, FBF10_2, FBF10_3, FBF14_1, FBF14_2, FBF14_3, FBF18_1, FBF18_2,
    FBF18_3, FBF22_1, FBF22_2, FBF22_3) %>%
    column_to_rownames(var = 'gene_id') 

keep <- rowMeans(Brown_Fat) > 0.5 # e.g., 1 or 0.5
BrownFat_filtered <- Brown_Fat[keep, ]
BrownFat_filtered_log2RPKM <- log2(BrownFat_filtered + 1)

samples <- c(
  "MBF2_2", "MBF2_3", "MBF6_1", "MBF6_2", "MBF6_3", "MBF10_1", "MBF10_2", "MBF10_3",
  "MBF14_1", "MBF14_2", "MBF14_3", "MBF18_1", "MBF18_2", "MBF18_3", "MBF22_1", "MBF22_2", "MBF22_3",
  "FBF2_1", "FBF2_2", "FBF2_3", "FBF6_1", "FBF6_2", "FBF6_3", "FBF10_1", "FBF10_2", "FBF10_3",
  "FBF14_1", "FBF14_2", "FBF14_3", "FBF18_1", "FBF18_2", "FBF18_3", "FBF22_1", "FBF22_2", "FBF22_3"
)

metadata <- data.frame(
  Sample = samples,
  Sex = ifelse(substr(samples, 1, 1) == "M", "Male", "Female"),
  Time = as.integer(gsub("^.{3}(\\d+)_.*$", "\\1", samples)),  # Extract digits after first 3 characters
  stringsAsFactors = FALSE
)

metadata_v2 <- data.frame(
  Sample = samples,
  Sex = ifelse(substr(samples, 1, 1) == "M", "Male", "Female"),
  Time = as.integer(gsub("^.{3}(\\d+)_.*$", "\\1", samples)),  # Extract digits after first 3 characters
  stringsAsFactors = FALSE
)


metadata_v2 <- metadata_v2 %>%
  column_to_rownames(var = 'Sample')


metadata_v2$Time <- as.factor(metadata_v2$Time)


PCA_Uncorrected <- t(BrownFat_filtered_log2RPKM)
pca_res_uncorrected <- prcomp(PCA_Uncorrected, scale. = TRUE)

autoplot(pca_res_uncorrected, data = metadata, color = "Time", shape = 'Sex' )

cor_matrix <- cor(BrownFat_filtered_log2RPKM, method = "pearson")  # or "spearman"

# Visualize with pheatmap
pheatmap(cor_matrix,
         main = "Sample-to-Sample Correlation",
         annotation_col = metadata_v2,
          annotation_row = metadata_v2)


```

```{r setup, include=FALSE}



rownames(metadata) <- colnames(Brown_Fat) 

metadata$Time <- as.factor(metadata$Time)
metadata$Replicate <- as.factor(metadata$Replicate)


form <- ~  (1 | Sex ) + (1 | Time) + (1 | Replicate) + (1 | Sex:Time)


varPart <- fitExtractVarPartModel(BrownFat_filtered, form, metadata)

vp <- sortCols(varPart)

library("limma")

plotVarPart(vp)


expression_data <- as.matrix(log2(BrownFat_filtered + 1))

mod <- model.matrix(~ Sex + as.factor(Time), data = metadata)
mod0 <- model.matrix(~1, data = metadata)

n.sv <- num.sv(expression_data,mod, method = c("be"), B = 20)

svobj = sva(expression_data,mod,mod0,n.sv= 3)

expression_corrected <- removeBatchEffect(expression_data, covariates = svobj$sv, design = mod)

dat_t <- t(expression_corrected)
pca_res <- prcomp(dat_t, scale. = TRUE)

autoplot(pca_res, data = metadata, color = "Time", shape = 'Sex' )

cor_matrix <- cor(expression_corrected, method = "pearson")  # or "spearman"

# Visualize with pheatmap
pheatmap(cor_matrix,
         main = "Sample-to-Sample Correlation",
         annotation_col = metadata_v2,
         annotation_row = metadata_v2)

```

Quality Plot PCA Control • 712 views

ADD COMMENT • link updated 5 months ago by GenoMax 154k • written 5 months ago by Megan ▴ 60

0

Entering edit mode

It would help if you provided your code so others know what steps you took to batch-correct your data