Dear all,
I’m performing a pseudobulk analysis with DESeq2, combining my own scRNA-seq data with publicly available scRNA-seq datasets.
Naturally, there are batch effects between the two datasets (e.g., nFeatureCount, percent mitochondrial/ribosomal RNA), and my dataset shows lower mitochondrial and ribosomal RNA percentages.
I'm running the DEG analysis on the raw count matrix from all data combined. In the initial DEG analysis, many mitochondrial, ATP, and translation-related pathways appeared enriched among the downregulated pathways in my dataset.
So, in the next step, I calculated the mean mitochondrial and ribosomal RNA percentages per sample, added them to the sample metadata, and included them as covariates in the design formula: design = ~ percent_mito + percent_ribo + group
This reduced, but did not eliminate, the mitochondrial and ATP-related pathways among the downregulated results.
Now I’m unsure whether these findings reflect true biological differences or technical artifacts.
In such cases, what is the best way to correct for batch effects?
Thanks for reading.
Does the design even allow batch correction? That is, are there replicates of all groups in all datasets? If not, then including %mt and similar covariates is just homeopathic cosmetics. It won't solve the confounded design, so it's better to drop the analysis right now rather than waste time.
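One quick way to answer the replicate question is to cross-tabulate group against dataset and list the combinations that have too few samples. Below is a minimal sketch in plain Python. The sample sheet, its column names (`sample`, `dataset`, `group`), the helper `underfilled_cells`, and the `min_reps=2` threshold are all hypothetical placeholders for illustration, not taken from the actual data in this thread.

```python
from collections import Counter
from itertools import product

# Hypothetical sample sheet: one row per pseudobulk sample.
# Here each group occurs in only one dataset (a fully confounded layout).
samples = [
    {"sample": "own_1", "dataset": "own", "group": "A"},
    {"sample": "own_2", "dataset": "own", "group": "A"},
    {"sample": "pub_1", "dataset": "public", "group": "B"},
    {"sample": "pub_2", "dataset": "public", "group": "B"},
]

def underfilled_cells(samples, min_reps=2):
    """Return (group, dataset) combinations with fewer than min_reps samples."""
    counts = Counter((s["group"], s["dataset"]) for s in samples)
    groups = sorted({s["group"] for s in samples})
    datasets = sorted({s["dataset"] for s in samples})
    # Counter returns 0 for combinations that never occur.
    return [(g, d) for g, d in product(groups, datasets) if counts[(g, d)] < min_reps]

missing = underfilled_cells(samples)
print(missing)  # [('A', 'public'), ('B', 'own')]
```

An empty list means every group has replicates in every dataset. If each group shows up in only one dataset, as in the toy sheet above, the group effect and the dataset effect cannot be separated by any covariate in the design.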
Thanks for your reply, your point is completely valid.
I agree that this doesn't resolve the core issue of confounding, as my dataset is essentially a sparse subset of the populations in the public dataset.
Appreciate the insight!