Question

missing values for batch in RNA-seq data

8 hours ago

donan • 0

Hi everyone,

I am quite new to RNA-seq data analysis, and I am struggling to account for batch effects. I have RNA-seq expression data from three different batches, and I want to compare cancer cell lines with normal cell lines to find the genes that differ most between them.

I have 40 cancer cell lines (this number includes replicates) and 9 normal cell lines. Cell lines have 1, 2, or 3 replicates. Batch information is available for most samples, except for 3 of the 40 cancer cell lines and for all 9 normal cell lines, so I have 12 missing values for batch.

The data are already normalized (I don't know which normalization was used), so I cannot run a DESeq analysis. I performed a PCA on the data, and PC2 (7% of the variance, versus 20% for PC1) shows some batch effect.

I tried to use limma with batch as a covariate, but limma does not handle missing values in the design matrix, and I have 12 NAs for batch. I cannot remove those samples, as that would remove all the normal cell lines. The options I've found are removing batch from the model (but then how do I account for the batch effect?), imputing the missing values (is that a good idea for categorical data, and for batch in particular?), or adding a batchNA level. For cell lines with replicates, I could also average the gene expression, but I'm not sure that would remove the batch effect.

Do you have any suggestions on how to handle this?

Thank you in advance for any advice!

RNA-seq limma deseq batch-effect batch • 82 views

updated 7 hours ago by LChart 5.1k • written 8 hours ago by donan • 0

score 0 · Answer 1 · 2025-10-20

You have a near-perfect confound between missingness and condition, and no batch information at all for your normal group. The closest thing you have to a solution is adding a batchNA level, but this will not correct for any batch effects; it only captures the aggregate difference of the "no-batch" group from the overall mean. Beyond that, there is really nothing you can do except proceed with the analysis, with the caveat that you cannot tell whether the results reflect biological differences or batch effects.
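
To make the confound concrete, here is a minimal sketch of the design matrix with a batchNA level. It is in plain Python so it runs without any packages; in R the same matrix would come from `model.matrix`. The 13/12/12 split of the 37 cancer samples with known batch is an assumption made up for illustration (the question does not give it), as are the names `design_row` and `matrix_rank`. Only the counts of 3 cancer and 9 normal samples without batch come from the question.

```python
from fractions import Fraction

def design_row(condition, batch):
    """One row of an additive design: intercept, cancer, batch B2, B3, NA (B1 is the reference)."""
    return [1,
            int(condition == "cancer"),
            int(batch == "B2"),
            int(batch == "B3"),
            int(batch == "NA")]

def matrix_rank(rows):
    """Rank by exact Gaussian elimination (Fractions avoid floating-point tolerance issues)."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below the current rank with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            if m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

# Hypothetical sample sheet: the B1/B2/B3 split is assumed, the NA counts are from the question.
samples = (
    [("cancer", "B1")] * 13
    + [("cancer", "B2")] * 12
    + [("cancer", "B3")] * 12
    + [("cancer", "NA")] * 3   # cancer samples with missing batch
    + [("normal", "NA")] * 9   # every normal sample has missing batch
)

design = [design_row(c, b) for c, b in samples]
n_coef = len(design[0])
rank_all = matrix_rank(design)

# Same design without the 3 cancer samples that lack batch information.
design_no_na_cancer = [design_row(c, b) for c, b in samples
                       if not (c == "cancer" and b == "NA")]
rank_without_na_cancer = matrix_rank(design_no_na_cancer)

print("coefficients:", n_coef)
print("rank with all samples:", rank_all)
print("rank without the NA-batch cancer samples:", rank_without_na_cancer)
```

Under these assumptions the design is full rank only because of the 3 cancer samples with missing batch. Without them, the batchNA column equals one minus the cancer column, and the condition effect cannot be estimated at all. With them, the additive model identifies the cancer coefficient through the comparison inside the batchNA group (3 cancer versus 9 normal samples), and it does so by treating all NA samples as one batch, which is exactly what is unknown here.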