Hi everyone,
I am quite new to RNA-seq data analysis, and I am struggling to account for batch effects. I have RNA-seq expression data from three different batches, and I want to compare cancer cell lines and normal cell lines to find the most important genes.
I have 40 cancer cell line samples (this number includes replicates) and 9 normal cell line samples. Each cell line has 1, 2, or 3 replicates. Batch information is available for most samples, but it is missing for 3 of the 40 cancer samples and for all 9 normal samples, so I have 12 missing values for batch.
The data are already normalized (I don't know which normalization was used), so I cannot run a DESeq analysis, which needs raw counts. I performed a PCA on the data, and PC2 (7% of the variance; PC1 explains 20%) shows some batch effect.
I tried to use limma with batch as a covariate, but limma does not handle missing values in the design matrix, and I have 12 NAs for batch. I cannot simply drop those samples, as that would remove all the normal cell lines. The options I have found are: removing batch from the model (but then how do I account for the batch effect?), imputing the missing values (is that a good idea for categorical data, and for batch in particular?), or adding a separate batchNA level. For cell lines with replicates, I could also average the gene expression across replicates, but I am not sure that would remove the batch effect.
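To make the batchNA option concrete, here is a minimal sketch of the resulting design matrix, written in Python with NumPy rather than limma. The totals (40 cancer, 9 normal, 12 unknown batches) are the ones described above; the 13/12/12 split of the remaining 37 cancer samples across the three batches, the level names `b1`/`b2`/`b3`, and the helper `design_matrix` are assumptions made up for illustration only.

```python
import numpy as np

# Totals come from the description above; the 13/12/12 split across the
# three known batches is a made-up assumption, as are the level names.
condition = np.array(["cancer"] * 40 + ["normal"] * 9)
batch = np.array(
    ["b1"] * 13 + ["b2"] * 12 + ["b3"] * 12  # cancer samples with known batch
    + ["batchNA"] * 3                         # cancer samples with unknown batch
    + ["batchNA"] * 9                         # normal samples, all unknown batch
)


def design_matrix(condition, batch):
    """Intercept + condition (normal vs. cancer) + treatment-coded batch."""
    levels = sorted(set(batch))
    cols = [np.ones(len(condition)), (condition == "normal").astype(float)]
    # The first batch level is the reference and gets no column of its own.
    cols += [(batch == lvl).astype(float) for lvl in levels[1:]]
    names = ["intercept", "normal"] + [f"batch_{lvl}" for lvl in levels[1:]]
    return np.column_stack(cols), names


X, names = design_matrix(condition, batch)
rank = np.linalg.matrix_rank(X)

# Only the batchNA level contains both conditions.
in_na = batch == "batchNA"
n_cancer_na = int(np.sum(in_na & (condition == "cancer")))
n_normal_na = int(np.sum(in_na & (condition == "normal")))

# Without the 3 cancer samples of unknown batch, the "normal" column and
# the "batch_batchNA" column become identical, so the design loses rank.
keep = ~(in_na & (condition == "cancer"))
X_drop, _ = design_matrix(condition[keep], batch[keep])
rank_drop = np.linalg.matrix_rank(X_drop)

print(names)
print("columns:", X.shape[1], "rank:", rank)
print("batchNA samples: cancer =", n_cancer_na, ", normal =", n_normal_na)
print("rank without the cancer batchNA samples:", rank_drop)
```

Under these assumptions the design with a batchNA level is full rank, but only because 3 cancer samples share that level with the normals. In this additive model the cancer-versus-normal coefficient is then estimated from the batchNA samples alone (3 cancer against 9 normal), and treating batchNA as a single level also assumes those 12 samples behave like one batch, which may not be true.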
Do you have any suggestions on how to handle this?
Thank you in advance for any advice!