Hi everyone,
I am quite new to RNA-seq data analysis, and I am struggling with accounting for batch effect. I have RNA-seq expression data from three different batches, and I want to compare cancer cell lines and normal cell lines to find the most important genes.
I have 40 cancer cell lines (this number includes replicates) and 9 normal cell lines, and I would like to compare cancer and normal cell lines. Cell lines have 1, 2, or 3 replicates, and the batch information is available for most samples, except for 3/40 cancer cell lines and for all normal cell lines, so I have 12 missing value for the batches.
Data are already normalized (I don't know which normalization was used) so I can not perform DESeq analysis. I performed a PCA on the data, and by looking at PC2 (7% of the variance, 20% for PC1), I can see some batch effect.
I tried to use limma to include the batch as a covariate, however limma does not handle missing values in the design matrix, and I have 12 NA for the batches. I can not remove them, as it would remove all the normal cell lines. The solutions I've found are removing batch from the model (but then how to account for batch effect?), impute missing values (is it a good idea for categorical data, specifically for the batch?), adding a batchNA level. For cell lines having replicates, I could also average the gene expression, but I'm not sure it would remove the batch effect.
Do you have any suggestion on how to handle this?
Thank you in advance for any advice!
Is this your data or did you wildly collect them from different sources like GEO? Mind that there is almost no "normal" cell lines. There are few exceptions of non malignantly-transformed immortal cell lines, such as Hox overexpression or MEF cells, but most are some sort of cancer. Can you elaborate?
Thank you for your answer. I am quite new in this project and still getting familiar with RNA-seq data, so I may lack technical precision. The data come from one source (another lab), unfortunately I have no other information, I only know these are immortalized normal human cells.
You have to be careful with these sorts of analysis. Cancer cell lines are often in culture for years or decades, while these others could be relatively new. Likely, the "top genes" will be those representing the adaptation to the culture conditions, and not necessarily any biologically meaningful signal. I would start exploring data via PCA to see whether the batch information explains any obvious separation. With cell lines you have tons of confounders, maybe the batch is the least of your problems. Lets go and see. For example,
plotPCA
from DESeq2.Thank you very much for your help. I will try to have more information for the data, but right now I have to deal with this lack of information. When plotting PC1 vs PC2, there is no obvious visual cluster for each batch. When looking at a boxplot for the batches for PC1 (20% of the variance), I can visually see some variability, which becomes more obvious when looking at the second component (7% of the variance) in which batch 1 and 3 seem to have the same distribution, while the NA batch for cancer cell lines seem to match with batch 2 (and normal cell lines have a completely different value for PC2 compared to cancer cell lines). So there is a batch effect, but I don't really know how to quantify it and reduce its effect.