Question

missing values for batch in RNA-seq data

0

Entering edit mode

21 hours ago

donan • 0

Hi everyone,

I am quite new to RNA-seq data analysis, and I am struggling with accounting for batch effect. I have RNA-seq expression data from three different batches, and I want to compare cancer cell lines and normal cell lines to find the most important genes.

I have 40 cancer cell lines (this number includes replicates) and 9 normal cell lines, and I would like to compare cancer and normal cell lines. Cell lines have 1, 2, or 3 replicates, and the batch information is available for most samples, except for 3/40 cancer cell lines and for all normal cell lines, so I have 12 missing value for the batches.

Data are already normalized (I don't know which normalization was used) so I can not perform DESeq analysis. I performed a PCA on the data, and by looking at PC2 (7% of the variance, 20% for PC1), I can see some batch effect.

I tried to use limma to include the batch as a covariate, however limma does not handle missing values in the design matrix, and I have 12 NA for the batches. I can not remove them, as it would remove all the normal cell lines. The solutions I've found are removing batch from the model (but then how to account for batch effect?), impute missing values (is it a good idea for categorical data, specifically for the batch?), adding a batchNA level. For cell lines having replicates, I could also average the gene expression, but I'm not sure it would remove the batch effect.

Do you have any suggestion on how to handle this?

Thank you in advance for any advice!

RNA-seq limma deseq batch-effect batch • 237 views

ADD COMMENT • link 9 hours ago by donan • 0

1

Entering edit mode

Is this your data or did you wildly collect them from different sources like GEO? Mind that there is almost no "normal" cell lines. There are few exceptions of non malignantly-transformed immortal cell lines, such as Hox overexpression or MEF cells, but most are some sort of cancer. Can you elaborate?

ADD REPLY • link 12 hours ago by ATpoint 89k

0

Entering edit mode

Thank you for your answer. I am quite new in this project and still getting familiar with RNA-seq data, so I may lack technical precision. The data come from one source (another lab), unfortunately I have no other information, I only know these are immortalized normal human cells.

ADD REPLY • link 10 hours ago by donan • 0

1

Entering edit mode

You have to be careful with these sorts of analysis. Cancer cell lines are often in culture for years or decades, while these others could be relatively new. Likely, the "top genes" will be those representing the adaptation to the culture conditions, and not necessarily any biologically meaningful signal. I would start exploring data via PCA to see whether the batch information explains any obvious separation. With cell lines you have tons of confounders, maybe the batch is the least of your problems. Lets go and see. For example, plotPCA from DESeq2.

ADD REPLY • link 9 hours ago by ATpoint 89k

0

Entering edit mode

Thank you very much for your help. I will try to have more information for the data, but right now I have to deal with this lack of information. When plotting PC1 vs PC2, there is no obvious visual cluster for each batch. When looking at a boxplot for the batches for PC1 (20% of the variance), I can visually see some variability, which becomes more obvious when looking at the second component (7% of the variance) in which batch 1 and 3 seem to have the same distribution, while the NA batch for cancer cell lines seem to match with batch 2 (and normal cell lines have a completely different value for PC2 compared to cancer cell lines). So there is a batch effect, but I don't really know how to quantify it and reduce its effect.

ADD REPLY • link 9 hours ago by donan • 0

score 1 · Answer 1 · 2025-10-20

1

Entering edit mode

20 hours ago

LChart 5.1k

You have a near-perfect confound between missingness and condition, and no batch information at all on your normal group. The closest thing you have to a solution is adding a batchNA level - but this will not correct for any batch effects; only the aggregate difference of the "no-batch" group from the overall mean. Beyond that, there really is nothing you can do except proceed with the analysis, with the caveat that you don't know if the results are due to biological differences or due to batch effects.

ADD COMMENT • link 20 hours ago by LChart 5.1k

0

Entering edit mode

Thank you for your answer. I also think the solution here is to add a batchNA level. I read about SVA to remove unwanted variation, such as batch effect. I ran it (from 1. to 6.) on my data, but I don't really know if my methodology is good or if I should really stick to adding a batchNA level and just run limma.

ADD REPLY • link 11 hours ago by donan • 0