Hi everyone, I am not a computational biologist. Having said that, I work with a lot of transcriptomic data, and thanks to tools like iDEP, I have been able to manage fairly well.
I have downloaded publicly available datasets (bulk RNA-seq) of different cell lines of a cell type: control vs LPS for every cell line. They all come from different labs. I understand that I have to adjust for batch effects, which I did using DEBrowser. I have 3 questions on this front:
1. What criteria best suit this kind of data in terms of batch correction? I used TMM and ComBat-seq. 'Batch' was just a number I added manually; each dataset got a number.
2. Do I move on to DEG and GO analysis using the batch-corrected values, or the original uncorrected raw counts?
3. Comparing the different cell lines under normal (control) conditions is also of interest to me. Do I have to do a batch correction again on the matrix of control-only samples, or is the global correction that was done (control and LPS) sufficient?
I would be truly thankful for any help.
Sincerely, a fellow scientist who is way in over his head
:)
I will respond to each question in turn. Before I do, a general comment: public data from different labs is always going to be challenging. Different labs may have used different kits, different sequencing machines, different library preparation protocols, etc., all of which can introduce batch effects (technical variation). In such cases, it is always best to explore the data first via PCA or other dimensionality reduction methods to see if there is indeed a strong batch effect. For example, you can use my PCAtools package in R / Bioconductor for that. If there is a strong batch effect, then correction is warranted; otherwise, you may not need to do anything and can just include 'batch' in your design formula for differential expression analysis (in DESeq2 or edgeR). On that note, I assume that you are using DEBrowser with DESeq2 or edgeR under the hood(?). If so, then that is fine.
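For illustration, here is a minimal sketch of that PCA check. The object names counts (gene-by-sample raw counts) and meta (sample table with batch and treatment columns) are placeholders for your own data, not anything produced by DEBrowser.

```r
library(DESeq2)
library(PCAtools)

# 'counts' = raw count matrix (genes x samples); 'meta' = data.frame whose
# rownames match colnames(counts), with factor columns 'batch' and 'treatment'
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = meta,
                              design    = ~ batch + treatment)
vst_mat <- assay(vst(dds, blind = TRUE))   # variance-stabilised expression

p <- pca(vst_mat, metadata = meta, removeVar = 0.1)   # drop the 10% lowest-variance genes
biplot(p, colby = 'batch', shape = 'treatment', legendPosition = 'right')
screeplot(p)   # how much variance each PC explains
```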
In terms of criteria, your approach is basically fine. TMM is a good normalisation method for RNA-seq, and ComBat-seq is a good method for batch correction of count data (it is based on the original ComBat method but adapted for count data). Assigning a number to each dataset is a good proxy for 'batch', as each dataset likely represents a different lab / experimental batch. To assess if it is 'working', plot PCA bi-plots before and after correction - if the samples mix better after correction (i.e., no clear segregation by batch), then it is working. There is no single 'best' method, but other options include just including batch in the design formula during differential expression analysis (i.e., no explicit correction); otherwise, sva can be used to estimate surrogate variables for unknown batches, and these can then be included in the design formula, too.
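To make those two routes concrete, a rough sketch with the same placeholder objects as above:

```r
library(sva)
library(DESeq2)

## Option A: explicit correction with ComBat-seq (returns adjusted counts)
adj_counts <- ComBat_seq(as.matrix(counts),
                         batch = meta$batch,       # your manually assigned dataset number
                         group = meta$treatment)   # preserves the control-vs-LPS signal

## Option B: no explicit correction; model batch in the design formula instead
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = meta,
                              design    = ~ batch + treatment)
dds <- DESeq(dds)
res <- results(dds, contrast = c("treatment", "LPS", "control"))  # assumes these factor levels
```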
For DEG analysis, you should use the batch-corrected counts as input to DESeq2 / edgeR. The original uncorrected counts will still contain the batch effects, which could bias your results. For GO enrichment, you can use the DEG results from the batch-corrected analysis. Just note that ComBat-seq produces adjusted counts that are suitable for input to DESeq2 / edgeR.
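The GO step is being done in iDEP here, but if you ever want to run it in R instead, clusterProfiler is one common option. A hypothetical sketch, assuming a DESeq2 results table res with gene symbols as rownames and human data (swap org.Hs.eg.db for the relevant organism):

```r
library(clusterProfiler)
library(org.Hs.eg.db)

res_df    <- as.data.frame(res)
sig_genes <- rownames(subset(res_df, padj < 0.05 & abs(log2FoldChange) > 1))

ego <- enrichGO(gene          = sig_genes,
                OrgDb         = org.Hs.eg.db,
                keyType       = "SYMBOL",
                ont           = "BP",
                pAdjustMethod = "BH",
                universe      = rownames(res_df))  # background = all tested genes
head(as.data.frame(ego))
```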
For comparing cell lines under control conditions only, I would recommend performing batch correction again on just the control samples. The global correction (on control + LPS) may have over- or under-corrected in ways that are not optimal for the control-only subset. Again, check PCA before / after to see the effect. If the batch effect is not strong in the controls alone, then you may not need to correct (or can just include batch in the design formula).
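A rough sketch of that control-only check, again with the placeholder objects from above:

```r
library(sva)

ctrl        <- meta$treatment == "control"
counts_ctrl <- counts[, ctrl]
meta_ctrl   <- meta[ctrl, , drop = FALSE]

# First inspect PCA on the uncorrected control subset (vst + PCAtools as above).
# If batch separation is strong, correct within the subset:
adj_ctrl <- ComBat_seq(as.matrix(counts_ctrl), batch = meta_ctrl$batch)

# Caution (see the comment thread below): if each cell line comes from a single
# lab, 'batch' and 'cell line' are confounded, and removing the batch effect
# will also remove genuine cell-line differences.
```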
I looked at PCAs before and after correction in DEBrowser. Before correction, samples from the same experiment clustered together irrespective of treatment (i.e., by origin). After correction, control and LPS were separated, with some differences between the cell types. That tells me that the batch effect is minimized (or so I think). Am I correct in assuming that?
For DEGs, I have exported the batch-corrected matrix from DEBrowser and am doing the subsequent analysis with iDEP (DESeq2).
My big concern was batch correction for the control-only subset, so I appreciate your help regarding that. I will also do a batch correction on the control-only subset. From what I have done so far, the control-only samples show clear separation from each other, but I am not sure I can trust that, especially because the two in-house datasets generated in my own lab cluster closer to each other. After correction, this is gone and all of the cell lines kind of lump together.
Is lab confounded with cell line?
Hi, thanks for your reply. Yes, each cell line's data comes from a different lab.
Does it mean each lab contributes both LPS and control, or just one of each? If the latter, then I would strongly recommend not doing this analysis; it's confounded and you would be chasing ghosts. If the former, you could simply use a design of ~ treatment + lab in your DE analysis to adjust for the source of the cells.