Hi everyone, I am not a computational biologist. Having said that, I work with a lot of transcriptomic data, and thanks to tools like iDEP, I have been able to manage fairly well.
I have downloaded publicly available datasets (bulk RNA-seq) of different cell lines of a cell type: control vs LPS for every cell line. They all come from different labs. I understand that I have to adjust for batch effects, which I did using DEBrowser. I have 3 questions on this front:
1. What criteria best suit this kind of data in terms of batch correction? I used TMM and ComBat-seq. 'Batch' was just a number I added manually; each dataset got a number.
2. Do I move on to DEG and GO analysis using the batch-corrected values, or the original uncorrected raw counts?
3. Comparing the different cell lines under normal (control) conditions is also of interest to me. Do I have to do a batch correction again on the matrix of control-only samples, or is the global correction that was done (control and LPS) sufficient?
I would be truly thankful for any help.
Sincerely, a fellow scientist who is way in over his head
:)
I will respond to each question in turn. Before I do, a general comment: public data from different labs is always going to be challenging. Different labs may have used different kits, different sequencing machines, different library preparation protocols, etc., all of which can introduce batch effects (technical variation). In such cases, it is always best to explore the data first via PCA or other dimensionality reduction methods to see if there is indeed a strong batch effect. For example, you can use my PCAtools package in R / Bioconductor for that. If there is a strong batch effect, then correction is warranted; otherwise, you may not need to do anything and can just include 'batch' in your design formula for differential expression analysis (in DESeq2 or edgeR). On that note, I assume that you are using DEBrowser with DESeq2 or edgeR under the hood(?). If so, then that is fine.
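For illustration, here is a minimal sketch of that PCA check. The object names counts (gene-by-sample raw counts) and meta (sample table with batch and treatment columns) are placeholders for your own data, not anything produced by DEBrowser.

```r
library(DESeq2)
library(PCAtools)

# 'counts' = raw count matrix (genes x samples); 'meta' = data.frame whose
# rownames match colnames(counts), with factor columns 'batch' and 'treatment'
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = meta,
                              design    = ~ batch + treatment)
vst_mat <- assay(vst(dds, blind = TRUE))   # variance-stabilised expression

p <- pca(vst_mat, metadata = meta, removeVar = 0.1)   # drop the 10% lowest-variance genes
biplot(p, colby = 'batch', shape = 'treatment', legendPosition = 'right')
screeplot(p)   # how much variance each PC explains
```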
In terms of criteria, your approach is basically fine. TMM is a good normalisation method for RNA-seq, and ComBat-seq is a good method for batch correction of count data (it is based on the original ComBat method but adapted for count data). Assigning a number to each dataset is a good proxy for 'batch', as each dataset likely represents a different lab / experimental batch. To assess if it is 'working', plot PCA bi-plots before and after correction - if the samples mix better after correction (i.e., no clear segregation by batch), then it is working. There is no single 'best' method, but other options include just including batch in the design formula during differential expression analysis (i.e., no explicit correction); otherwise, sva can be used to estimate surrogate variables for unknown batches, and these can then be included in the design formula, too.
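To make those two routes concrete, a rough sketch with the same placeholder objects as above:

```r
library(sva)
library(DESeq2)

## Option A: explicit correction with ComBat-seq (returns adjusted counts)
adj_counts <- ComBat_seq(as.matrix(counts),
                         batch = meta$batch,       # your manually assigned dataset number
                         group = meta$treatment)   # preserves the control-vs-LPS signal

## Option B: no explicit correction; model batch in the design formula instead
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = meta,
                              design    = ~ batch + treatment)
dds <- DESeq(dds)
res <- results(dds, contrast = c("treatment", "LPS", "control"))  # assumes these factor levels
```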
For DEG analysis, you should use the batch-corrected counts as input to DESeq2 / edgeR. The original uncorrected counts will still contain the batch effects, which could bias your results. For GO enrichment, you can use the DEG results from the batch-corrected analysis. Just note that ComBat-seq produces adjusted counts that are suitable for input to DESeq2 / edgeR.
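The GO step is being done in iDEP here, but if you ever want to run it in R instead, clusterProfiler is one common option. A hypothetical sketch, assuming a DESeq2 results table res with gene symbols as rownames and human data (swap org.Hs.eg.db for the relevant organism):

```r
library(clusterProfiler)
library(org.Hs.eg.db)

res_df    <- as.data.frame(res)
sig_genes <- rownames(subset(res_df, padj < 0.05 & abs(log2FoldChange) > 1))

ego <- enrichGO(gene          = sig_genes,
                OrgDb         = org.Hs.eg.db,
                keyType       = "SYMBOL",
                ont           = "BP",
                pAdjustMethod = "BH",
                universe      = rownames(res_df))  # background = all tested genes
head(as.data.frame(ego))
```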
For comparing cell lines under control conditions only, I would recommend performing batch correction again on just the control samples. The global correction (on control + LPS) may have over- or under-corrected in ways that are not optimal for the control-only subset. Again, check PCA before / after to see the effect. If the batch effect is not strong in the controls alone, then you may not need to correct (or can just include batch in the design formula).
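A rough sketch of that control-only check, again with the placeholder objects from above:

```r
library(sva)

ctrl        <- meta$treatment == "control"
counts_ctrl <- counts[, ctrl]
meta_ctrl   <- meta[ctrl, , drop = FALSE]

# First inspect PCA on the uncorrected control subset (vst + PCAtools as above).
# If batch separation is strong, correct within the subset:
adj_ctrl <- ComBat_seq(as.matrix(counts_ctrl), batch = meta_ctrl$batch)

# Caution (see the comment thread below): if each cell line comes from a single
# lab, 'batch' and 'cell line' are confounded, and removing the batch effect
# will also remove genuine cell-line differences.
```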
I looked at PCAs before and after correction in DEBrowser. Before correction, samples from the same experiment clustered together irrespective of treatment (i.e., by origin). After correction, control and LPS were separated, with some differences between the cell types. That tells me that the batch effect is minimized (or so I think). Am I correct in assuming that?
For DEGs, I have exported the batch-corrected matrix from DEBrowser and am doing the subsequent analysis with iDEP (DESeq2).
My big concern was batch correction for the control-only subset, so I appreciate your help regarding that. I will also do a batch correction on the control-only subset. From what I have done so far, the control-only samples show clear separation from each other, but I am not sure I can trust that, especially because the two in-house datasets generated in my own lab cluster closer to each other. After correction, this is gone and all of the cell lines kind of lump together.
Is lab confounded with cell line?
Hi, thanks for your reply. Yes, each cell line's data comes from a different lab.
Does it mean each lab contributes both LPS and control, or just one of each? If the latter, then I would strongly recommend not doing this analysis; it's confounded and you would be chasing ghosts. If the former, you could simply use a design of ~ treatment + lab in your DE analysis to adjust for the source of the cells.