Remove batch effect between several **RNA-seq** studies (no replication)
3.9 years ago
snstab • 0

Hi. How can I use ComBat to remove the batch effect between several RNA-seq studies for which count data (CPM) are available? The number of control and treatment samples varies between studies, and one study has no replication: it has only two samples (one control and one treatment). This study is very important and I cannot remove it. How can I combine this unreplicated study with the other studies? Please help me if anyone knows a solution. Thank you very much.

RNA-Seq • 3.3k views

If that one study is very important and you cannot remove it, then how can you comment on or correct the batch effect? There is only a single batch.

Thanks for your reply. I want to remove the batch effect between several RNA-seq studies, one of which has only two samples. My question is, how do I combine this unreplicated study with the other studies? I mean, how do I set up the command for this situation?

If you cannot remove that study, does it mean it has information that the other studies don't? In that case, that study has 2 different variables (that and the batch), so you don't know how much each is contributing.

Yes, it has information that the other studies don't have. This study has 2 samples (1 treatment and 1 control), and these samples have no replication. My question is how to correct batch effects between studies when one of them is in this situation. As far as I have seen and researched, in order to correct the batch effect between different studies, every study must have replicated samples. Can I correct the batch effect between studies when one of them has these conditions? And if I can, how do I set up the command for this study? Thanks.

3.9 years ago
ATpoint 82k

I do not understand why people always assume that random (and even unreplicated) studies can meaningfully be merged. RNA-seq is strongly confounded by study: beyond the biological variation, the choice of kits for RNA extraction, reverse transcription and library preparation has a notable influence on the inferred transcript/gene abundances, i.e. the counts.

I would rather analyze every study separately and then perform a meta-analysis, e.g. using ranks as in this paper. The idea of ranks is that one calculates a signed significance score per gene, e.g. the sign of the fold change multiplied by -log10(p-value), and then compares the resulting ranks across studies for all genes. Significantly upregulated genes get high ranks, significantly downregulated genes get low ranks, and non-significant genes get intermediate ranks. If a gene is reproducibly up- or downregulated across studies, then it should consistently be assigned a high or low rank and therefore get a low p-value in the meta-analysis.
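As a rough illustration of the ranking idea, here is a minimal Python sketch. It is not the method of the linked paper: the gene names and numbers are made up, `signed_score` and `normalized_ranks` are hypothetical helper names, and the plain average of ranks is only a stand-in for a proper rank-aggregation test that would return a p-value.

```python
import math

def signed_score(log2_fc, p_value, floor=1e-300):
    """sign(log2FC) * -log10(p): large positive = confident up, large negative = confident down."""
    strength = -math.log10(max(p_value, floor))  # floor avoids log10(0)
    return math.copysign(strength, log2_fc) if log2_fc != 0 else 0.0

def normalized_ranks(scores):
    """Map gene -> rank in (0, 1]; the highest score gets rank 1.0."""
    ordered = sorted(scores, key=lambda g: (scores[g], g))  # ascending, ties broken by name
    n = len(ordered)
    return {gene: (i + 1) / n for i, gene in enumerate(ordered)}

# Hypothetical per-study results: gene -> (log2 fold change, p-value)
study_a = {"G1": (2.0, 0.001), "G2": (-1.5, 0.01), "G3": (0.1, 0.8)}
study_b = {"G1": (1.2, 0.02), "G2": (-2.2, 0.0005), "G3": (-0.2, 0.6)}

# Rank within each study, so the studies never share a scale or a count matrix
ranks = [
    normalized_ranks({g: signed_score(fc, p) for g, (fc, p) in study.items()})
    for study in (study_a, study_b)
]

# Naive stand-in for the meta-analysis step: average rank per gene
mean_rank = {g: sum(r[g] for r in ranks) / len(ranks) for g in ranks[0]}
print(mean_rank)
```

Genes that are consistently up end up near 1, consistently down near the bottom, and inconsistent or non-significant genes in between. Each study is only ever compared with itself, which is why no batch correction is involved.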

Based on my understanding this has several advantages:

  1. You don't have to bother with batch correction and its choice of parameters, which may or may not alter the true biological effects.
  2. You can even use an underpowered / unreplicated study if it is really as important as you say, e.g. analyzed with NOISeq, as in the end it only matters which ranks the genes have per study. One single study with n=1 of course does not give any robustness, but if you combine several studies (including studies with sufficient replicates) and then still find genes with consistently high ranks (or low ranks in the case of downregulation), then the results can still be powerful and reliable. In addition, robust meta-analysis tools limit the influence of a single study, so even if that one study is flawed in terms of replication and/or data quality, you could still obtain significant results from the meta-analysis.
  3. You can easily extend the analysis if a new study is to be included at some point: the rank calculation itself, e.g. with the tool that the linked paper developed, is fast, and it does not change the analysis result of each individual study, so the analysis effort is limited.

The above points are based on my (limited) experience with meta-analysis, so feel free to comment if you (dis)agree.

Hi. Thank you for your reply and comment. I was very keen on having p-values for the meta-analysis, but unfortunately the NOISeq output is a probability and not a p-value. I had to use NOISeq because some of my studies are unreplicated, so in the end I had to stick with the count data. Do you have an idea for converting the probability (NOISeq output) to a p-value?

All you need for a ranking is a fold change and a measure of confidence that a gene is differential or not. You convert these into any kind of metric that ranks the genes by confidence and direction of change, so that positive fold changes with high confidence rank highest and negative fold changes with high confidence rank lowest. It does not matter whether the confidence measure is called a p-value, probability, log-odds score or similar. You only have to rank the results and then put all studies into the meta-analysis.
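A minimal Python sketch of that conversion, for the case where the confidence measure is a probability of differential expression (as the poster describes for NOISeq) rather than a p-value. The function names, gene names and values are hypothetical and only illustrate the ordering.

```python
import math

def score_from_probability(log2_fc, prob):
    """Direction times confidence, for a confidence given as a probability in [0, 1]."""
    return math.copysign(prob, log2_fc) if log2_fc != 0 else 0.0

def rank_genes(scores):
    """Order genes from most confidently down to most confidently up."""
    return sorted(scores, key=lambda g: (scores[g], g))

# Hypothetical unreplicated study: gene -> (log2 fold change, probability of being differential)
unreplicated = {
    "G1": (3.1, 0.97),
    "G2": (-2.4, 0.97),
    "G3": (0.3, 0.40),
    "G4": (-0.1, 0.15),
}

scores = {g: score_from_probability(fc, prob) for g, (fc, prob) in unreplicated.items()}
order = rank_genes(scores)
print(order)  # ['G2', 'G4', 'G3', 'G1']
```

Because only the within-study order is passed on, a study scored by probability can sit next to studies scored by p-value in the same rank-based meta-analysis without any conversion between the two.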

Thank you very much for your comments. Do your answers mean that 1) replication of samples in each study is not an important issue, and the batch effect between several studies can be corrected even without replicated samples in each study? 2) Even with counts or probabilities, can the batch effect be removed without converting them? Section 8 of the sva guide in Bioconductor gives commands for working with RNA-seq counts, but these commands contain the replication option, and I don't know how to change the command for unreplicated studies.

> 1) Replication of samples in each study is not an important issue

I clearly said that in a traditional pairwise analysis, if you make claims based on a single study, it is essential to have replicates. If you combine multiple studies by meta-analysis, then the power of the meta-analysis could in part compensate for the lack of power of each individual study. That is why I said you could still try to incorporate the unreplicated study into the meta-analysis. If you don't do a meta-analysis, then you cannot use the study for differential expression. I clearly do not recommend using underpowered studies alone!

> 2) Even with the counts or probabilities, can the batch be removed without being converted?

The whole point of my answer, which I think is clearly expressed, is that you should not do any batch correction. A batch effect is unwanted technical variation between samples and/or studies that corrupts your results. That is why you should analyze each study independently (to save you from any batch effects between studies) and then use the meta-analysis to see whether the results are consistent enough to build a hypothesis from. I do not recommend any batch correction and/or merging between studies, especially in RNA-seq, since experiments are strongly confounded by various technical factors, as stated in my answer.

> but these commands contain the replication option

Conclusion: Don't do batch correction with underpowered studies (or at all).

Thank you for your kindness in answering my questions. With respect

ATpoint, I was really fortunate to have bumped into this post and especially your reply. I decided to try the RobustRankAggreg tool in R to do the meta-analysis. Can you please tell me whether the process I adopted is correct? I performed DEG analysis on three separate RNA-seq datasets, so I had three lists of DEGs. I first separated the up- and down-regulated genes and ranked them by fold change, in decreasing order for the up-regulated genes and in increasing order for the down-regulated genes. I did this because the documentation of the method states that

> glist = list of element vectors, the order of the vectors is used as the ranking

So I had one list of all ranked up-regulated genes and one list of all ranked down-regulated genes. Then I used aggregateRanks() on each list to obtain the output.
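For readers following along, the list preparation described here can be sketched in Python as below. This only mirrors the poster's description (ranking by fold change alone, whereas the answer above suggests combining direction with a confidence measure); it does not call or reimplement RobustRankAggreg, and the study names, gene names and fold changes are made up.

```python
def ranked_lists(deg_table):
    """Split one study's DEGs (gene -> log2 fold change) into up and down lists, strongest first."""
    up = sorted((g for g, fc in deg_table.items() if fc > 0), key=lambda g: -deg_table[g])
    down = sorted((g for g, fc in deg_table.items() if fc < 0), key=lambda g: deg_table[g])
    return up, down

# Hypothetical DEG tables from three separately analyzed studies
studies = {
    "study1": {"G1": 3.2, "G2": 1.1, "G5": -2.0},
    "study2": {"G2": 2.5, "G1": 2.0, "G5": -1.0, "G6": -3.0},
    "study3": {"G1": 1.8, "G6": -0.9},
}

# One ranked vector per study; each collection would be aggregated separately
up_lists = [ranked_lists(table)[0] for table in studies.values()]
down_lists = [ranked_lists(table)[1] for table in studies.values()]

print(up_lists)    # [['G1', 'G2'], ['G2', 'G1'], ['G1']]
print(down_lists)  # [['G5'], ['G6', 'G5'], ['G6']]
```

Note that the per-study lists have different lengths and only partly overlap, which is the usual situation when each list contains only that study's DEGs.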
