Comparing RNA-Seq data to online databases
2.7 years ago
Victoria • 0

Hi

I have RNA-Seq data from patients and have downloaded RNA-Seq control datasets from online databases. What do you recommend for correcting batch effects? What quality controls can I perform during my analyses? Also, if the online data are FPKM and my data are raw counts, how should I approach that?

Any thoughts would be appreciated!

RNA-Seq batch effects • 957 views
2.7 years ago
ATpoint 60k

If you do not have replicates of both groups in both batches, there is nothing you can reliably do: there is no way to distinguish the batch effect from the biological effect. I know this is frustrating, but downloading completely independent data and including them in one statistical analysis with your own samples is not solid at all. If you don't believe me, go and download a couple of RNA-seq datasets from independent studies covering the same cell types, such as several blood cell datasets. Process the raw fastq files identically, combine them into one dataset, normalize (e.g. with DESeq2) and make a PCA, colored either by study or by cell type. You will see (at least I always have so far) that there is a strong batch effect: samples cluster notably by study but not at all by cell type, because different lab protocols, kits, sequencing regimes etc. come on top of the expected biological variation.
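To get a feel for what such a PCA looks like without downloading anything, here is a minimal simulated sketch. Everything in it is an assumption for illustration: the counts are simulated, the study effect is deliberately made larger than the cell-type effect, and a simple log2 counts-per-million transform stands in for a proper DESeq2 normalization. The names (`studies`, `cell_types`, `variance_explained_by`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

n_genes = 500
studies = np.repeat(["study_A", "study_B", "study_C"], 4)   # 12 samples
cell_types = np.tile(["T_cell", "B_cell"], 6)               # both types in every study

# Baseline log2 expression per gene.
base = rng.normal(8.0, 2.0, size=n_genes)

# ASSUMPTION: study shifts (sd 1.0) are larger than cell-type shifts (sd 0.3).
study_shift = {s: rng.normal(0.0, 1.0, size=n_genes) for s in np.unique(studies)}
type_shift = {c: rng.normal(0.0, 0.3, size=n_genes) for c in np.unique(cell_types)}

log_mu = np.stack(
    [base + study_shift[s] + type_shift[c] for s, c in zip(studies, cell_types)]
)
counts = rng.poisson(2.0 ** log_mu)                         # samples x genes

# Simple log2-CPM normalization (a stand-in, not DESeq2's method).
cpm = counts / counts.sum(axis=1, keepdims=True) * 1e6
log_cpm = np.log2(cpm + 1.0)

# PCA via SVD of the gene-centered matrix.
centered = log_cpm - log_cpm.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
scores = u * s                                              # sample coordinates on each PC
explained = s ** 2 / (s ** 2).sum()

def variance_explained_by(labels, values):
    """Fraction of the variance in `values` that lies between label groups."""
    total = ((values - values.mean()) ** 2).sum()
    between = sum(
        (labels == g).sum() * (values[labels == g].mean() - values.mean()) ** 2
        for g in np.unique(labels)
    )
    return between / total

r2_study = variance_explained_by(studies, scores[:, 0])
r2_type = variance_explained_by(cell_types, scores[:, 0])
print(f"PC1 explains {explained[0]:.0%} of the variance")
print(f"PC1 variance between studies:    {r2_study:.2f}")
print(f"PC1 variance between cell types: {r2_type:.2f}")
```

With real data you would replace the simulated `counts` with your combined count matrix and plot `scores[:, 0]` against `scores[:, 1]`, once colored by study and once by cell type.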

Thanks for your answer! I have seen similar behaviour in my own studies so far, where even the RNA prep day affects the PCA. But if I want to continue this analysis anyway, what do you suggest to get the most out of it? I have ERCC spike-ins in my samples.

I gave my opinion, which is that what you aim to do is neither recommended nor meaningful. Data analysis has limitations, and fully confounded experiments cannot be expected to produce meaningful results. ERCC spike-ins will not change that, as they do not correct intrinsic batch effects. The fact that you have raw counts and the processed data are FPKM does not help either. It makes things even worse, as FPKM is not well suited for comparing samples (please use the search function to find out why; it has been discussed extensively before).
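One commonly cited reason for the FPKM problem can be shown with a toy calculation. The sketch below uses the standard FPKM and TPM definitions; the gene lengths and counts are made-up numbers, not data from any study. Gene 1 has the same raw count in both samples, yet its FPKM differs several-fold because a single highly expressed gene inflates the library size of the second sample. The FPKM totals also differ between samples, whereas TPM always sums to one million.

```python
import numpy as np

def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million mapped fragments."""
    per_million = counts.sum() / 1e6
    return counts / per_million / (lengths_bp / 1e3)

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rate = counts / (lengths_bp / 1e3)
    return rate / rate.sum() * 1e6

# Hypothetical toy data: three genes, two samples.
lengths = np.array([1000.0, 2000.0, 4000.0])
sample_1 = np.array([100.0, 200.0, 400.0])
sample_2 = np.array([100.0, 200.0, 4000.0])   # only gene 3 differs

fpkm_1, fpkm_2 = fpkm(sample_1, lengths), fpkm(sample_2, lengths)
tpm_1, tpm_2 = tpm(sample_1, lengths), tpm(sample_2, lengths)

print("FPKM gene 1:", fpkm_1[0], "vs", fpkm_2[0])
print("FPKM totals:", fpkm_1.sum(), "vs", fpkm_2.sum())
print("TPM totals: ", tpm_1.sum(), "vs", tpm_2.sum())
```

Note also that FPKM values cannot be turned back into raw counts unless the original library sizes are known, so published FPKM tables cannot simply be fed into count-based tools.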

So ATpoint, I was thinking about using an approach similar to the one you described in your example. Doing a quick test, I found that the batch effect is so large that not even correcting by study was enough. Do you see the same in your analyses, or was my toy example simply not effective? In the end, it looks like we're just not there yet when it comes to comparing different batches of RNA-seq datasets. Do you agree?
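For what it's worth, a small simulation shows why correcting by study cannot work when study and condition are fully confounded. This is only a sketch under stated assumptions: the data are simulated on a log scale, the "correction" is naive per-study mean-centering (not ComBat or a model covariate), and all names and effect sizes are made up. When all patients come from one study and all controls from another, centering removes the patient-vs-control difference together with the study effect. With both conditions present in both studies, the difference survives.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes = 50
base = rng.normal(8.0, 2.0, size=n_genes)
true_effect = rng.normal(0.0, 1.0, size=n_genes)   # hypothetical log2 fold changes

def make_data(study, condition):
    """Simulate log-expression: baseline + study shift + condition effect + noise."""
    study_shift = {s: rng.normal(0.0, 2.0, size=n_genes) for s in sorted(set(study))}
    rows = [
        base + study_shift[s]
        + (true_effect if c == "patient" else 0.0)
        + rng.normal(0.0, 0.1, size=n_genes)
        for s, c in zip(study, condition)
    ]
    return np.vstack(rows)

def center_by_study(x, study):
    """Naive batch correction: subtract each study's per-gene mean."""
    study = np.asarray(study)
    out = x.copy()
    for s in np.unique(study):
        mask = study == s
        out[mask] -= x[mask].mean(axis=0)
    return out

def group_difference(x, condition):
    """Per-gene mean of patients minus mean of controls."""
    condition = np.asarray(condition)
    return x[condition == "patient"].mean(axis=0) - x[condition == "control"].mean(axis=0)

study = ["own"] * 4 + ["public"] * 4

# Fully confounded: every patient is in "own", every control in "public".
confounded = ["patient"] * 4 + ["control"] * 4
diff_confounded = group_difference(
    center_by_study(make_data(study, confounded), study), confounded
)

# Balanced: both conditions are present in both studies.
balanced = ["patient", "patient", "control", "control"] * 2
diff_balanced = group_difference(
    center_by_study(make_data(study, balanced), study), balanced
)

print("confounded, max |difference| after centering:", np.abs(diff_confounded).max())
print("balanced, correlation with true effect:",
      np.corrcoef(diff_balanced, true_effect)[0, 1])
```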