Remove batch effect for prediction of RNA-seq data
4.8 years ago
Vlad ▴ 10

Hello all,

I am working with RNA-seq data. We ran 7 plates of SMARTseq in different months: 4 plates come from one cell type and 3 plates from another. 5 of those plates look good after QC and analysis, while the other 2 show a clear batch effect (confirmed by several R packages, and also clearly visible in PCA). When we compute DEGs, we can use the edgeR package and include the batch in the design, so that the batch effect is accounted for while identifying DEGs.

Nevertheless, my goal is also to use machine learning to identify which genes drive the difference between cell types 1 and 2. I was thinking of trying random forest and also SVM. But it is hard to compare the 2 cell types if there is a huge batch effect. I was wondering whether it is more trustworthy to do this on uncorrected data (prediction is around 70%), or whether it is better to remove the batch effect first. If the latter, what would be the best approach? Use ComBat to correct for the mean and variance?

We came up with around 60 DEGs between the 2 cell types. When I run a PCA with prcomp on the data frame of 60 genes and 300 cells (which belong to a specific group), I get the following plots:

(This data was normalized with all cells)

This is without scaling:

This is with scaling:


Normally one has to scale the data before computing PCA, but I am a bit sceptical that after scaling the plots look as if there is no batch effect. The SD is the same. I think the influence of the genes affected by the batch effect may be reduced, but I wouldn't expect such a dramatic difference. Am I thinking about this correctly?

Thanks in advance

RNA-Seq batch-effect Random-Forest • 2.2k views

Hi, please see "How to add images to a Biostars post". I made the changes this time.


I will check, thanks


"Normally one has to scale the data before computing pca"

Hi, I'm interested in learning more about this point. Would you happen to have any references that support this statement?



There are several opinions regarding scaling. In some cases you have to scale and in others maybe not. For gene expression profiles, the genes are normally scaled before PCA. This is standard in pipelines such as Seurat, for example.
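To illustrate why scaling can change a PCA so much, here is a minimal sketch in plain Python (not the prcomp call itself). The gene names and values are made up for illustration. PCA on unscaled data decomposes the total variance, so a gene with a much larger variance dominates the leading components; after scaling each gene to unit variance, every gene contributes equally.

```python
import statistics

# Hypothetical toy expression values for two genes across six cells.
# gene_hi has a large spread (for example, shifted in some plates);
# gene_lo varies only a little.
expr = {
    "gene_hi": [0.0, 1.0, 2.0, 50.0, 51.0, 52.0],
    "gene_lo": [1.0, 2.0, 1.0, 2.0, 1.0, 2.0],
}

def variance_share(data):
    """Fraction of the total variance contributed by each gene."""
    var = {gene: statistics.variance(values) for gene, values in data.items()}
    total = sum(var.values())
    return {gene: v / total for gene, v in var.items()}

def standardize(values):
    """Centre to mean 0 and scale to unit sample variance (z-score)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(x - mean) / sd for x in values]

raw_share = variance_share(expr)
scaled_share = variance_share({g: standardize(v) for g, v in expr.items()})

print("share of total variance, unscaled:", raw_share)
print("share of total variance, scaled:  ", scaled_share)
```

In this toy case gene_hi accounts for more than 99% of the total variance before scaling and for half of it afterwards, so a large change in the PCA plot after scaling is not surprising in itself.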

Scaling is important for methods that work with distances, such as SVM.

Here is one useful example.

For randomForest you do not need to scale, since it's not based on distances.
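The difference between distance-based and tree-based methods can be sketched with a small example in plain Python. The cells and values below are hypothetical, and this is only an illustration of the idea, not an SVM or randomForest implementation. It shows that Euclidean distances, and therefore nearest neighbours, can change after z-scoring, while the ordering of cells along a single gene, which is all a threshold split in a tree depends on, stays the same.

```python
import math
import statistics

# Hypothetical toy cells described by two genes on very different scales.
names = ["a", "b", "c", "d"]
raw = [
    (1000.0, 1.0),  # a
    (1010.0, 5.0),  # b
    (1050.0, 1.2),  # c
    (2000.0, 3.0),  # d
]

def zscore_columns(rows):
    """Scale every gene (column) to mean 0 and unit sample variance."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]
    return [tuple((x - m) / s for x, m, s in zip(row, means, sds))
            for row in rows]

def nearest_to_first(points):
    """Name of the cell closest to the first cell by Euclidean distance."""
    dists = {n: math.dist(points[0], p)
             for n, p in zip(names[1:], points[1:])}
    return min(dists, key=dists.get)

def order_by_gene(points, gene_index):
    """Cell names sorted by one gene; tree splits depend only on this order."""
    return [n for n, _ in sorted(zip(names, points),
                                 key=lambda item: item[1][gene_index])]

scaled = zscore_columns(raw)

nearest_raw = nearest_to_first(raw)
nearest_scaled = nearest_to_first(scaled)
order_raw = order_by_gene(raw, 0)
order_scaled = order_by_gene(scaled, 0)

print("nearest to a, unscaled:", nearest_raw)
print("nearest to a, scaled:  ", nearest_scaled)
print("order along gene 1 unchanged:", order_raw == order_scaled)
```

Here the nearest neighbour of cell a switches from b to c once the genes are scaled, because the second gene no longer gets swamped by the first, while the rank order along each gene is untouched.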

