Question: Remove batch effect for prediction of RNA-seq data
Written 15 months ago by Vlad:

Hello all,

I am working with RNA-seq data. We ran 7 plates of SMARTseq in different months: 4 plates come from one cell type and 3 plates come from another. 5 of those plates look good after QC and analysis, while the other 2 show a clear batch effect (confirmed with several R packages, and also clearly visible in a PCA). When we test for DEGs we can use the edgeR package and include the batch in the design, and in that way account for the batch effect when calling DEGs. Nevertheless, my goal is also to use machine learning to predict which genes are driving the difference between cell types 1 and 2. I was thinking about trying random forest and also SVM, but it is hard to compare the 2 cell types when there is a huge batch effect. I was wondering whether it is more trustworthy to do this on uncorrected data (prediction is around 70%) or whether it is better to first try to remove the batch effect. If the latter, what would be the best approach? Use ComBat to correct the mean and variance?
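One way to check whether a classifier is picking up cell type rather than plate is to hold out whole plates during cross-validation, so that no plate contributes cells to both the training and the test set. The snippet below is only a minimal sketch of that splitting step in plain Python; the function name `leave_one_plate_out` and the toy plate labels are hypothetical, and it makes no claim about which classifier or correction method to use afterwards.

```python
from collections import defaultdict


def leave_one_plate_out(plates):
    """Return a list of (held_out_plate, train_idx, test_idx) tuples.

    `plates` holds one plate label per cell. Each split keeps every cell
    of one plate for testing and all remaining cells for training.
    """
    by_plate = defaultdict(list)
    for i, plate in enumerate(plates):
        by_plate[plate].append(i)

    splits = []
    for held_out in sorted(by_plate):
        test_idx = by_plate[held_out]
        train_idx = sorted(
            i for plate, idx in by_plate.items() if plate != held_out for i in idx
        )
        splits.append((held_out, train_idx, test_idx))
    return splits


# Hypothetical toy layout: 3 plates with 2 cells each.
plates = ["P1", "P1", "P2", "P2", "P3", "P3"]
splits = leave_one_plate_out(plates)
for held_out, train_idx, test_idx in splits:
    print(held_out, train_idx, test_idx)
```

If accuracy drops sharply on held-out plates compared with a random split of cells, that would suggest the model is partly learning the plate.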

We came up with around 60 DEGs between the 2 cell types. When I run a PCA with prcomp on the data frame that has 60 genes and 300 cells (which belong to a specific group), I get the following plots. (The data were normalized using all cells.)

This is without scaling: [image]

This is with scaling: [image]

Normally one has to scale the data before computing a PCA, but I am a bit sceptical that after scaling the plots look as if there is no batch effect. The SD is the same. I am thinking that maybe the influence of the genes affected by the batch effect is reduced, but I wouldn't expect such a dramatic difference. Am I thinking right?
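To make the intuition about scaling concrete: PCA decomposes the total variance of the input, so on unscaled data a gene with a very large variance can dominate the first components, while after z-scoring every gene contributes the same variance. The sketch below does not run a PCA; it only computes each gene's share of the total variance before and after scaling, using made-up values for three hypothetical genes (`geneA`, `geneB`, `geneC`).

```python
from statistics import mean, pstdev, pvariance


def zscore(values):
    """Center to mean 0 and scale to (population) standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def variance_share(genes):
    """Fraction of the total variance contributed by each gene."""
    var = {g: pvariance(v) for g, v in genes.items()}
    total = sum(var.values())
    return {g: var[g] / total for g in genes}


# Made-up expression values for 4 cells; geneA has a much larger spread.
genes = {
    "geneA": [100.0, 300.0, 100.0, 300.0],
    "geneB": [1.0, 2.0, 2.0, 1.0],
    "geneC": [5.0, 5.5, 4.5, 5.0],
}

raw_share = variance_share(genes)
scaled_share = variance_share({g: zscore(v) for g, v in genes.items()})

print("unscaled:", raw_share)
print("scaled:  ", scaled_share)
```

In this toy case `geneA` accounts for almost all of the unscaled variance, and for exactly one third after scaling. If the batch effect sits mostly in a few high-variance genes, a large change in the PCA plot after scaling would be consistent with that, although this sketch cannot confirm it for the real data.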

Thanks in advance

modified 15 months ago by ATpoint • written 15 months ago by Vlad

Hi, please see How to add images to a Biostars post, I made the changes this time.

written 15 months ago by ATpoint

I will check, thanks

written 15 months ago by Vlad

"Normally one has to scale the data before computing pca"

Hi, I'm interested in learning more about this point. Would you happen to have any references that support this statement?

written 15 months ago by _r_am


There are several opinions regarding scaling. In some cases you have to scale and in others maybe not. For gene expression profiles, the genes are normally scaled before PCA. This is part of the standard pipeline of, for example, Seurat.

Scaling is important for methods that work with distances, such as SVM.

Here is one useful example.

For randomForest you do not need to scale, since it is not based on distances.
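The difference between the two kinds of methods can be illustrated with a small sketch. It is not an SVM or a random forest; it only shows, on made-up numbers, that Euclidean distances (and therefore nearest neighbours) can change after z-scoring the features, while the ordering of samples along each feature, which is all a tree-style threshold split depends on, stays the same. All names here are hypothetical.

```python
from math import dist
from statistics import mean, pstdev


def zscore_columns(rows):
    """Z-score each feature (column) across samples (rows)."""
    cols = list(zip(*rows))
    stats = [(mean(c), pstdev(c)) for c in cols]
    return [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in rows]


def nearest(rows, query_idx):
    """Index of the sample closest to `query_idx` in Euclidean distance."""
    others = [i for i in range(len(rows)) if i != query_idx]
    return min(others, key=lambda i: dist(rows[query_idx], rows[i]))


def rank_order(rows, feature):
    """Sample indices sorted by one feature; threshold splits depend only on this."""
    return sorted(range(len(rows)), key=lambda i: rows[i][feature])


# Made-up data: feature 0 is on a much larger scale than feature 1.
raw = [[1000.0, 1.0], [1010.0, 5.0], [1200.0, 1.2], [1500.0, 3.0]]
scaled = zscore_columns(raw)

print("nearest to sample 0, raw:   ", nearest(raw, 0))
print("nearest to sample 0, scaled:", nearest(scaled, 0))
print("feature 1 order, raw:   ", rank_order(raw, 1))
print("feature 1 order, scaled:", rank_order(scaled, 1))
```

On the raw values the large-scale feature decides which sample is nearest; after scaling both features count equally and the nearest sample changes, whereas the per-feature ordering is identical in both cases.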

modified 15 months ago by _r_am • written 15 months ago by Vlad