I am working with RNA-seq data. We ran 7 plates of SMART-seq in different months: 4 plates from one cell type and 3 plates from the other. 5 of the plates look good after QC and analysis, while the other 2 show a clear batch effect (confirmed by several R packages, and also clearly visible in a PCA). For calling DEGs we can use edgeR and include the batch in the design matrix, so the batch effect is accounted for while testing for differential expression (see the sketch below).

Nevertheless, my goal is also to use machine learning to find which genes drive the difference between cell type 1 and cell type 2; I was thinking of trying a random forest and an SVM. But it is hard to compare the two cell types with such a strong batch effect. Is it more trustworthy to train on the uncorrected data (prediction accuracy is around 70%), or is it better to first remove the batch effect? If the latter, what would be the best approach? ComBat, to correct the per-batch mean and variance?
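To make the setup concrete, here is roughly what I mean (a minimal sketch, not my exact script; `counts`, `batch`, and `cell_type` are placeholders for my count matrix and the run / cell-type annotations):

```r
library(edgeR)
library(sva)

# Placeholders: counts = genes x cells raw count matrix,
# batch = factor marking the sequencing runs (must not be completely
# confounded with cell_type, or the design is not of full rank),
# cell_type = factor with the 2 cell types.
design <- model.matrix(~ batch + cell_type)

# DEGs with batch in the design (edgeR quasi-likelihood pipeline)
y <- DGEList(counts = counts)
y <- calcNormFactors(y)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit, coef = ncol(design))  # last column = cell-type coefficient
topTags(qlf)

# ComBat-corrected expression for the ML step (per-batch mean/variance adjustment)
logcpm <- cpm(y, log = TRUE)
corrected <- ComBat(dat = logcpm, batch = batch,
                    mod = model.matrix(~ cell_type))  # preserve the cell-type signal
```

The random forest / SVM would then be trained either on `logcpm` (uncorrected) or on `corrected` — that is exactly the choice I am unsure about.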
We came up with around 60 DEGs between the two cell types. When I run a PCA with prcomp on the data frame of these 60 genes across 300 cells (all belonging to one specific group), I get the following plots (the data were normalized together with all cells):
This is without scaling:
This is with scaling:
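For reference, the two plots come from something like this (a sketch; `expr` stands for my 300-cell × 60-gene expression matrix and `batch` for the plate labels):

```r
# expr: 300 cells (rows) x 60 DEGs (columns), normalized expression
pca_raw    <- prcomp(expr, center = TRUE, scale. = FALSE)  # without scaling
pca_scaled <- prcomp(expr, center = TRUE, scale. = TRUE)   # with unit-variance scaling

op <- par(mfrow = c(1, 2))
plot(pca_raw$x[, 1:2],    col = batch, pch = 19, main = "Without scaling")
plot(pca_scaled$x[, 1:2], col = batch, pch = 19, main = "With scaling")
par(op)
```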
Normally one should scale the data before computing a PCA, but I am a bit sceptical that after scaling the plots look as if there were no batch effect at all. After scaling, every gene has the same SD (namely 1). My thinking is that this reduces the influence of the genes affected by the batch effect, but I would not have expected such a dramatic difference. Am I thinking about this correctly?
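If it helps, this is the kind of check I had in mind to see whether a few high-variance, batch-driven genes dominate the unscaled PCA (a sketch; it reuses `expr`, `batch`, and `pca_raw` from above, and assumes `batch` is a factor):

```r
# Per-gene SD vs. contribution to the unscaled PC1
sds <- apply(expr, 2, sd)
pc1_loading <- abs(pca_raw$rotation[, 1])
plot(sds, pc1_loading, xlab = "gene SD", ylab = "|PC1 loading| (unscaled)")

# Per-gene batch-effect strength (one-way ANOVA F statistic across batches)
batch_f <- apply(expr, 2, function(g) summary(aov(g ~ batch))[[1]][["F value"]][1])

# If the high-SD genes are also the batch-affected ones, scaling would
# shrink exactly their influence on the leading PCs
cor(sds, batch_f, method = "spearman")
```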
Thanks in advance