I'm interested in building a classifier for a merged dataset that will be a combination of the following ones:
dataset 1 : 2x Control, 2x Treatment1, 2x Treatment2
dataset 2 : 3x Control, 3x Treatment4, 3x Treatment5
dataset 3 : 2x Control, 2x Treatment6, 3x Treatment7
That means that the final dataset will be:
7x Control, 2x Treatment1, 2x Treatment2, 3x Treatment4, 3x Treatment5, 2x Treatment6, 3x Treatment7
As you can see, the only class the final dataset is balanced for is Control, because Control was the only sample type common to the three initial datasets.
First of all, I know this classifier is not going to be very powerful since the final dataset isn't balanced, but I have some questions and would like to hear your opinions/suggestions.
Q1 : Under normal conditions, where the final dataset is balanced, researchers in the literature use various methods to remove the so-called batch effect; examples include median centering and ComBat. In these balanced cases you can run a PCA or a hierarchical clustering and see the difference within your samples before and after batch removal. But is this necessary in my case? Is it reasonable to run such methods on this unbalanced dataset, or can I just run a quantile normalization on the whole merged dataset without any batch-removal step?
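For context, here is a minimal sketch of what quantile normalization across the merged samples would do (a toy NumPy implementation on random data, not a recommendation over ComBat; in practice you would use an established implementation such as `limma::normalizeQuantiles` in R):

```python
import numpy as np

def quantile_normalize(X):
    """Quantile-normalize a genes x samples matrix.

    Forces every sample (column) onto the same distribution:
    the mean of the sorted values across samples.
    """
    sorted_X = np.sort(X, axis=0)               # sort each sample's values
    mean_quantiles = sorted_X.mean(axis=1)      # reference distribution
    ranks = X.argsort(axis=0).argsort(axis=0)   # rank of each value per sample
    return mean_quantiles[ranks]                # map ranks back to reference

# toy example: 5 genes x 3 samples with different scales (like 3 batches)
rng = np.random.default_rng(0)
X = rng.lognormal(size=(5, 3)) * np.array([1.0, 10.0, 100.0])
Xn = quantile_normalize(X)

# after normalization every column has the identical set of sorted values,
# so sample-wide distributional (but not gene-wise batch) differences are gone
print(np.allclose(np.sort(Xn, axis=0), np.sort(Xn, axis=0)[:, [0]]))
```

Note that this equalizes the overall distribution per sample, but it does not model a per-gene batch term the way ComBat does, which is exactly the trade-off the question is about.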
Q2 : The last question is about classifying such datasets with different methods (random forests, SVM, etc.). I noticed that in some articles researchers first identify the differentially expressed genes (DEGs) and then run the classifiers on those, while others run the classification algorithms on all of the data without first finding the DEGs. Is there any real difference between these two approaches? Does anything change from a statistics/machine-learning perspective?
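One concrete ML point behind this question is information leakage: if the DEGs are selected on the full dataset before cross-validation, the test folds have already influenced the feature set. A small sketch of the leak-free version, using scikit-learn on synthetic data as a stand-in for an expression matrix (the k=50 cutoff and forest size are arbitrary illustration choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# synthetic stand-in for an expression matrix: 20 samples x 1000 genes
X, y = make_classification(n_samples=20, n_features=1000,
                           n_informative=10, random_state=0)

# DEG-style univariate filtering done INSIDE the CV pipeline, so the
# "DEGs" are re-selected on each training fold -- no test-set leakage
pipe = Pipeline([
    ("degs", SelectKBest(f_classif, k=50)),
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
])
scores_filtered = cross_val_score(pipe, X, y, cv=5)

# classifier on all genes, no pre-filtering, for comparison
rf_all = RandomForestClassifier(n_estimators=200, random_state=0)
scores_all = cross_val_score(rf_all, X, y, cv=5)

print("with DEG filter:", scores_filtered.mean())
print("all genes:      ", scores_all.mean())
```

Whether the filtered or unfiltered pipeline wins depends on the data, but wrapping the selection in the pipeline is what keeps the comparison honest.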
Any advice, opinion, or hint is welcome.