Question

Questions on merging microarray datasets and different ways of building a classifier on them.

6.4 years ago

arronar ▴ 280

Hello.

I'm interested in building a classifier for a merged dataset that will be a combination of the following ones:

dataset 1: 2x Control, 2x Treatment1, 2x Treatment2
dataset 2: 3x Control, 3x Treatment4, 3x Treatment5
dataset 3: 2x Control, 2x Treatment6, 3x Treatment7

That means that the final dataset will be:

7x Control, 2x Treatment1, 2x Treatment2, 3x Treatment4, 3x Treatment5, 2x Treatment6, 3x Treatment7

As you can see, the only sample type that the final dataset is balanced for is Control, because Control was the only sample type common to the 3 initial datasets.
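For illustration, the merged design can be tallied in a few lines of Python. This is only a sketch: the counts are copied from the lists above, and the variable names are arbitrary.

```python
from collections import Counter

# Sample counts per dataset, copied from the post
datasets = {
    "dataset1": {"Control": 2, "Treatment1": 2, "Treatment2": 2},
    "dataset2": {"Control": 3, "Treatment4": 3, "Treatment5": 3},
    "dataset3": {"Control": 2, "Treatment6": 2, "Treatment7": 3},
}

# Total number of samples per class in the merged dataset
merged = Counter()
for counts in datasets.values():
    merged.update(counts)

# Number of datasets in which each class appears
batches_per_class = Counter(
    label for counts in datasets.values() for label in counts
)

# Classes that are present in every dataset
shared = sorted(
    label for label, n in batches_per_class.items() if n == len(datasets)
)

print(dict(merged))
print(shared)
```

Every treatment appears in exactly one dataset, so Control is the only class that can be compared across all three.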

First of all, I know that this classifier is not going to be very powerful since the final dataset isn't balanced, but I have some questions and I would like to read your opinions/suggestions.

Q1: Under normal conditions, where the final dataset is balanced, researchers in the literature use many different methods to remove the so-called batch effect, such as median centering and ComBat. In those balanced cases, you can run a PCA or a hierarchical clustering and compare your samples before and after batch removal. But is this necessary in my case? Is it reasonable to run such methods on this unbalanced dataset, or can I run a quantile normalization on the whole dataset without any batch removal method?
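To make the median centering mentioned in Q1 concrete, here is a minimal sketch in plain Python. It assumes log-scale expression values stored as one list per gene; the function name `median_center` and the toy numbers are invented for illustration and do not come from any particular package.

```python
from statistics import median

def median_center(expr, batch):
    """Subtract, for every gene, the median of that gene within each batch.

    expr  : dict mapping gene -> list of expression values (one per sample)
    batch : list of batch labels, in the same order as the samples
    """
    centered = {}
    for gene, values in expr.items():
        # Median of this gene within each batch
        med = {
            b: median(v for v, lab in zip(values, batch) if lab == b)
            for b in set(batch)
        }
        centered[gene] = [v - med[lab] for v, lab in zip(values, batch)]
    return centered

# Toy log2 values, invented for illustration only
batch = ["A", "A", "A", "B", "B", "B"]
expr = {"gene1": [5.0, 6.0, 7.0, 9.0, 10.0, 11.0]}

centered = median_center(expr, batch)
print(centered["gene1"])
```

Centering this way removes any shift between batches, whatever its origin, so in a design where each treatment sits in a single batch it cannot tell a technical shift from a treatment-driven one.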

Q2: The last question is about classifying such datasets with different methods (random forests, SVM, etc.). I noticed that in some articles researchers first identify the differentially expressed genes (DEGs) and then run the classifiers on them, while others run the classification algorithms on all of the data without first finding the DEGs. Is there any real difference between those two approaches? Does anything change from a statistics/machine learning perspective?
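As a rough illustration of the "DEGs first" approach in Q2, the sketch below ranks genes by a Welch t-statistic and keeps the top `k` before any classifier is trained. The t-statistic only stands in for a proper differential expression analysis, and the function names, labels, and values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch t-statistic between two groups of expression values."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

def top_genes(expr, labels, group_a, group_b, k):
    """Return the k genes with the largest absolute t-statistic."""
    scores = {}
    for gene, values in expr.items():
        a = [v for v, lab in zip(values, labels) if lab == group_a]
        b = [v for v, lab in zip(values, labels) if lab == group_b]
        scores[gene] = abs(welch_t(a, b))
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy values, invented for illustration only
labels = ["Control", "Control", "Control", "Treated", "Treated", "Treated"]
expr = {
    "geneA": [5.0, 5.2, 4.9, 8.1, 7.9, 8.3],  # clear shift between groups
    "geneB": [6.0, 6.5, 5.8, 6.1, 6.4, 5.9],  # essentially no shift
    "geneC": [7.0, 7.4, 6.8, 7.9, 8.2, 7.6],  # milder shift
}

selected = top_genes(expr, labels, "Control", "Treated", k=2)
print(selected)
```

If such a filter is used, it would normally be fitted on the training samples only, so that held-out samples do not influence which genes are kept.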

Any advice, opinion, or hint is welcome.

Thank you.

microarray classifier merge R machine learning • 1.7k views

updated 6.4 years ago by svlachavas ▴ 790 • written 6.4 years ago by arronar ▴ 280

score 1 · Answer 1 · 2017-11-18

Dear arronar,

A very interesting question, but quite challenging: many approaches are possible, it could lead to a very long discussion, and the answer depends heavily on your biological question. I will try to recap quickly:

1) What kind of expression datasets do you have? Microarrays? RNA-Seq? Both?

2) How similar are the datasets with regard to platform and samples? For instance, same cell lines? Same microarray platforms? As you mentioned, are the control samples the only ones common to the 3 datasets?

3) Concerning the machine learning concept: what is your main rationale for merging the 3 datasets? That is, do you want to develop a binary classifier or a multi-label one? Or, more generally, a gene signature with high discriminatory efficiency? Because if you have different drug combinations or perturbagens, I don't get the point of merging the datasets. If you could provide more insight on this matter, the necessity of merging the datasets would be more evident.

4) Regarding batch effect correction: usually, quantile normalization alone would not solve the problem, especially for machine learning approaches. You would still want to correct for batch effects, after of course inspecting some diagnostic plots, such as PCA, MDS, hierarchical clustering, etc. As one general approach: first preprocess and normalize each dataset separately, then merge on common gene identifiers, and then apply batch effect correction.

5) Moreover, the choice of feature selection method is up to you. You could start by using your DE genes as an initial pool and then apply another feature selection methodology, such as an entropy-based criterion, lasso-penalized regression, and many others.

6) Also, keep in mind the fundamental issue of class imbalance. It is a common problem when you have many samples in one class and far fewer in the other classes. Some algorithms can handle this; otherwise, you would have to implement approaches to cope with class imbalance, such as resampling.
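The "merge on common gene identifiers" step from point 4 could look roughly like the sketch below. It assumes each dataset has already been normalized separately and is stored as a dict of gene identifier to per-sample values; the function name `merge_on_common_genes`, the gene symbols, and the numbers are purely illustrative.

```python
def merge_on_common_genes(datasets):
    """Merge per-dataset expression tables on the gene IDs they all share.

    datasets : dict mapping name -> dict of gene_id -> list of values
    Returns (merged, batch): merged maps gene_id -> concatenated values,
    batch gives the dataset name of every sample, in the same order.
    """
    names = list(datasets)
    # Gene identifiers present in every dataset
    common = set.intersection(*(set(datasets[n]) for n in names))
    merged = {g: [] for g in sorted(common)}
    batch = []
    for n in names:
        # Number of samples in this dataset, taken from any of its genes
        n_samples = len(next(iter(datasets[n].values())))
        batch.extend([n] * n_samples)
        for g in merged:
            merged[g].extend(datasets[n][g])
    return merged, batch

# Toy, already-normalized values invented for illustration only
datasets = {
    "dataset1": {"TP53": [7.1, 7.3], "MYC": [9.0, 8.8], "EGFR": [5.5, 5.7]},
    "dataset2": {"TP53": [6.9, 7.0, 7.2], "MYC": [9.4, 9.1, 9.3]},
}

merged, batch = merge_on_common_genes(datasets)
print(sorted(merged))
print(batch)
```

The returned `batch` list records which dataset each sample came from, which is the information a subsequent batch correction step and the diagnostic plots would need.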

To summarize, you could have a detailed look in the following excellent paper:

https://academic.oup.com/nar/article/43/12/e79/2902606

It is closely related to your post and might provide you with more information on this matter.

Hope that helps,

Efstathios