Question: Merging microarray datasets and different ways of building a classifier on them
17 months ago by
arronar160 wrote:


I'm interested in building a classifier for a merged dataset that combines the following:

dataset 1: 2x Control, 2x Treatment1, 2x Treatment2
dataset 2: 3x Control, 3x Treatment4, 3x Treatment5
dataset 3: 2x Control, 2x Treatment6, 3x Treatment7

That means that the final dataset will be:

7x Control, 2x Treatment1, 2x Treatment2, 3x Treatment4, 3x Treatment5, 2x Treatment6, 3x Treatment7

As you can see, Control is the only class that is represented in every batch of the final dataset, because it was the only sample type common to the 3 initial datasets.

First of all, I know this classifier is not going to be very powerful, since the final dataset isn't balanced, but I have some questions and would like to read your opinions and suggestions.
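To put numbers on the design described above, the sample layout can be tabulated per class and per dataset. The sketch below uses only the counts given in the post; the inverse-frequency weight `n_samples / (n_classes * count)` is one common heuristic for weighting rare classes and is an assumption here, not something the thread prescribes.

```python
from collections import Counter

# Sample layout copied from the post: (dataset, class) -> number of arrays.
design = {
    ("dataset1", "Control"): 2, ("dataset1", "Treatment1"): 2, ("dataset1", "Treatment2"): 2,
    ("dataset2", "Control"): 3, ("dataset2", "Treatment4"): 3, ("dataset2", "Treatment5"): 3,
    ("dataset3", "Control"): 2, ("dataset3", "Treatment6"): 2, ("dataset3", "Treatment7"): 3,
}

class_counts = Counter()
batches_per_class = {}
for (batch, label), n in design.items():
    class_counts[label] += n
    batches_per_class.setdefault(label, set()).add(batch)

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# Classes that appear in more than one dataset (only these can help
# separate a batch effect from a treatment effect).
shared = [c for c, b in batches_per_class.items() if len(b) > 1]

# Inverse-frequency class weights (a heuristic, assumed for illustration).
class_weights = {c: n_samples / (n_classes * k) for c, k in class_counts.items()}

for label in sorted(class_counts):
    print(label, class_counts[label], sorted(batches_per_class[label]),
          round(class_weights[label], 3))
print("total samples:", n_samples, "| shared classes:", shared)
```

Every treatment lives in exactly one dataset, so treatment and batch are confounded for all classes except Control.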

Q1: Under normal conditions, where the final dataset is balanced, researchers in the literature use many different methods to remove the so-called batch effect, such as median centering and ComBat. In those balanced cases you can run a PCA or a hierarchical clustering and see the difference among your samples before and after batch removal. But is this necessary in my case? Is it reasonable to run such methods on this unbalanced dataset, or can I just run a quantile normalization on the whole dataset without any batch-removal method?

Q2: The last question is about classifying such datasets with different methods (random forests, SVM, etc.). I noticed that in some articles researchers first find the differentially expressed genes (DEGs) and then run the classifiers on them, while others run the classification algorithms on all of the data without first finding the DEGs. Is there any real difference between these two approaches? Does anything change from a statistics/machine learning point of view?
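One practical difference between the two approaches in Q2 is where the gene selection happens relative to validation: if genes are picked using all samples and the classifier is then cross-validated, the held-out samples have already influenced the feature set. A minimal sketch of doing the selection inside each leave-one-out fold follows. The toy matrix, the simple between/within-class score, and the nearest-centroid classifier are all assumptions chosen to keep the example dependency-free; they stand in for a real DEG analysis and a real classifier.

```python
from statistics import mean

def score_features(X, y):
    """Score each feature by between-class spread over within-class spread."""
    labels = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        overall = mean(col)
        between = within = 0.0
        for c in labels:
            vals = [v for v, lab in zip(col, y) if lab == c]
            m = mean(vals)
            between += len(vals) * (m - overall) ** 2
            within += sum((v - m) ** 2 for v in vals)
        scores.append(between / (within + 1e-12))
    return scores

def top_k(scores, k):
    """Indices of the k highest-scoring features."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]

def nearest_centroid(X, y, feats, x_new):
    """Predict the class whose centroid (on the chosen features) is closest."""
    best, best_d = None, float("inf")
    for c in sorted(set(y)):
        rows = [row for row, lab in zip(X, y) if lab == c]
        centroid = [mean(r[j] for r in rows) for j in feats]
        d = sum((x_new[j] - m) ** 2 for j, m in zip(feats, centroid))
        if d < best_d:
            best, best_d = c, d
    return best

def loocv_accuracy(X, y, k):
    """Leave-one-out CV with feature selection repeated inside every fold."""
    hits = 0
    for i in range(len(X)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        feats = top_k(score_features(X_tr, y_tr), k)  # held-out sample never seen
        hits += nearest_centroid(X_tr, y_tr, feats, X[i]) == y[i]
    return hits / len(X)

# Made-up toy data: 6 samples x 3 "genes"; only gene 0 separates the classes.
X = [[0.0, 5.0, 1.0], [0.2, 4.0, 1.1], [0.1, 6.0, 0.9],
     [10.0, 5.0, 1.0], [10.3, 4.5, 1.2], [9.8, 5.5, 0.8]]
y = ["A", "A", "A", "B", "B", "B"]
print(loocv_accuracy(X, y, k=1))
```

With only two or three samples per class, as in the merged dataset here, each fold leaves very little to estimate from, so any accuracy figure obtained this way should be read with caution.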

Any advice, opinion, hint is welcome.

Thank you.

17 months ago by
svlachavas560 wrote:

Dear arronar,

This is a very interesting question, but quite challenging: there are many possible approaches, it could lead to a very long discussion, and the answer depends heavily on your biological question. I will try to recap quickly:

1) What kind of expression datasets do you have? Microarrays? RNA-Seq? Both?

2) How similar are the datasets in terms of platform and samples? For instance, same cell lines? Same microarray platforms? As you mentioned, are the control samples the only thing the 3 datasets have in common?

3) Concerning the concept of machine learning: what is your main rationale for merging the 3 datasets? That is, do you want to develop a binary classifier or a multi-label one? Or, more generally, a gene signature with high discriminatory efficiency? Because if you have different drug combinations or perturbagens, I don't see the point of merging the datasets. If you could provide more insight on this, the need for merging the datasets would be more evident.

4) Regarding batch effect correction: quantile normalization usually would not solve the problem, especially for machine learning approaches. You would still want to correct for batch effects, after first producing some diagnostic plots such as PCA, MDS, hierarchical clustering, etc. One general approach, for example: first preprocess and normalize each dataset separately, then merge on common gene identifiers, and then perform batch effect correction.

5) Moreover, the choice of feature selection method is up to you. You could start with your DE genes as an initial pool and then apply another feature selection methodology, such as an entropy-based criterion, lasso-penalized regression, and many others.

6) Also, keep in mind the fundamental issue of class imbalance. It is a common problem when you have many samples in one class and far fewer in the other classes. Some algorithms can handle this, but otherwise you would have to implement approaches to cope with class imbalance, such as resampling.
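The general approach in point 4 (normalize each dataset separately, merge on common gene identifiers, then correct for batch) could be sketched roughly as below. This is only an illustration under stated assumptions: the gene names and values are made up, each dataset is assumed to be already normalized, and per-gene median centering stands in for the batch correction step (it is the simpler of the two methods named in the question, not ComBat). The optional `reference` argument is a hypothetical addition that centers each batch on chosen samples only, for example the controls, which may be worth considering here because the controls are the only class present in every dataset.

```python
from statistics import median

def merge_on_common_genes(datasets):
    """Keep only genes measured in every dataset.

    datasets: {batch_name: {gene_id: [expression per sample]}}
    """
    common = set.intersection(*(set(expr) for expr in datasets.values()))
    genes = sorted(common)
    merged = {b: {g: list(expr[g]) for g in genes} for b, expr in datasets.items()}
    return genes, merged

def median_center(merged, reference=None):
    """Subtract a per-gene, per-batch median.

    reference: optional {batch_name: [sample indices]}; if given, the median
    is taken over those samples only (e.g. the controls).
    """
    centered = {}
    for batch, expr in merged.items():
        ref_idx = reference[batch] if reference else None
        centered[batch] = {}
        for gene, values in expr.items():
            ref_vals = [values[i] for i in ref_idx] if ref_idx is not None else values
            shift = median(ref_vals)
            centered[batch][gene] = [v - shift for v in values]
    return centered

# Made-up toy data: samples 0 and 1 are controls, sample 2 is treated.
datasets = {
    "dataset1": {"GENE_A": [5.0, 5.2, 7.0], "GENE_B": [3.0, 3.1, 2.9],
                 "GENE_X": [1.0, 1.0, 1.0]},   # GENE_X missing from dataset2
    "dataset2": {"GENE_A": [8.0, 8.2, 10.0], "GENE_B": [6.0, 6.1, 5.9]},
}

genes, merged = merge_on_common_genes(datasets)
centered_all = median_center(merged)
centered_ctrl = median_center(merged, reference={"dataset1": [0, 1], "dataset2": [0, 1]})
print(genes)
print(centered_ctrl["dataset1"]["GENE_A"], centered_ctrl["dataset2"]["GENE_A"])
```

Centering on all samples assumes the batches have a comparable composition; when each treatment occurs in only one batch, that step can also remove part of the treatment signal, which is why the diagnostic plots before and after correction matter.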

To summarize, you could take a detailed look at the following excellent paper:

It is closely related to your post and might give you more information on this matter.

Hope that helps,



First of all, thank you for your answer. I've already read that article and I really liked it. It's very informative, and it is the article that made me understand that finding DEGs and then running a classifier on them is not the only way; you can also just use the whole dataset. The difference is that in that paper they had a lot of samples (hundreds of them), while I have only 22, and elastic net is not going to work on such a small dataset.

But let me answer your questions now:

1) I thought I had mentioned it, but I was wrong (I had only added it in the tags). My datasets are from microarrays.

2) All datasets are from Affymetrix platforms except one, which is from Illumina's platform. As for the cell lines (good question, by the way): in one dataset, for example, there might be 3 control samples, 2 samples for treatment1 on cell line A, and 2 samples for treatment1 on cell line B. My thought was to ignore the cell lines (just use the treatment as the class) and see whether, in the end, the classifier groups the different cell lines together because of the shared treatment, even though in a PCA they would show up as two different clusters (I don't know if I explained it well). As I said in my initial post, only the control samples are the same across the three datasets. What I mean is that all controls (in all three datasets) are diseased cells (they have the same disease). But each dataset uses a different drug for this disease, which is why I said that if I merge the three into one dataset, it is going to be unbalanced.

3) The initial concept is to build a multi-label classifier that will be able to classify the different treatments. The first thought was to find all the DEGs (after doing what you describe in your point 4) and use them as my initial pool. Later I could use these DEGs to find possible side effects, common targets between them, etc. But after reading articles like the one you posted (where they don't calculate the DEGs), plus the fact that the merged dataset will be so unbalanced, I came here and asked this question to get some ideas and see what people do/suggest.

Thank you again for your quick answer. I hope I have answered some of your questions.

written 17 months ago by arronar160