Question: Cancer subtype classifier integrating microarray and RNA-seq data
5.3 years ago by
juncheng190 wrote:


I'm working on a cancer subtype classifier project, and I have a problem getting the classifiers built from microarray and RNA-seq data to agree with each other. However I build them, the classifiers from the two data sources disagree on about 20% of samples. (I know this because some samples were profiled on both platforms, and the two classifiers assign 20% of them to different subtypes. Each classifier performs well on its own data source.)

ADD COMMENTlink modified 5.3 years ago by raunakms1.1k • written 5.3 years ago by juncheng190
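
The 20% figure above can be made more informative by also looking at where the two classifiers disagree and how much agreement would be expected by chance. The following is a minimal sketch, not the poster's code; the function name `agreement_summary` and the toy labels are hypothetical, and it assumes the subtype labels from the two classifiers have already been matched to each other (cluster "A" means the same subtype on both platforms).

```python
from collections import Counter


def agreement_summary(labels_a, labels_b):
    """Compare two label lists for the same samples, given in the same order.

    Returns (observed agreement, Cohen's kappa, table of label pairs).
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Confusion table stored as {(label_from_a, label_from_b): count}.
    pairs = Counter(zip(labels_a, labels_b))
    observed = sum(c for (a, b), c in pairs.items() if a == b) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Agreement expected by chance if the two label sets were independent.
    expected = sum(freq_a[k] * freq_b.get(k, 0) for k in freq_a) / (n * n)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return observed, kappa, pairs


# Hypothetical calls for ten samples profiled on both platforms.
array_calls = ["A", "A", "B", "B", "C", "C", "D", "D", "A", "B"]
rnaseq_calls = ["A", "A", "B", "B", "C", "C", "D", "D", "B", "A"]
observed, kappa, pairs = agreement_summary(array_calls, rnaseq_calls)
disagreements = {pair: c for pair, c in pairs.items() if pair[0] != pair[1]}
print(observed, round(kappa, 3), disagreements)
```

If the disagreements concentrate in one or two subtype pairs, that points to a boundary between specific subtypes rather than a general platform effect.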

I don't see a question here, just a statement of what you've done. I assume the question is either (A) why might the classifier give such discordant results when trained on the different data types or (B) how might you try to avoid this issue. In either case, please update your post so we know what your actual question is.

ADD REPLYlink written 5.3 years ago by Devon Ryan91k

Thanks, I updated below.

ADD REPLYlink written 5.3 years ago by juncheng190

Hi Devon Ryan,

thanks a lot.

Sorry for being ambiguous. You pointed out both questions exactly. First, I'm surprised by the discordance, and second, I haven't found a way to resolve it so far.

ADD REPLYlink written 5.3 years ago by juncheng190
Most of the time, when you don't get a useful answer on Biostar within 24 hours, it is because the question should be improved, not because Biostar members don't know how to approach it. There are many possible answers to the questions Devon Ryan formulated. I think you should start at the beginning and ask yourself whether the Pearson correlations between normalized microarray log2 intensities and RNA-seq log2 FPKM values are good enough (> 0.95) to proceed. If they are not, you might want to optimize the preprocessing (mapping, read summarization, normalization, etc.) before continuing to the classification. If they are, then ask how to build a classifier, then how to make predictions with it, then how to compare the results of prediction models. Once you have done all that, we can discuss the questions you asked. Please don't feel offended; I'm just trying to help you.
ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by Irsan7.0k
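
The first check suggested above, the Pearson correlation between microarray log2 intensities and RNA-seq log2 FPKM for the same sample, could be sketched as follows. This is an illustration under assumptions: the nested-dictionary input layout, the function names, and the pseudocount of 1 are all hypothetical choices, and only samples and genes present on both platforms are compared.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when one variable is constant
    return sxy / math.sqrt(sxx * syy)


def per_sample_correlation(array_log2, rnaseq_fpkm, pseudocount=1.0):
    """Correlate platforms sample by sample over the genes they share.

    array_log2:  {sample: {gene: log2 intensity}}  (already log2)
    rnaseq_fpkm: {sample: {gene: FPKM}}            (not yet log-transformed)
    """
    result = {}
    for sample in sorted(set(array_log2) & set(rnaseq_fpkm)):
        genes = sorted(set(array_log2[sample]) & set(rnaseq_fpkm[sample]))
        x = [array_log2[sample][g] for g in genes]
        y = [math.log2(rnaseq_fpkm[sample][g] + pseudocount) for g in genes]
        result[sample] = pearson(x, y)
    return result


# Toy data: one shared sample, three shared genes.
array_log2 = {"s1": {"g1": 1.0, "g2": 2.0, "g3": 3.0}}
rnaseq_fpkm = {"s1": {"g1": 1.0, "g2": 3.0, "g3": 7.0}}
print(per_sample_correlation(array_log2, rnaseq_fpkm))
```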

Hi Irsan,

thanks for pointing out. It is very helpful indeed.

I got the log2-transformed microarray data and the RNA-seq expression data from a public database. The correlation between the two datasets is around 0.72. Can the correlation between the two platforms really reach as high as 0.95? It seems likely that the low correlation is why I cannot get the classifiers to agree. Am I right?


Histogram of the RNA-seq data, log2(x+1) transformed. I already filtered out non-expressed genes, but there is still a large peak at 0. This is because some genes are expressed in only one or two samples and are 0 in most samples.


Histogram of the microarray data.



ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by juncheng190
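
One way to reduce the peak at 0 described above is to filter on the fraction of samples in which a gene is detected, rather than on whether it is detected at all. The sketch below is only an illustration; the function name and the thresholds `min_value` and `min_fraction` are hypothetical and would need tuning for real data.

```python
import math


def filter_and_log(expr, min_value=1.0, min_fraction=0.2):
    """Keep genes detected in enough samples, then apply log2(x + 1).

    expr: {gene: [expression value per sample]} on the original scale.
    A gene is kept if at least min_fraction of its samples have a value
    of min_value or more.
    """
    kept = {}
    for gene, values in expr.items():
        if not values:
            continue
        detected = sum(1 for v in values if v >= min_value)
        if detected / len(values) >= min_fraction:
            kept[gene] = [math.log2(v + 1.0) for v in values]
    return kept


# Toy data: geneA is expressed in one of five samples, geneB in four.
expr = {
    "geneA": [0.0, 0.0, 0.0, 0.0, 15.0],
    "geneB": [3.0, 7.0, 1.0, 0.0, 15.0],
}
print(filter_and_log(expr, min_value=1.0, min_fraction=0.5))
```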
Good job! I haven't compared RNA-seq and microarray data before, so 0.95 was just a guess. Judging from the scatter plots, I think the non-expressed genes in particular decrease the correlation; 0.72 including these non-expressed genes seems like relatively good agreement to me. I suspect that if you calculate the correlation between RNA-seq and microarray for each gene and make a histogram, you will see two peaks: one for genes with low correlation and one for genes with high correlation. The question is whether the bad genes are in your classifier. (By the way, how did you classify?)
ADD REPLYlink written 5.3 years ago by Irsan7.0k
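
The per-gene check suggested above could look roughly like this: correlate each gene across the samples profiled on both platforms, then see which classifier genes fall below a cutoff. It is a sketch under assumptions; the function names, the input layout, and the cutoff of 0.5 are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation; NaN when either sequence is constant."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def per_gene_correlation(array, rnaseq):
    """array, rnaseq: {gene: [value per shared sample, same sample order]}."""
    shared = sorted(set(array) & set(rnaseq))
    return {g: pearson(array[g], rnaseq[g]) for g in shared}


def poorly_correlated(corr, classifier_genes, cutoff=0.5):
    """Classifier genes whose cross-platform correlation is below cutoff.

    `not corr[g] >= cutoff` also flags genes with an undefined (NaN) value.
    """
    return sorted(g for g in classifier_genes if g in corr and not corr[g] >= cutoff)


# Toy data for four shared samples.
array = {"g1": [1, 2, 3, 4], "g2": [1, 2, 3, 4], "g3": [5, 5, 5, 5]}
rnaseq = {"g1": [2, 4, 6, 8], "g2": [4, 1, 3, 2], "g3": [1, 2, 3, 4]}
corr = per_gene_correlation(array, rnaseq)
print(poorly_correlated(corr, ["g1", "g2", "g3"]))
```

The values in `corr` are what would go into the histogram; the flagged list answers the question of whether the poorly correlated genes are in the classifier.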

Thanks again. The distribution of per-gene correlations actually has only one peak. Following your idea, I ranked the genes by correlation and chose only the highly correlated genes for classification, but I still cannot end up with a classifier with a low error rate.

PS: how I did the classification

For both the microarray and the RNA-seq classifier, I first log-transform the preprocessed expression data, then apply quantile normalization and a Z-transform. Then I select genes for consensus clustering, using the MAD (median absolute deviation) as the indicator, so that a reasonable number of genes is selected.

From the consensus clustering, I decided to classify the expression data into 4 groups. Using the group labels from the clustering, I train a classifier with PAM. Of course, the classifier is trained on far fewer genes (around 500), selected from the gene set used for clustering.

ADD REPLYlink written 5.3 years ago by juncheng190
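
Part of the workflow described above can be sketched in a few lines: MAD-based gene selection, a per-gene Z-transform, and a nearest-centroid classifier. This is a simplified stand-in, not the poster's pipeline. It assumes PAM here means Prediction Analysis for Microarrays (nearest shrunken centroids) and deliberately omits the centroid shrinkage step, quantile normalization, and consensus clustering. All function names and the toy data are hypothetical, and the Z-transform is assumed to be applied per gene within one dataset.

```python
import statistics


def mad(values):
    """Median absolute deviation of a list of numbers."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


def top_mad_genes(expr, n_genes):
    """Return the n_genes most variable genes by MAD (ties broken by name)."""
    return sorted(expr, key=lambda g: (-mad(expr[g]), g))[:n_genes]


def zscore(values):
    """Centre and scale one gene across samples; constant genes become 0."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd if sd > 0 else 0.0 for v in values]


def fit_centroids(expr, genes, labels):
    """Mean z-scored profile per class over the selected genes (no shrinkage)."""
    z = {g: zscore(expr[g]) for g in genes}
    centroids = {}
    for cls in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        centroids[cls] = [statistics.fmean([z[g][i] for i in idx]) for g in genes]
    return centroids


def predict(profile, centroids):
    """Assign a z-scored sample profile to the closest centroid (squared distance)."""
    def dist(cls):
        return sum((p - m) ** 2 for p, m in zip(profile, centroids[cls]))
    return min(centroids, key=dist)


# Toy data: three genes, four samples, two classes from a prior clustering.
expr = {"g1": [1, 1, 5, 5], "g2": [5, 5, 1, 1], "g3": [3, 3, 3, 3.1]}
labels = ["A", "A", "B", "B"]
genes = top_mad_genes(expr, 2)
centroids = fit_centroids(expr, genes, labels)
print(genes, predict([-0.8, 0.9], centroids))
```

To apply centroids trained on one platform to the other, the new profile would need to be z-scored within its own dataset and restricted to the same genes in the same order.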
5.3 years ago by
Vancouver, BC, Canada
raunakms1.1k wrote:

I hope this paper will be useful for you: "A comprehensive comparison of RNA-Seq-based transcriptome analysis from reads to differential gene expression and cross-comparison with microarrays: a case study in Saccharomyces cerevisiae".

ADD COMMENTlink written 5.3 years ago by raunakms1.1k