Cancer subtype classifier: integrating microarray and RNA-seq data
9.9 years ago
juncheng ▴ 220

Hi,

I'm working on a cancer subtype classifier project, and I'm having trouble getting the classifiers built from microarray and RNA-seq data to agree with each other. In any case, the classifiers built from the two data sources disagree on about 20% of samples. (I know this because some samples were profiled on both platforms, and the two classifiers assign 20% of them to different subtypes. Each classifier also performs well on its own data source.)
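
To put a number on the disagreement described above, one option is to cross-tabulate the two sets of subtype calls for the shared samples and compute the raw agreement together with Cohen's kappa, which corrects for agreement expected by chance. This is only a sketch: the function name `agreement_summary` and the subtype calls are made up for illustration and are not from the post.

```python
from collections import Counter

def agreement_summary(labels_a, labels_b):
    """Compare two label lists for the same samples, given in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Cross-tabulation: (call from classifier A, call from classifier B) -> count
    table = Counter(zip(labels_a, labels_b))
    observed = sum(c for (a, b), c in table.items() if a == b) / n
    # Agreement expected by chance, from the marginal label frequencies
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return observed, kappa, table

# Hypothetical subtype calls for 10 shared samples (made-up data)
array_calls = ["A", "A", "B", "B", "C", "C", "D", "D", "A", "B"]
rnaseq_calls = ["A", "A", "B", "C", "C", "C", "D", "D", "A", "A"]

observed, kappa, table = agreement_summary(array_calls, rnaseq_calls)
print(f"agreement = {observed:.2f}, kappa = {kappa:.2f}")
# List only the discordant cells, to see which subtypes get confused
for (a, b), count in sorted(table.items()):
    if a != b:
        print(f"array={a} rnaseq={b}: {count}")
```

Looking at which off-diagonal cells are populated can show whether the 20% is spread evenly or concentrated between two particular subtypes.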

RNA-Seq microarray cancer subtypes • 4.0k views

I don't see a question here, just a statement of what you've done. I assume the question is either (A) why might the classifier give such discordant results when trained on the different data types or (B) how might you try to avoid this issue. In either case, please update your post so we know what your actual question is.


Thanks, I updated below.


Hi Devon Ryan,

thanks a lot.

Sorry for being ambiguous. You are exactly right, I mean both. First, I'm surprised by the discordance, and second, I haven't found a way to resolve it so far.

Most of the time, when you don't get a useful answer on Biostar within 24 hours, it is because the question needs to be improved, not because Biostar members don't know how to approach it. There are many possible answers to the questions Devon Ryan formulated. I think you should start at the beginning and ask yourself whether the Pearson correlations between normalized microarray log2 intensities and RNA-seq log2 FPKM values are good enough (> 0.95) to proceed. If they are not, you might want to optimize the preprocessing (mapping, read summarization, normalization, etc.) before moving on to classification. If they are, then ask how to build a classifier, then how to make predictions with it, then how to compare the results of prediction models. Once you have done all that, we can discuss the questions you asked. Please don't feel offended, I'm just trying to help you.
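
As a rough illustration of that first check, the sketch below correlates one sample's microarray profile with its RNA-seq profile over the genes measured on both platforms. It is written in plain Python so it runs without dependencies. The function names, the expression values, and the `+ 1` pseudocount before taking log2 of FPKM are assumptions for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists; NaN if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)

def sample_correlation(array_log2, rnaseq_fpkm):
    """Correlate one sample across platforms.

    array_log2:  dict gene -> normalized log2 intensity
    rnaseq_fpkm: dict gene -> FPKM (not yet log-transformed)
    Only genes present on both platforms are used.
    """
    shared = sorted(set(array_log2) & set(rnaseq_fpkm))
    x = [array_log2[g] for g in shared]
    # Pseudocount of 1 is an assumption; it avoids log2(0) for unexpressed genes
    y = [math.log2(rnaseq_fpkm[g] + 1) for g in shared]
    return pearson(x, y), len(shared)

# Made-up values for a single sample profiled on both platforms
array_profile = {"TP53": 9.1, "MYC": 11.3, "GAPDH": 13.8, "ESR1": 6.2, "XIST": 5.0}
rnaseq_profile = {"TP53": 20.0, "MYC": 85.0, "GAPDH": 900.0, "ESR1": 1.5, "KRT5": 3.0}

r, n_shared = sample_correlation(array_profile, rnaseq_profile)
print(f"r = {r:.3f} over {n_shared} shared genes")
```

Running this per sample gives one correlation per matched pair, which makes it easy to spot individual samples that agree poorly across platforms.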

Hi Irsan,

Thanks for pointing that out. It is very helpful indeed.

I got the log2-transformed microarray data and the RNA-seq expression data from a public database. The correlation between the two datasets is around 0.72. Can the correlation between the two platforms really be as high as 0.95? It seems likely that the classifiers don't agree well because of this low correlation. Am I right?

Rplot

Histogram of the RNA-seq data, log2(x+1) transformed. I have already filtered out non-expressed genes, but there is still a large peak at 0. This is because some genes are expressed in only one or two samples and are 0 in most samples.
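
One way to reduce that peak at 0 is to require a gene to be expressed in a minimum fraction of samples rather than in at least one. The sketch below shows the idea. The function name `filter_and_log`, the thresholds `min_value` and `min_fraction`, and the gene values are all hypothetical and would need tuning on real data.

```python
import math

def filter_and_log(expr, min_value=1.0, min_fraction=0.2):
    """Keep genes expressed in enough samples, then apply log2(x + 1).

    expr: dict gene -> list of untransformed expression values, one per sample
    A gene is kept if its value is >= min_value in at least
    min_fraction of the samples (both thresholds are assumptions).
    """
    kept = {}
    for gene, values in expr.items():
        n_expressed = sum(v >= min_value for v in values)
        if n_expressed >= min_fraction * len(values):
            kept[gene] = [math.log2(v + 1) for v in values]
    return kept

# Made-up expression values for three genes across five samples
expr = {
    "GENE_A": [0.0, 0.0, 0.0, 0.0, 12.0],  # expressed in one sample only
    "GENE_B": [5.0, 8.0, 0.0, 3.0, 7.0],   # expressed in most samples
    "GENE_C": [0.0, 0.0, 0.0, 0.0, 0.0],   # never expressed
}

kept = filter_and_log(expr, min_value=1.0, min_fraction=0.5)
print(sorted(kept))
```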

Rplot01

Histogram of the microarray data

Rplot06

Good job! I haven't compared RNA-seq and microarray data before, so 0.95 was just a guess. Judging from the scatter plots, I think it is mainly the non-expressed genes that lower the correlation; 0.72 including these non-expressed genes seems like relatively good agreement to me. I suspect that if you calculate the correlation between RNA-seq and microarray for each gene and make a histogram, you will see two peaks: one for genes with low correlation and one for genes with high correlation. The question is whether the bad genes are in your classifier. (By the way, how did you classify?)
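
The per-gene check suggested here could look roughly like the following: for each gene measured on both platforms, correlate its values across the matched samples, then bin the correlations to see how many peaks there are. The function names and the toy data are assumptions for illustration, and both inputs are assumed to list the same samples in the same order.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists; NaN if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)

def per_gene_correlation(array_expr, rnaseq_expr):
    """dict gene -> list of values per sample (same sample order on both platforms)."""
    result = {}
    for gene in sorted(set(array_expr) & set(rnaseq_expr)):
        r = pearson(array_expr[gene], rnaseq_expr[gene])
        if not math.isnan(r):  # skip genes that are constant on either platform
            result[gene] = r
    return result

def histogram(values, n_bins=10):
    """Count values in equal-width bins covering [-1, 1]."""
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v + 1) / 2 * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts

# Toy data: three genes, four matched samples (made-up values)
array_expr = {"G1": [5, 6, 7, 8], "G2": [5, 6, 7, 8], "G3": [4, 4, 4, 4]}
rnaseq_expr = {"G1": [1, 2, 3, 4], "G2": [3, 1, 4, 2], "G3": [1, 2, 3, 4]}

result = per_gene_correlation(array_expr, rnaseq_expr)
counts = histogram(result.values())
print(result)
print(counts)
```

The resulting dictionary can also be intersected with the classifier's gene list to see whether poorly correlated genes are among the ones the classifier relies on.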

Thanks again. The distribution of per-gene correlations actually has only one peak. Following your idea, I ranked the genes by correlation and used only the highly correlated genes for classification, but I still could not get a classifier with a low error rate.

PS: how I did the classification

For both the microarray and the RNA-seq classifier, I first log-transform the preprocessed expression data, then apply quantile normalization and a Z-transform. Then I select genes for consensus clustering, using the MAD (median absolute deviation) as the indicator, so that a reasonable number of genes is selected.
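
The preprocessing steps above can be sketched as follows, in plain Python with genes as rows and samples as columns. Several details are assumptions and not from the post: the `+ 1` pseudocount, breaking ties by row order in the quantile normalization, z-scoring per gene across samples, and computing the MAD before the z-transform (after z-scoring every gene is on the same scale, so a MAD ranking would say little about variability). All function names and values are hypothetical.

```python
import math
import statistics

def quantile_normalize(matrix):
    """Give every sample (column) the same value distribution."""
    n_genes, n_samples = len(matrix), len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_samples)]
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across samples
    reference = [sum(sc[k] for sc in sorted_cols) / n_samples for k in range(n_genes)]
    out = [[0.0] * n_samples for _ in range(n_genes)]
    for j, col in enumerate(columns):
        order = sorted(range(n_genes), key=lambda i: col[i])
        for rank, i in enumerate(order):
            out[i][j] = reference[rank]  # ties broken by row order (simplification)
    return out

def zscore_rows(matrix):
    """Center and scale each gene (row) across samples."""
    out = []
    for row in matrix:
        mu = statistics.fmean(row)
        sd = statistics.pstdev(row)
        out.append([(v - mu) / sd if sd > 0 else 0.0 for v in row])
    return out

def mad(values):
    """Median absolute deviation from the median."""
    med = statistics.median(values)
    return statistics.median(abs(v - med) for v in values)

def top_mad_genes(genes, matrix, n_top):
    """Names of the n_top genes with the largest MAD."""
    ranked = sorted(zip(genes, matrix), key=lambda gm: mad(gm[1]), reverse=True)
    return [g for g, _ in ranked[:n_top]]

def preprocess(genes, raw, n_top):
    logged = [[math.log2(v + 1) for v in row] for row in raw]
    qn = quantile_normalize(logged)
    selected = set(top_mad_genes(genes, qn, n_top))  # MAD before z-scoring (assumption)
    keep = [i for i, g in enumerate(genes) if g in selected]
    return [genes[i] for i in keep], zscore_rows([qn[i] for i in keep])

# Made-up expression values: four genes, three samples
genes = ["G1", "G2", "G3", "G4"]
raw = [[7, 15, 3], [1, 0, 1], [31, 3, 63], [3, 7, 0]]
kept, z = preprocess(genes, raw, n_top=2)
print(kept)
print(z)
```

If the two platforms are normalized separately like this, each gene's z-scores are relative to its own platform's samples, which matters when a classifier trained on one platform is applied to the other.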

From the consensus clustering, I decided to split the expression data into 4 groups. Using the cluster assignments as labels, I train a classifier with PAM. Of course, far fewer genes are used to train the classifier (around 500) than were selected for clustering.
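
Assuming PAM here means Prediction Analysis of Microarrays (nearest shrunken centroids), the sketch below is not PAM itself but a simpler stand-in: a plain nearest-centroid classifier, without PAM's centroid shrinkage or per-gene variance standardization. It only illustrates the cross-platform test of training on one platform's cluster labels and predicting the other platform's samples. All names and values are made up.

```python
import math
from collections import defaultdict

def fit_centroids(samples, labels):
    """Mean expression vector per class.

    samples: list of per-sample feature vectors (z-scored gene values)
    labels:  cluster label for each sample
    """
    groups = defaultdict(list)
    for vec, lab in zip(samples, labels):
        groups[lab].append(vec)
    return {
        lab: [sum(col) / len(vecs) for col in zip(*vecs)]
        for lab, vecs in groups.items()
    }

def predict(centroids, vec):
    """Label of the centroid closest to vec (Euclidean distance)."""
    return min(centroids, key=lambda lab: math.dist(vec, centroids[lab]))

# Made-up z-scores for two genes; labels stand in for consensus clusters
array_samples = [[1.2, -0.8], [0.9, -1.1], [-1.0, 0.9], [-1.1, 1.0]]
array_labels = ["S1", "S1", "S2", "S2"]
centroids = fit_centroids(array_samples, array_labels)

# Hypothetical RNA-seq profiles, preprocessed the same way
rnaseq_samples = [[0.8, -0.5], [-0.3, 0.4], [0.1, -0.2]]
predictions = [predict(centroids, vec) for vec in rnaseq_samples]
print(predictions)
```

Comparing these cross-platform predictions with the labels from the platform's own classifier is one way to see whether the disagreement comes from the classifier or from the data scaling.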
