Cancer subtype classifier integrating microarray and RNA-seq data
8.1 years ago
juncheng ▴ 200

Hi,

I'm currently working on a cancer subtype classifier project. I'm having trouble getting the classifiers built from microarray and RNA-seq data to agree with each other. Whatever I try, the classifiers built from the two data sources disagree on about 20% of samples. (I know this because I have some samples in common, and the two classifiers classified 20% of them differently. Both classifiers also performed well on their own data source.)
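The 20% figure above can be checked with a small script. This is a minimal sketch, not the poster's code: the sample IDs and subtype calls are made up, and it assumes the subtype labels from the two classifiers have already been matched to each other (cluster labels from two independent clusterings need not correspond).

```python
from collections import Counter


def agreement_report(labels_a, labels_b):
    """Compare two classifiers' calls on the samples they share.

    labels_a, labels_b: dicts mapping sample ID -> predicted subtype.
    Returns (number of shared samples, disagreement fraction,
    Counter of (label_a, label_b) pairs).
    """
    shared = sorted(set(labels_a) & set(labels_b))
    if not shared:
        raise ValueError("no samples in common")
    pairs = Counter((labels_a[s], labels_b[s]) for s in shared)
    n_disagree = sum(n for (a, b), n in pairs.items() if a != b)
    return len(shared), n_disagree / len(shared), pairs


# Hypothetical calls, for illustration only.
array_calls = {"S1": "A", "S2": "B", "S3": "C", "S4": "D", "S5": "A"}
rnaseq_calls = {"S1": "A", "S2": "B", "S3": "D", "S4": "D", "S5": "A", "S6": "C"}

n_shared, disagreement, pairs = agreement_report(array_calls, rnaseq_calls)
print(f"{n_shared} shared samples, {disagreement:.0%} called differently")
for (a, b), n in sorted(pairs.items()):
    print(f"microarray={a} rnaseq={b}: {n}")
```

The pair counts show which subtypes get confused with each other, which can be more informative than the overall rate.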

RNA-Seq microarray cancer subtypes • 3.4k views

I don't see a question here, just a statement of what you've done. I assume the question is either (A) why might the classifier give such discordant results when trained on the different data types or (B) how might you try to avoid this issue. In either case, please update your post so we know what your actual question is.


Thanks, I updated below.


Hi Devon Ryan,

thanks a lot.

Sorry for being ambiguous. Exactly, I think you pointed out both. I'm surprised by the discordance, and so far I can't find a way to resolve it.


Hi Irsan,

thanks for pointing that out. It is very helpful indeed.

I got the log2-transformed microarray data and the RNA-seq expression data from a public database. The correlation between the two datasets is around 0.72. Can the correlation between the two platforms really reach as high as 0.95? It seems likely that I can't get the classifiers to agree because of the low correlation. Am I right?

Histogram of the RNA-seq data, log2(x+1) transformed. I have already filtered out non-expressed genes, but there is still a large peak at 0. This is because some genes are expressed in only one or two samples and are 0 in most samples.

Histogram of the microarray data

Good job! I haven't compared RNA-seq and microarray data before, so 0.95 was just a guess. Judging from the scatter plots, I think it is mainly the non-expressed genes that lower the correlation; 0.72 including these non-expressed genes seems like relatively good agreement to me. I suspect that if you calculate the correlation between RNA-seq and microarray for each gene and make a histogram, you will see two peaks: one for genes with low correlation and one for genes with high correlation. The question is whether the bad genes are in your classifier. (By the way, how did you classify?)
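The per-gene correlation check suggested here could look roughly like the following. It is a sketch under assumptions: both platforms are given as hypothetical dicts mapping a gene to its values over the same shared samples in the same order, Pearson correlation is used (the thread does not say which correlation was meant), and the toy numbers are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists; NaN if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def per_gene_correlation(array_expr, rnaseq_expr):
    """Correlate each shared gene across samples between the two platforms."""
    genes = sorted(set(array_expr) & set(rnaseq_expr))
    return {g: pearson(array_expr[g], rnaseq_expr[g]) for g in genes}


def histogram(values, n_bins=10, lo=-1.0, hi=1.0):
    """Count values into equal-width bins on [lo, hi], skipping NaN."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        if math.isnan(v):
            continue
        idx = min(max(int((v - lo) / width), 0), n_bins - 1)
        counts[idx] += 1
    return counts


# Toy data, for illustration only: gene -> values over 4 shared samples.
array_expr = {"G1": [1.0, 2.0, 3.0, 4.0], "G2": [5.0, 5.5, 6.0, 7.0], "G3": [2.0, 2.0, 2.0, 2.0]}
rnaseq_expr = {"G1": [2.0, 4.1, 5.9, 8.0], "G2": [3.0, 1.0, 2.5, 0.5], "G3": [0.0, 0.0, 1.0, 0.0]}

gene_corr = per_gene_correlation(array_expr, rnaseq_expr)
bin_counts = histogram(gene_corr.values())
print(gene_corr)
print(bin_counts)
```

Genes that are constant on one platform (such as non-expressed genes) come out as NaN here and are left out of the histogram, so they should be counted separately.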

Thanks again. The gene correlation actually has only one peak. Based on your idea, I ranked the genes by correlation and chose only the highly correlated genes for classification, but I still can't end up with a good classifier with a low error rate.

PS: how I did the classification

For both the microarray and the RNA-seq classifier, I first log-transform the preprocessed expression data, then apply quantile normalization and a Z-transform. Then I select genes for consensus clustering, using the MAD (median absolute deviation) value as an indicator, so that a reasonable number of genes is selected.

From the consensus clustering, I decided to classify the expression data into 4 groups. Based on the cluster assignments, I train a classifier with PAM. Of course, far fewer genes than were selected for clustering are used to train the classifier (around 500).
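The preprocessing steps described above can be sketched as follows. This is an illustration, not the poster's pipeline: the matrix layout (rows are genes, columns are samples), the log2(x+1) form, and all function names are assumptions. The post does not say whether the Z-transform is per gene or per sample; the sketch assumes per gene, and it ranks genes by MAD before z-scoring, because per-gene z-scoring puts every gene on the same scale.

```python
import math
import statistics


def log2_plus1(matrix):
    """log2(x + 1) of every value; rows are genes, columns are samples."""
    return [[math.log2(v + 1.0) for v in row] for row in matrix]


def quantile_normalize(matrix):
    """Give every sample (column) the same distribution of values."""
    n_genes, n_samples = len(matrix), len(matrix[0])
    columns = [[matrix[g][s] for g in range(n_genes)] for s in range(n_samples)]
    # order[s] lists the gene indices of sample s from lowest to highest value.
    order = [sorted(range(n_genes), key=col.__getitem__) for col in columns]
    # Reference distribution: mean across samples of the value at each rank.
    reference = [
        sum(columns[s][order[s][r]] for s in range(n_samples)) / n_samples
        for r in range(n_genes)
    ]
    out = [[0.0] * n_samples for _ in range(n_genes)]
    for s in range(n_samples):
        for rank, g in enumerate(order[s]):
            out[g][s] = reference[rank]  # ties are broken by gene order
    return out


def z_transform(matrix):
    """Per-gene z-score across samples (constant genes become all zeros)."""
    out = []
    for row in matrix:
        mean = statistics.fmean(row)
        sd = statistics.pstdev(row)
        out.append([(v - mean) / sd if sd > 0 else 0.0 for v in row])
    return out


def mad(values):
    """Median absolute deviation from the median."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


def top_mad_genes(matrix, n_top):
    """Indices of the n_top most variable genes by MAD."""
    scores = [mad(row) for row in matrix]
    ranked = sorted(range(len(matrix)), key=lambda g: scores[g], reverse=True)
    return sorted(ranked[:n_top])


# Toy expression values, for illustration only (4 genes x 4 samples).
raw = [[0, 1, 3, 7], [15, 31, 63, 127], [3, 3, 3, 3], [1, 0, 7, 3]]
normalized = quantile_normalize(log2_plus1(raw))
selected = top_mad_genes(normalized, 1)
z_scores = z_transform([normalized[g] for g in selected])
print(selected, z_scores)
```

If the two platforms are normalized separately like this, each ends up on its own scale, so it may be worth checking whether the same genes are selected on both platforms.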

8.1 years ago
raunakms ★ 1.1k