Select patients for cancer classifier
1
1
Entering edit mode
9.9 years ago
juncheng ▴ 220

I read a paper on cancer subtype classification of glioblastoma. After doing unsupervised clustering on the gene expression data, they select only the samples (patients) that have a positive silhouette score for the subsequent supervised classification. In my view this is not a good approach, and may even be wrong: the classifier might be heavily overfitted to the selected samples (patients).

What do you think? Is it correct to choose training samples based on their performance in unsupervised clustering?

paper link
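For concreteness, here is a minimal sketch of the selection step I am describing (Python with scikit-learn; the expression matrix X, the number of subtypes k, and KMeans as the clustering method are all placeholder assumptions, not the paper's actual pipeline):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_samples

    rng = np.random.default_rng(0)
    X = rng.normal(size=(202, 1000))      # placeholder expression matrix: 202 samples x 1000 genes

    k = 4                                 # assumed number of subtypes
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    sil = silhouette_samples(X, labels)   # per-sample silhouette score in [-1, 1]
    core = sil > 0                        # keep only the well-clustered samples

    X_train, y_train = X[core], labels[core]   # training set for the supervised classifier

My worry is that X_train is, by construction, the easy part of the data.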

Cancer-classification machine-learning • 2.4k views
ADD COMMENT
1
Entering edit mode

Do you have a link to the paper? Without more details it is hard to comment. In general, yes, if a subset of samples with strong class distinctions were selected for training, that sounds like it could be a dubious approach, likely to perform poorly. At the end of the day, any classifier needs to be validated on a suitable independent dataset that was not used for training. Do the authors of the paper show that validation?

ADD REPLY
0
Entering edit mode

Thanks, see the update for paper links.

It is quite an established paper. The training and CV data all come from a consensus clustering of 202 samples. It is impossible to "validate" directly on an independent dataset, because the "Y" response (the subtype) itself comes from the consensus clustering. What they did for validation is a heatmap of the genes they selected for the classifier: the heatmap of the validation data looks similar to that of the training data (see Figure 2 of the paper).
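For context, a rough sketch of what consensus clustering does here (my own simplification, assuming KMeans as the base method and subsampling; the paper's exact procedure may differ):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(202, 1000))      # placeholder for the 202-sample expression matrix
    n, k, n_runs = X.shape[0], 4, 50
    co = np.zeros((n, n))                 # times a pair landed in the same cluster
    cnt = np.zeros((n, n))                # times a pair was subsampled together

    for _ in range(n_runs):
        idx = rng.choice(n, size=int(0.8 * n), replace=False)
        lab = KMeans(n_clusters=k, n_init=10).fit_predict(X[idx])
        same = lab[:, None] == lab[None, :]
        co[np.ix_(idx, idx)] += same
        cnt[np.ix_(idx, idx)] += 1

    consensus = co / np.maximum(cnt, 1)   # pairwise consensus matrix in [0, 1]
    # the final subtype labels come from clustering this matrix, so there is
    # no external ground truth "Y" to validate the labels against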

ADD REPLY
2
Entering edit mode
9.9 years ago
Ahill ★ 1.9k

Having scanned this paper, the approach is reasonable and not wrong. There is nothing wrong with using a silhouette statistic to select representative samples here, and Figure 2B makes the case that overfitting did not happen. This is a class discovery approach. I agree it's not possible to unambiguously "validate" on a test set much beyond what the authors have done, since there is no accepted external definition of the subtypes. They make a good case that in the separate validation set (Figure 2B) the distribution of subtypes is similar to that in their core sample set and, more importantly, that the subtypes correlate with genetic and other clinically relevant factors in a validation set that was not used to define the subtypes, if I read correctly. The subtypes appear to provide a clinically or scientifically useful class assignment: Figure 4 suggests scientific utility, and Figure 5 a possible clinical utility (e.g. Proneural patients do not respond differentially to aggressive therapy).

ADD COMMENT
1
Entering edit mode

I agree. What I'm not convinced by in the paper is the error rate. They got a high CV error (8.9% or more) on all samples, which improved to 4.6% on the 173 core samples. I suspect that if you apply the classifier to a new dataset, the error rate will still be 8.9% or more, because there is no way to pre-select patients in a new dataset. This is also not easy to prove.
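To make that concern concrete, here is a synthetic sketch (my own illustration, not the paper's pipeline; the linear SVM, the KMeans labels, and all sizes and effect strengths are assumptions) comparing CV error on all samples against CV error on a silhouette-filtered core subset:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_samples
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(202, 50))
    X[:101, :5] += 1.2                        # signal confined to a few features

    # cluster on all features; the "subtypes" are the cluster labels
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    core = silhouette_samples(X, labels) > 0  # the core-sample filter

    # the classifier, like a gene signature, sees only a feature subset
    Xs = X[:, :10]
    clf = SVC(kernel="linear")
    err_all = 1 - cross_val_score(clf, Xs, labels, cv=5).mean()
    err_core = 1 - cross_val_score(clf, Xs[core], labels[core], cv=5).mean()
    print(f"CV error, all samples:  {err_all:.3f}")
    print(f"CV error, core samples: {err_core:.3f}")

On data like this the core-sample CV error tends to be lower, but a new cohort arrives unfiltered, so the all-sample figure is the more honest estimate of real-world performance.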

ADD REPLY
