Gene Expression Pam Classification Reproducibility Question

10.1 years ago

Mattias Aine ▴ 620

I'm trying to reproduce the published classification of a tumor set using pam (the pamr package) in R.

I have a data set obtained from the authors of a recent study.

The authors perform consensus clustering (ConsensusClusterPlus package, CCP below) to derive stable subtypes, and then use that subtype assignment to train a classification gene signature with pam.

Using CCP with parameters from the paper I can get a 2-group split with the right number of tumors in both clusters (no RNG-seed was reported in the paper though).

When I use that cluster-split for training with the threshold-parameter from the paper, I get back the correct gene signature with all parameters exactly equal to those published in the supplement of the paper in question.

Using the pamr.predict-function on the data I can also get cluster designations for each tumor sample from pam.

However, the paper shows a cross-table of the CCP cluster designations against the pam designations, and it does not agree with what I see. The CCP assignments appear to be right, but the pam classification is off by 4 samples.
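To make the comparison above concrete, here is a minimal sketch of the cross-table check, written in plain Python rather than R so that it is self-contained. The sample IDs and label vectors are made-up toy values, not data from the paper, and the helper names `cross_table` and `discordant_samples` are hypothetical.

```python
from collections import Counter


def cross_table(labels_a, labels_b):
    """Count samples for every (label_a, label_b) combination."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    return Counter(zip(labels_a, labels_b))


def discordant_samples(sample_ids, labels_a, labels_b):
    """Return the IDs of samples whose two labels disagree."""
    return [s for s, a, b in zip(sample_ids, labels_a, labels_b) if a != b]


# Hypothetical toy data: six tumors, two clusters.
sample_ids = ["T1", "T2", "T3", "T4", "T5", "T6"]
ccp_calls = [1, 1, 1, 2, 2, 2]   # stand-in for consensus cluster calls
pam_calls = [1, 1, 2, 2, 2, 1]   # stand-in for pamr.predict class calls

table = cross_table(ccp_calls, pam_calls)
mismatches = discordant_samples(sample_ids, ccp_calls, pam_calls)

for (ccp, pam), n in sorted(table.items()):
    print(f"CCP={ccp} PAM={pam}: {n}")
print("discordant:", mismatches)
```

Listing the discordant sample IDs, and not only the cell counts, shows whether the samples that move between cells are the same ones from run to run. Note that cluster numbers from a clustering run are arbitrary, so the two label vectors may need to be matched up before they are compared.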

Is pam not a completely deterministic classifier for a given threshold, or is there something I have missed?

Are there parameters downstream of fixing the cutoff-parameter (number of discriminating genes) that influence the cluster designations?

It is unlikely that another 2-group CCP solution used for training would be the right answer, as that would change the pam-derived gene signature. To be sure, I ran CCP 500 times with different RNG seeds to see how many alternate solutions with the "right" number of tumors per cluster were out there, and the answer was 1 other (6/500 runs). That one did not reproduce the right gene signature in pamr.

I also used the centroids of the pam genes from the full data and tried nearest-centroid classification using Pearson, Spearman and Euclidean distance, but no method reproduces the publication cross-table.
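For reference, the nearest-centroid check described above can be sketched roughly as follows. This is plain Python with toy numbers, not the actual data or R code, and it is an ordinary nearest-centroid rule that does not attempt to reproduce pamr's own scoring. All function names and the example profiles are hypothetical; correlation is turned into a distance as 1 minus the coefficient, which is an assumption about how the comparison would be set up.

```python
import math


def mean(values):
    return sum(values) / len(values)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    mx, my = mean(x), mean(y)
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return num / den


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


DISTANCES = {
    "pearson": lambda x, y: 1 - pearson(x, y),
    "spearman": lambda x, y: 1 - spearman(x, y),
    "euclidean": euclidean,
}


def class_centroids(profiles, labels):
    """Per-class mean expression over the signature genes."""
    groups = {}
    for profile, label in zip(profiles, labels):
        groups.setdefault(label, []).append(profile)
    return {lab: [mean(col) for col in zip(*rows)] for lab, rows in groups.items()}


def nearest_centroid(profile, centroids, metric):
    """Assign the class whose centroid is closest under the chosen metric."""
    dist = DISTANCES[metric]
    return min(centroids, key=lambda lab: dist(profile, centroids[lab]))


# Hypothetical toy data: 4 training tumors x 4 signature genes.
train = [[5.0, 4.0, 1.0, 0.5], [4.5, 4.2, 0.8, 1.0],
         [1.0, 0.7, 4.0, 5.0], [0.5, 1.2, 4.4, 4.6]]
labels = ["A", "A", "B", "B"]
centroids = class_centroids(train, labels)

query = [4.0, 3.0, 2.0, 1.0]
calls = {m: nearest_centroid(query, centroids, m) for m in DISTANCES}
print(calls)
```

Running all three metrics on the same centroids makes it easy to see which samples change class depending on the metric, which is useful when trying to track down a handful of discordant calls.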

It is important for me to be able to reproduce the exact clustering results from the paper in question, which is why I obtained the data from the authors. However, they did not include any cluster calls for individual samples.

I guess the next step is to bug the authors a bit more, but I wanted to check first if I have missed something very obvious.

r cancer classification gene-expression • 2.8k views
