Subtyping on TCGA BRCA RNAseqV2 data with PAM50
7.2 years ago
jxiang15 ▴ 30


I have TCGA breast cancer RNAseq V2 data and I would like to find the subtypes using the PAM50 gene set. After reading multiple posts, I'm still confused about the process.

It seems like the following is recommended for my problem.

PAM50Preds <- intrinsic.cluster.predict(sbt.model = pam50, data = dataset, annot = dannot, do.mapping = TRUE, verbose = TRUE)


However, I have the following questions.

  1. Was the pam50 model trained on microarray data, and does it therefore need to be refitted for RNA-seq data?
  2. Because I have new data, do I need to fit the model with intrinsic.cluster before predicting?

Basically, I want to check whether I can simply plug in the model from genefu and predict on my data, or whether some step needs to come first.


RNA-Seq cancer TCGA genefu PAM50 • 6.7k views
7.1 years ago
ldetorrente ▴ 40


I had similar questions and did some digging, so:

  1. Yes, it was trained on microarray data, so ideally you would refit it for RNA-seq data. The best option, I guess, would be to train it on the TCGA dataset and then apply the result to another dataset. If your interest is in TCGA itself, though, I don't know which datasets you could use for training.

  2. I tend not to use the genefu package, for the following reason: if you look into the different functions, you'll see that there is a scaling step. In your case, calling the function directly just applies a robust scaling. But how do you know that is the best choice? I emailed the package maintainer a few months ago, and he agreed that the scaling question is very important and that it would be better to stick with the method of Parker (who introduced PAM50). His scaling method is not yet implemented in genefu, so it is better to use Parker's function, which you can find here. If you want more information, you can also look here, where you will find the original dataset Parker used to train the PAM50 classifier.
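
To illustrate why the scaling step matters, here is a minimal toy sketch in base R. It assumes a nearest-centroid rule based on Spearman correlation, which is how I understand the PAM50 predictor to assign subtypes. The helper name nearest_centroid, the four genes, and the two centroids are all made up for illustration; they are not the real PAM50 centroids or any genefu function.

```r
# Toy nearest-centroid classifier (hypothetical helper, not part of genefu).
# expr:      genes x samples matrix
# centroids: genes x subtypes matrix with the same row order as expr
nearest_centroid <- function(expr, centroids) {
  stopifnot(identical(rownames(expr), rownames(centroids)))
  # Spearman correlation of every sample with every centroid (samples x subtypes)
  rho <- cor(expr, centroids, method = "spearman")
  colnames(centroids)[max.col(rho, ties.method = "first")]
}

genes <- paste0("g", 1:4)

# Made-up centroids on a centered scale (NOT the real PAM50 centroids)
centroids <- cbind(A = c(2, 1, -1, -2), B = c(-2, -1, 1, 2))
rownames(centroids) <- genes

# Made-up uncentered expression: a per-gene baseline plus a subtype-like pattern
baseline <- c(10, 8, 6, 4)
expr <- cbind(s1 = baseline + c(2, 1, -1, -2),
              s2 = baseline + c(-1, -0.5, 0.5, 1))
rownames(expr) <- genes

# Without centering, the per-gene baseline dominates the ranking
raw_calls <- nearest_centroid(expr, centroids)

# Gene-wise median centering across the cohort, then classify again
centered <- sweep(expr, 1, apply(expr, 1, median), "-")
centered_calls <- nearest_centroid(centered, centroids)
```

In this toy case both samples get the same call before centering and different calls after it. The medians themselves depend on which samples are in the cohort, which is why the choice of scaling is not a detail.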

There is actually already a paper about classifying the TCGA breast cancer RNA-seq data with PAM50 here. I wrote to the authors a few months ago to get the classification of all the samples, and they answered very quickly. So do not hesitate to write directly to Prof Charles Perou if you just need the result.


I've been thinking about this recently. For your information, here is what I've found.

PAM50 was actually first trained on qRT-PCR data, but this paper seems to have also used microarray data together with qRT-PCR data, clustering 189 breast tumors across 1,906 “intrinsic” genes:

Gene Set Reduction Using Prototype Samples and qRT-PCR

A minimized gene set was derived from the prototypic samples using the qRT-PCR data for 161 genes that passed FFPE performance criteria established in Mullins et al. [21]. Several minimization methods were used, including top “N” t test statistics for each group [22], top cluster index scores [23], and the remaining genes after “shrinkage” of modified t test statistics [24]. Cross-validation (random 10% left out in each of 50 cycles) was used to assess the robustness of the minimized gene sets. The “N” t test method was chosen due to having the lowest cross-validation (random 10% left out of each iteration) error. The 50 genes selected and their contribution to distinguishing the different subtypes is provided in Appendix Figure A2 (online only).

The 2015 TCGA breast cancer paper first adjusted the RNA-Seq data and then applied PAM50. From the SI:

To determine breast cancer intrinsic subtypes based on the PAM50 signature, first, the TCGA mRNA-seq data were subsampled to match the ER distribution of the training set used for the PAM50. Second, the entire TCGA 817 data set was adjusted to the median gene expression calculated for the PAM50 genes determined from the ER balanced subset; intrinsic subtyping was then done as previously described (Cancer Genome Atlas, 2012).
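
As I read it, the adjustment described in that passage could be sketched roughly as follows in base R. This is only my interpretation, not the paper's code: the helper name er_balanced_center is hypothetical, the "Positive"/"Negative" labels are an assumed encoding, and the ER+ fraction of 0.5 is a placeholder because the training-set value is not given here.

```r
# Hypothetical helper sketching ER-balanced median centering (not the paper's code).
# expr:            genes x samples matrix restricted to the PAM50 genes
# er_status:       "Positive"/"Negative" label per sample column (assumed encoding)
# er_pos_fraction: ER+ fraction to match; 0.5 is a placeholder value
er_balanced_center <- function(expr, er_status, er_pos_fraction = 0.5, seed = 1) {
  stopifnot(ncol(expr) == length(er_status),
            er_pos_fraction > 0, er_pos_fraction < 1)
  pos <- which(er_status == "Positive")
  neg <- which(er_status == "Negative")

  # Largest subset whose ER+ share is approximately er_pos_fraction
  f <- er_pos_fraction
  n_pos <- min(length(pos), floor(length(neg) * f / (1 - f)))
  n_neg <- min(length(neg), floor(n_pos * (1 - f) / f))
  stopifnot(n_pos > 0, n_neg > 0)

  # Draw the ER-balanced subset (sample.int avoids the length-1 pitfall of sample)
  set.seed(seed)
  keep <- c(pos[sample.int(length(pos), n_pos)],
            neg[sample.int(length(neg), n_neg)])

  # Gene medians come from the balanced subset only...
  gene_medians <- apply(expr[, keep, drop = FALSE], 1, median, na.rm = TRUE)

  # ...but are subtracted from every sample in the full data set
  list(centered = sweep(expr, 1, gene_medians, "-"),
       medians  = gene_medians,
       subset   = sort(keep))
}

# Made-up data: 6 ER+ samples at 10 and 2 ER- samples at 0 for both genes
toy_expr <- matrix(rep(c(10, 10, 10, 10, 10, 10, 0, 0), each = 2), nrow = 2,
                   dimnames = list(c("ESR1", "FOXA1"), paste0("S", 1:8)))
toy_er <- c(rep("Positive", 6), rep("Negative", 2))
res <- er_balanced_center(toy_expr, toy_er)
```

The point of the balanced subset is that the medians are not pulled toward whichever ER group happens to dominate the cohort. The centered matrix would then go into the subtype predictor.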

However, the 2012 breast cancer paper (Cancer Genome Atlas, 2012) used microarray data, and I've found inconsistencies between the two sets of subtype calls (~10%). I contacted Dr Perou, who commented that there are always inconsistencies in subtyping between different platforms, which surprised me.
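
A comparison like this can be done with a simple cross-tabulation. Below is a minimal base R sketch; the helper name subtype_concordance and the toy calls are made up for illustration and are not the actual TCGA labels.

```r
# Hypothetical helper: compare two named vectors of subtype calls
# (names are sample IDs). Only samples present in both sets are compared.
subtype_concordance <- function(calls_a, calls_b) {
  shared <- intersect(names(calls_a), names(calls_b))
  a <- calls_a[shared]
  b <- calls_b[shared]
  list(n_shared  = length(shared),
       agreement = mean(a == b),                 # fraction of identical calls
       confusion = table(array = a, rnaseq = b)) # which subtypes get swapped
}

# Made-up calls for 11 array samples and 10 RNA-seq samples
ids <- paste0("S", 1:11)
array_calls <- setNames(c("LumA", "LumA", "LumA", "LumB", "LumB", "Basal",
                          "Basal", "Her2", "Her2", "Normal", "LumA"), ids)
rnaseq_calls <- array_calls[1:10]
rnaseq_calls["S4"] <- "LumA"  # one discordant call (LumB on array, LumA on RNA-seq)

res_conc <- subtype_concordance(array_calls, rnaseq_calls)
```

The confusion table is more informative than the overall agreement, since it shows whether the disagreements are concentrated between particular subtype pairs.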



Thanks a lot for this information, it's really useful.

I contacted Prof Charles Perou but haven't yet received a response. I was wondering if you could please share the PAM50 classification for the whole BRCA cohort? It would be greatly appreciated! All I'd need is a list of IDs and PAM50 subtypes, so it should be a tiny file.

Many thanks in advance! A


For anyone interested, the PAM50 classification is reported in the Supplementary Data of the 2015 TCGA breast cancer paper. For an even more comprehensive list of PAM50 classifications for TCGA breast cancer patients, you can get it from this paper using the R package TCGAbiolinks:

brca_subtypes <- TCGAbiolinks::TCGAquery_subtype("brca")
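
Once you have the table, a quick count per subtype is a useful sanity check. The sketch below is a small base R helper under one assumption: that the returned table has a column whose name contains "PAM50". Check colnames() on the real object, since column names may differ between package versions. The helper name tabulate_pam50 and the toy data frame are made up for illustration.

```r
# Hypothetical helper: count samples per PAM50 subtype in a subtype table.
# Assumes some column name contains "PAM50" (an assumption, verify on real data).
tabulate_pam50 <- function(subtypes) {
  pam50_cols <- grep("PAM50", colnames(subtypes), value = TRUE, ignore.case = TRUE)
  if (length(pam50_cols) == 0) stop("no column with 'PAM50' in its name")
  table(subtypes[[pam50_cols[1]]], useNA = "ifany")
}

# Made-up stand-in for the real table (IDs and labels are illustrative only)
toy <- data.frame(patient = paste0("P", 1:5),
                  BRCA_Subtype_PAM50 = c("LumA", "LumA", "Basal", "Her2", NA))
counts <- tabulate_pam50(toy)

# With the real data (requires TCGAbiolinks to be installed):
# brca_subtypes <- TCGAbiolinks::TCGAquery_subtype("brca")
# tabulate_pam50(brca_subtypes)
```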
