I have some UNC Illumina RNAseqV2 data covering about 100 genes for 800 patients, identified by UNC ID. I'd like to find the subtype of each tumor (normal, luminal A, luminal B, basal, HER2) to train a classifier. UNC IDs would be preferable, but if only TCGA barcodes are provided I believe it's possible to match them up. I can't find this on the TCGA website; maybe I'm just looking in the wrong places.
You can use TCGAbiolinks to retrieve the list:
source("http://www.bioconductor.org/biocLite.R")
library(TCGAbiolinks)

cancer <- "BRCA"
PlatformCancer <- "IlluminaHiSeq_RNASeqV2"
dataType <- "rsem.genes.results"
pathCancer <- "TCGAData/miRNA"

datQuery <- TCGAquery(tumor = cancer, platform = PlatformCancer, level = "3")
lsSample <- TCGAquery_samplesfilter(query = datQuery)

# get subtype information
dataSubt <- TCGAquery_subtype(tumor = cancer)
lumA <- dataSubt[which(dataSubt$PAM50.mRNA == "Luminal A"), 1]

allSamples <- lsSample$IlluminaHiSeq_RNASeqV2
# 1218 total samples

lumASamples <- allSamples[grep(x = allSamples, pattern = paste(lumA, collapse = "|"))]
# 263 luminal samples found
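If you want a subtype label for every sample rather than just the luminal A subset, one option is to match on the patient-level part of the TCGA barcode (the first 12 characters) instead of building a regular expression. The following is only a minimal base-R sketch under some assumptions: the barcodes and subtype labels are made-up toy values, the column name `patient` is hypothetical, and labelling sample type `11` as "Normal tissue" is my own choice for the classifier's "normal" class, not something the subtype table provides.

```r
# Toy data: these barcodes and labels are invented for illustration only.
sample_barcodes <- c(
  "TCGA-AA-0001-01A-11R-A000-07",
  "TCGA-AA-0002-01A-11R-A000-07",
  "TCGA-AA-0002-11A-11R-A000-07",
  "TCGA-AA-0003-01A-11R-A000-07"
)
# Stand-in for the subtype table; "patient" is a hypothetical column name.
subtypes <- data.frame(
  patient = c("TCGA-AA-0001", "TCGA-AA-0002"),
  PAM50.mRNA = c("Luminal A", "Basal-like"),
  stringsAsFactors = FALSE
)

# Patient-level ID (project-TSS-participant) is the first 12 characters.
patient_id <- substr(sample_barcodes, 1, 12)
# Sample type code is characters 14-15: 01 = primary tumor, 11 = solid tissue normal.
sample_type <- substr(sample_barcodes, 14, 15)

# Exact lookup by patient ID; patients without a call get NA.
labels <- subtypes$PAM50.mRNA[match(patient_id, subtypes$patient)]
# Assumption: adjacent normal tissue is its own class, whatever the patient's tumor subtype.
labels[sample_type == "11"] <- "Normal tissue"

annotated <- data.frame(
  sample = sample_barcodes,
  patient = patient_id,
  subtype = labels,
  stringsAsFactors = FALSE
)
print(annotated)
```

Samples that come back as NA have no call in the subtype table, so they would need to be dropped or called separately before training.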
You'll find a list derived from microarrays (Nature 2012 release) at
It appears that there is no canonical PAM50 call set for the RNAseq version, which leaves everyone to make their own calls (using the genefu package or some other means) and get somewhat different results for edge-case tumors.