Question: trying to reproduce heat map made via cBioportal using download data
0
gravatar for chrisclarkson100
11 days ago by
European Union
chrisclarkson10050 wrote:

I need to make a statistical comparison using breast cancer data. I have made a heat map at the following link on the Bioportal: http://www.cbioportal.org/index.do?cancer_study_id=brca_tcga&Z_SCORE_THRESHOLD=2.0&RPPA_SCORE_THRESHOLD=2.0&data_priority=0&case_set_id=brca_tcga_mrna&gene_list=BRCA1%250ABRCA2%250ATP53%250ACCL2%250ACCR3%250ACD44%250AENG%250AIL6%250AIL33%250ACD33%250ACSF1%250AHIF1A%250ACLEC7A&geneset_list=%20&tab_index=tab_visualize&Action=Submit&genetic_profile_ids_PROFILE_MRNA_EXPRESSION=brca_tcga_mrna_median_Zscores&show_samples=false&heatmap_track_groups=brca_tcga_mrna_median_Zscores%2CBRCA1%2CBRCA2%2CTP53%2CCCL2%2CCCR3%2CCD44%2CENG%2CIL6%2CIL33%2CCD33%2CCSF1%2CHIF1A%2CCLEC7A

enter image description here

To do this I initially went to the http://www.cbioportal.org page and selected the cancer samples that I am interested in: Breast Invasive Carcinoma (TCGA, Provisional)

I then went to the textbook to enter the names of the genes that I am interested in: BRCA1 BRCA2 TP53 CCL2 CCR3 CD44 ENG IL6 IL33 CD33 CSF1 HIF1A CLEC7A

Then I submitted the query and was able to plot a clustered heat map (by clicking 'Heatmap>>Add genes to heatmap>>cluster heatmap') in the "oncoprint" tab.

While this type of annotation is useful, I would also like to be able to download the data that is generating this heat map and make a similar heat map on my local

computer (for experimental reasons).

To attempt this I went to the original search page and clicked the "View summary" button.

From this I found a "Download Data" button at the top of the page.

This returns a 'tar.gz' file with lots of interesting datasets. e.g.:

  • data_mRNA_median_Zscores.txt

  • data_expression_median.txt

  • data_RNA_Seq_v2_mRNA_median_Zscores.txt

  • data_RNA_Seq_v2_expression_median.txt

I want to find the expression data that was used to generate the histogram shown in the first provided link. From the downloaded files, I initially

tried data_RNA_Seq_v2_expression_median.txt

Below is my attempt to reproduce a heatmap similar to the one above:

data=read.table('data_RNA_Seq_v2_expression_median.txt',header=T,fill = T,stringsAsFactors = F)
genes_OI=c("BRCA1","BRCA2","TP53","CCL2","CCR3","CD44","ENG","IL6","IL33","CD33","CSF1","HIF1A","CLEC7A")

data_OI=data.frame()
for(i in genes_OI$V1){
 data_OI=rbind(data_OI,data[which(data[,1]==i),])
}
sumis.na(data_OI))
library(gplots)
png('test_TCGA_patients.png',height = 1000,width=1000)
data_OI[,-c(1,2)]=apply(as.matrix(data_OI[,-c(1,2)]), 2, as.numeric)

data=na.omit(data)
heatmap.2(as.matrix(data_OI[,-c(1,2)]),labCol = NA,
         labRow = data_OI[,1],cexRow = 1.4,keysize = 1.4)
dev.off()

The resulting heatmpat is as follows:

enter image description here

But this is not at all like the heatmap in the link at the top of the page.... is there some normalisation step that I am missing?

I also tried using the file where the the Zscores were computed: data_RNA_Seq_v2_mRNA_median_Zscores.txt

This file however (just from looking at the file content does not require any distance matrix):

data=read.table('data_RNA_Seq_v2_mRNA_median_Zscores.txt',header=T,fill = T,stringsAsFactors = F)
genes_OI=c("BRCA1","BRCA2","TP53","CCL2","CCR3","CD44","ENG","IL6","IL33","CD33","CSF1","HIF1A","CLEC7A")

data_OI=data.frame()
 for(i in genes_OI$V1){
     data_OI=rbind(data_OI,data[which(data[,1]==i),])
 }
png('test_TCGA_patients.png',height = 1000,width=1000)
data_OI[,-c(1,2)]=apply(as.matrix(data_OI[,-c(1,2)]), 2, as.numeric)

data=na.omit(data)

hclustfunc <- function(x, method = "complete", dmeth = "euclidean") {    
  hclust(dist(x, method = dmeth), method = method)
}
rc<-hclustfunc(data_OI[,-c(1,2)])
cd=t(data_OI[,-c(1,2)])
cc<-hclustfunc(cd)
heatmap(as.matrix(data_OI[,-c(1,2)]), Rowv=as.dendrogram(rc), 
        Colv=as.dendrogram(cc),labRow = data_OI[,1],labCol = NA)
dev.off()

This unfortunately does not produce anything similar to the heat map seen in the link above....

Hence I am wondering were it is that I am going wrong....? Am I using the correct file or is there

a normalisation step that I am missing?

rna-seq • 111 views
ADD COMMENTlink modified 11 days ago by Kevin Blighe21k • written 11 days ago by chrisclarkson10050

The cBioPortal group only refers to the part at the bottom as the heatmap. The main figure is an 'oncoprint', with some form of a heatmap at the bottom.

Take a look here: OncoPrint

ADD REPLYlink modified 11 days ago • written 11 days ago by Kevin Blighe21k

Hi sorry forgot to add step about "clicking 'Heatmap>>Add genes to heatmap>>cluster heatmap". The question has been modified necessarily.

ADD REPLYlink written 11 days ago by chrisclarkson10050

I have posted an answer for you - see below

ADD REPLYlink written 11 days ago by Kevin Blighe21k

See: How to add images to a Biostars post to understand how to embed images in a post. I've made the necessary changes for now.

ADD REPLYlink written 11 days ago by Ram15k

thanks this has been amended

ADD REPLYlink written 11 days ago by chrisclarkson10050
1
gravatar for Kevin Blighe
11 days ago by
Kevin Blighe21k
University College London Cancer Institute
Kevin Blighe21k wrote:

Hello again,

I see - thank you for clarifying.

Can you do me a favour and plot a histogram of the data from data_RNA_Seq_v2_expression_median.txt ? I assume that it's just the normalised counts, whereas the other file ( data_RNA_Seq_v2_mRNA_median_Zscores.txt ) represents the Z-scores of these normalised counts. Just use:

hist(data)

or

hist(data.matrix(data))

If you use the data from the first file (data_RNA_Seq_v2_expression_median.txt), you are virtually guaranteed a better heatmap if you do the following:

Read in data

data <- read.table('data_RNA_Seq_v2_expression_median.txt',header=T,fill = T,stringsAsFactors = F)
genes_OI <- c("BRCA1","BRCA2","TP53","CCL2","CCR3","CD44","ENG","IL6","IL33","CD33","CSF1","HIF1A","CLEC7A")

Set rownames

rownames(data) <- data[,1]
data <- data.matrix(data[,-1])

Select out genes of interest

data_OI <- data[which(rownames(data) %in% genes_OI),]

Pick a color scheme and set breaks

require("RColorBrewer")
myCol <- colorRampPalette(c("darkblue", "black", "darkred"))(100)

Set coluor 'breaks' at -2 and +2 Z-score

myBreaks <- seq(-2, 2, length.out=101)

Transform your data to the Z-scale

heat <- t(scale(t(data_OI)))

Plot the heatmap, specify your custom breaks and colour scheme, and switch further scaling (by heatmap.2) off

library(gplots)
pdf('test_TCGA_patients.pdf', width=7, height=5)
  heatmap.2(heat,
    col=myCol,
    breaks=myBreaks,
    main="Title",
    key=T, keysize=1.0,
    scale="none",
    density.info="none",
    reorderfun=function(d,w) reorder(d, w, agglo.FUN=mean),
    trace="none",
    labRow
    cexRow=1.0,

    cexCol=0.2,
    distfun=function(x) dist(x, method="euclidean"),
    hclustfun=function(x) hclust(x, method="ward.D2"))
dev.off()

You can switch off the dendrograms by adding the following parameters:

heatmap.2(..., dendrogram="none", Rowv=FALSE, Colv=FALSE, ...)

If you want to do this with the data_RNA_Seq_v2_mRNA_median_Zscores.txt data, then just skip the heat <- t(scale(t(data_OI))) command.

This code is untested because I don't actually have the files in front of me. It should work, though. Some of your functions, like for selecting out genes and then converting to a data matrix, looked overly complex...

Kevin

ADD COMMENTlink written 11 days ago by Kevin Blighe21k

Hi thank you. You're answer is certainly an improvement on mine. enter image description here https://ibb.co/giCkcy

As for creating the histogram:

    head(data[,c(1:4)])
   Hugo_Symbol Entrez_Gene_Id TCGA.3C.AAAU.01 TCGA.3C.AALI.01
1 LOC100130426      100130426          0.0000          0.0000
2     UBE2Q2P3      100133144         16.3644          9.2659
3     UBE2Q2P3      100134869         12.9316         17.3790
4    LOC149767          10357         52.1503         69.7553
5       TIMM23          10431        408.0760        563.8934
6        MOXD2         136542          0.0000          0.0000

      hist(data.matrix(data[,-c(1,2)]))

this returns the following histogram

I did this but the resulting distribtion is very skewed

enter image description here https://ibb.co/eEs1Hy

ADD REPLYlink modified 11 days ago • written 11 days ago by chrisclarkson10050

Okay, so, it just looks like normalised RNA-seq counts on the negative binomial scale.

I would use the Z-scaled data and skip my scale() command, as I mentioned.

Also, if you 'squish' the PDF, you can make it look elongated like the cBioPortal image.

For example:

pdf(filename, width=9, height=2.5)
   ...
dev.off()
ADD REPLYlink written 11 days ago by Kevin Blighe21k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 977 users visited in the last hour