Heatmap of a transformed deseq object with pheatmap
2
0
Entering edit mode
4 months ago

I am trying to produce a heatmap to represent a contrast I performed on my deseq object. We have a relatively big experiment where samples are individual Drosophila in 10 different conditions. I am managing to produce the heatmap, but I am not managing to add the 2 levels of the conditions compared in my contrast on top of the heatmap as an annotation :

resIHW5_tidy<- results(dds, contrast=c("condition","F6","F0"), filterFun = ihw,alpha=0.05, lfcThreshold=1, altHypothesis="greaterAbs", tidy=TRUE) 
df_IHW5<- as.data.frame(resIHW5_tidy)
IHW5_sign = subset(df_IHW5, padj<=0.05)
rownames(IHW5_sign)=IHW5_sign$row ## now the row is the FlyBase accession number

# I rearranged the table to get the genes reordered by log2 fold change : 
res_log2FC <- arrange(IHW5_sign, desc(IHW5_sign$log2FoldChange))
# I created a vector of the gene accession numbers
genes <- res_log2FC$row
genes <- factor(genes, levels=genes)
1765 Levels: FBgn0000356 FBgn0052644 FBgn0031621 FBgn0000644 ... FBgn0036600

gene_names<-as.factor(res_log2FC$name) # a vector containing the gene names that I assigned to the accession numbers, but the problem is that there are several accession numbers with the same gene name so instead 

# So instead I added an other column in IHW5_sign which consists in a concatenation of the accession number and gene name and made it a factor :
genes_names<-paste(genes,"-",gene_names)
genes_names
[1763] "FBgn0051091 - uncharacterized protein"                                                       
[1764] "FBgn0005633 - flightin"                                                                      
[1765] "FBgn0036600 - uncharacterized protein"

genes_names<-factor(genes_names)

# I transformed my deseq object : 
vsd<-vst(dds,blind=FALSE)

colData names(10): library sample ... host-treat condition

colData(vsd)$condition <- as.factor(colData(vsd)$condition)

# filtering out the samples that are in my 2 levels of interest (the ones compared in my contrast) only from this new dataset : 
vsd_interest<-vsd[ , vsd$condition %in% c("F0", "F6") ]

# vsd_interest$condition, the output still shows me the 10 levels as factors, despite having only the 2 of them effectively represented in the new dataset. I am not sure how much of a problem this is though. 

samples_interest<-colnames(vsd_interest)
samples_interest
 [1] "3"   "6"   "8"   "14"  "46"  "47"  "59"  "60"  "74"  "75"  "80"  "86" 
[13] "95"  "99"  "102" "104" "116" "119" "137" "139" "143" "149" "150" "151"
[25] "172" "173" "188" "190" "195" "198" "209" "210" "211" "233" "235" "239"
[37] "242" "243" "460" "985"
# these are indeed my samples of interest in the 2 treatment levels of my contrast

#from my metadata file, called "sample_data.tsv" :
sample_data_interest<- sample_data[sample_data$condition =='F0' | sample_data$condition =='F6',]

samples<-sample_data_interest$library
samples
 [1] "3"   "6"   "8"   "14"  "46"  "47"  "59"  "60"  "74"  "75"  "80"  "86" 
[13] "95"  "99"  "102" "104" "116" "119" "137" "139" "143" "149" "150" "151"
[25] "172" "173" "188" "190" "195" "198" "209" "210" "211" "233" "235" "239"
[37] "242" "243" "460" "985"
# they do match the colnames(vsd_interest)

### Now the heatmap : 

pheatmap(assay(vsd_interest)[rownames(assay(vsd_interest)) %in% genes, colnames(assay(vsd_interest)) %in% samples], 
    cluster_rows=F, 
    cluster_cols=T, 
    cellwidth = 10,cellheight=8.6,
    color=colorRampPalette(c("#f9cb9c","#a7146a"))(n = 100),
    show_colnames=T,
    show_rownames = T,
    labels_row=genes_names,
    heatmap_legend_param = list(title = "Counts", 
                                direction = "horizontal",
                                title_position = "topcenter"))

Here is the top of the heatmap (because it is very very long) : enter image description here

It would be perfect if I could add the levels of the condition for each sample at the very top, where the dendrogram is. I tried :

fcond<-as.factor(colData(vsd_interest)$condition)

# this still gives me the 10 levels, despite not being represented in vsd_interest since it is a subset

pheatmap(assay(vsd_interest)[rownames(assay(vsd_interest)) %in% genes, colnames(assay(vsd_interest)) %in% samples], 
    cluster_rows=F, 
    cluster_cols=T, 
    cellwidth = 10,cellheight=8.6,
    color=colorRampPalette(c("#f9cb9c","#a7146a"))(n = 100),
    show_colnames=T, # Show the sample names
    show_rownames = T,
    labels_row=genes_names,
    annotation_col = fcond,
    heatmap_legend_param = list(title = "Counts", 
                                direction = "horizontal",
                                title_position = "topcenter"))

Error in `[.default`(annotation_col, colnames(mat), , drop = F) : 
  incorrect number of dimensions

# So I tried taking the condition levels from my subsetted metadata file instead :

fcond2<-as.factor(sample_data_interest$condition)

# and use this instead of fcond in the previous code, but with the same outcome.

I would really appreciate it if someone could point out the very naive mistake I am making.

deseq2 pheatmap • 1.7k views
ADD COMMENT
2
Entering edit mode
4 months ago
SamGG ▴ 150

The annotation_col must be a data.frame, not a simple vector, one row per sample. The rows must have a name, which will permit the association between the annotation and the expression matrix. Each column of the expression matrix must have a name. Currently, as you set show_colnames=T but nothing is printed, I guess colnames(expression_matrix) is NULL.

Here is some untested code I would start with, and this code replaces all your code starting at the following line:

# vsd_interest$condition, the output still shows me the 10 levels as factors, despite having only the 2 of them effectively represented in the new dataset. I am not sure how much of a problem this is though. 
annot_col <- data.frame(condition = factor(colData(vsd_interest)$condition))  # factor will drop unused levels

(samples_interest<-colnames(vsd_interest))
# these are indeed my samples of interest in the 2 treatment levels of my contrast
rownames(annot_col) <- samples_interest

#from my metadata file, called "sample_data.tsv" :
# they do match the colnames(vsd_interest)
# you don't need to reload any information from an external source
# the object in your hands has all the information

### Now the heatmap : 

expr_matrix <- assay(vsd_interest)[rownames(assay(vsd_interest)) %in% genes, colnames(assay(vsd_interest)) %in% samples]

# this still gives me the 10 levels, despite not being represented in vsd_interest since it is a subset
# solved above

pheatmap(expr_matrix, 
    cluster_rows=F, 
    cluster_cols=T, 
    cellwidth = 10,cellheight=8.6,
    color=colorRampPalette(c("#f9cb9c","#a7146a"))(n = 100),
    show_colnames=T, # Show the sample names
    show_rownames = T,
    labels_row=genes_names,
    annotation_col = annot_col,
    heatmap_legend_param = list(title = "Counts", 
                                direction = "horizontal",
                                title_position = "topcenter"))

You probably should center each row to visualize the logFC between F0 and F6. Alternatively, if F0 is the reference condition, you could compute the mean using the F0 samples and remove this mean value from the whole row (aka gene). After viewing this heatmap, you could decide to scale each row, or not.

ADD COMMENT
0
Entering edit mode

I think you very very much for your help. It worked. I think I will have to take off the legend and add it separately though, because I am not finding how to change the position with 'pheatmap' : enter image description here

F0 is indeed the reference condition in the contrast. Centering each row would mean that I would represent only F6 then ? I like showing the variation between samples of all treatments, but this might just be too messy.

ADD REPLY
1
Entering edit mode

Glad it helps you. You probably can remove the legend and color scale, or if you save the figure in PDF or SVG (I don't know whether SVG is possible), you can move it to a better place. But what is strange is that on the first figure the color scale was really away from the heatmap.

Centring each line on the mean of the F0 samples means that the F0 and F6 values will be represented. The centred F0 values will be close to zero and their fluctuation around zero will give an idea of their noise level. Thus, all F0 samples will show expression levels around zero for each gene. Centred F6 values (relative to the mean of F0) will be more directly readable than when centred on the mean of the F0 and F6 samples.

The code should look like:

ref_means <- rowMeans(expr_matrix[, annot_col$condition == "F0"])
cent_expr_matrix <- expr_matrix - ref_means
cent_expr_breaks <- seq(-3, +3, length = 101)

Then you know how to draw the heatmap. Just be careful a) to revert to the default color scale which is a dual gradient color scale, and b) to make it centered on zero and strictly symmetric (ie from -3 to +3) by setting breaks = cent_expr_breaks in the pheatmap call. Last note, if expr_matrix has rownames (which I advise), there is no need to set labels_row.

When the expression is centred on the reference group, the colour of the heat map is really the logFC. When centred on the whole line made up of two balanced groups, the color represents only half the logFC (taking zero as the reference).

When scaling (i.e. dividing by sd or mad), the color no longer strictly represents the logFC.

Would you be kind enough to post the F0 centered heatmap as an example of this point of view? Thanks.

ADD REPLY
0
Entering edit mode
pheatmap(cent_expr_matrix, breaks = cent_expr_breaks,
    cluster_rows=F, 
    cluster_cols=T, 
    cellwidth = 10,cellheight=8.6,
    color=colorRampPalette(c("#4fa85d","#f9cb9c","#a7146a"))(n = 100),
    show_colnames=T,
    show_rownames = T,
    labels_row=genes_names,
    annotation_col = annot_col,
    heatmap_legend_param = list(title = "log2FC", 
                                direction = "horizontal",
                                title_position = "topcenter"))

He rownames of expr_matrix are the accession numbers only, I feared it would be too messy otherwise, so I kept using labels_row. Green is downregulated compared to the mean of the reference level, yellow close to 0, purple upregulated.

Heatmap sentered on the mean of the reference level

This time the legend is lost somewhere further down the heatmap, which is annoying. Also, I only just noticed that my reference group is on the left and my control on the right, I should find a way to change this.

I ordered the genes by log2FC value, but it maybe I should rather cluster the lines.

ADD REPLY
1
Entering edit mode

rownames(expr_matrix) <- paste(genes,"-",gene_names) will avoid to pass labels_row and allows to keep custom names synchronized with the data (de-synchronization is something I fear).

The words reference and control have a similar meaning, at least to me, so I didn't fully understand your last sentence until I read the heat map. The reference is the condition that is used to calculate the average, in this case F0. The samples from the reference condition are therefore clearly on the right, with colours around ‘yellow’. They were already on the right in the first heat map. The compared condition (sometimes called the test, here condition F6) is on the left. I don't know how to reverse the order on the heatmap, but I wouldn't be interested. As soon as I know how to read the heatmap and/or the conditions legend, I know how to catch the message.

Row clustering will help building gene groups. So that a good option to try.

Nevertheless, I am a bit surprised by the logFC ordering, as I feel that some genes have a low color contrast, e.g. Juvenilie hormone at 2/3 of the list or the FBgn0051926 at 1/4.

As a matter of taste, I am not a fan of the colors you used and I prefer the default ones.

Finally, if you want to have more tuning options, you should have a look at the complexHeatmap package.

ADD REPLY
0
Entering edit mode

I made I mistake I meant "the reference on the right, the group that was compared to it on the left". I think you are right about the ordering, I might have done something wrong. I will probably change the colors yes :) I think the yellow is not bright enough. I am probably going to use the complexHeatmap package in the end, but I likes the simplicity of this one. You mentioned being able to change the position of the legend if I saved as PDF, could you tell me more ? Thanks again !

ADD REPLY
0
Entering edit mode

When you save a plot in a vector format such as SVG or PDF, there will be objects that can be moved around as units. You can literally drag the legend to any part of the plot. If you don't have Adobe Illustrator or Corel Draw, Inkscape is free and can also manipulate vector files.

ADD REPLY
0
Entering edit mode

...I decided to try 'ComplexHeatmap' :

ha = HeatmapAnnotation(df = annot_col, which="column",show_annotation_name = FALSE,show_legend=FALSE)

Heatmap(cent_expr_matrix, name = "Log2FC", km = 5, col = colorRamp2(c(-3, 0, +3), c("#00aba9", "white", "#ff0097")),
    top_annotation = ha, column_split=annot_col$generation, show_row_names = TRUE, show_column_names = TRUE, show_column_dend=FALSE,show_row_dend=FALSE, width = unit(0.8, "snpc"))

enter image description here

It is easier than I thought it would be, but I am not finding how to use genes_names as row annotations. It is also a bit frustrating that I cannot choose the size of the cells.

I wish there was a way to visualize samples that had 0 counts, because there were quite a lot of them for some genes. I suppose now they appear as the most intense shade of blue.

ADD REPLY
0
Entering edit mode

Please use ADD REPLY/ADD COMMENT when responding to existing posts. ADD ANSWER is only for new answers to original question.

ADD REPLY
1
Entering edit mode
4 months ago

You probably want to center and scale this data, don't you?

You also might want to do a droplevels after you've subsetted, to remove the factors you don't actually have anymore.

Are you following this post about how to do this? I think you need a little more to your fcond:

pheatmap annotation - legend only for columns

ADD COMMENT
0
Entering edit mode

Thank you very much ! Yes I guess I should center and scale. I am always uneasy about doing this, but this would make everything more readable.

ADD REPLY
1
Entering edit mode

It has nothing to do with readability but with transforming data to emphasize differences rather than absolute expression levels, see Scaling RNA-Seq data before clustering?

ADD REPLY

Login before adding your answer.

Traffic: 4083 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6