Question

Heatmaps: Why and How to use them

2

Entering edit mode

7.3 years ago

hougiotaejut ▴ 30

I have a time course study in which there are 4 time points and each time point has 4 biological replicates. My first aim was to find the DE gens during time.

Please consider it that I'm at a very basic level on analyzing RNA-seq data, that's why my questions seem very simple to you.

I have read somethings about heatmaps. As I know, heatmap is a good tool to analyze the data visually (since the data is high dimensional) and to cluster the genes. I also know that, the gene expression values need to be normalized to avoid systematic biases and there are some useful R packages to draw heatmaps.

What I don't know is that:

Why do you cluster the genes in heatmaps at all?
in my case where I have the normalized read counts of a time course RNA-seq study, with 4 biological replicates and 4 time points (4 conditions) how should I draw a heatmap? Should I draw the heatmaps for each time point (condition) separately? or should I draw all biological replicates and time points at the same time? which means 16 replicates in a row.
The number of DEG is about 13000. How should I decide how to cluster them? Should I only draw heatmaps for DEGs or should I do it for all of the genes?
I have done the normalization by methods such as TMM, Median, Quantile or Total. Is it essential to use FPKM or RPKM values?
I also read that there is no best method for choosing distance measures in cluster analysis. Does it mean whatever method I choose would be up to me?

I have read some information on biostars which I can refer you to, but none of them could help me understand the answers to my questions:

Heatmap Of Rna-Seq Data
Hierarchical Clustering And Heatmap Analysis For Rna-Seq Data
Clustering Data (Rna-Seq) Using R To Produce A Heatmap

and some other similar posts.

Thanks a lot.

****UPDATE****

I just read that I have to merge the table of my read counts and the table of FRD-adjusted P-values for each gene in one table, and then draw the heat maps for most significant genes

heatmaps RNA-seq DEG • 6.3k views

ADD COMMENT • link updated 7.3 years ago by mforde84 ★ 1.4k • written 7.3 years ago by hougiotaejut ▴ 30

score 3 · Answer 1 · 2017-01-05

3

Entering edit mode

7.3 years ago

mforde84 ★ 1.4k

Good questions:

1) They look important, i.e., sexy from a publishing standpoint. Whereas in reality, they're mostly just a hodge podge of data if not properly organized and clustered.

2) How you organize will be largely dependent on aesthetic outcome, especially if you end up trying to cluster rows and columns on the heatmap. You'll just have to try a few ways, and see which works best for you and the presentation of the data. Overall, people tend to prefer simpler heatmaps.

3) Best approach would be to do some sort of pathway or GO enrichment analysis first. I've been using GOexpress in R, which can be used directly to generate heatmaps. Also, it's unlikely that you have 13k DE genes. You will have to apply a filter. Most people use a hard cutoff of 2+ fold increase, or 50%- decrease. Though this is just a suggestion. For GO analysis, you may want to use a different filter, for example number or proportion of significantly enriched genes associated with a term ontology is a good place to start.

4) From my own experience, you want to normalize counts for heteroskedasticity across samples for each gene feature. However, I would first suggest doing some initial normalization of the raw counts, for example in DESeq theres library size correction, etc.

# use case for DESeq pipeline
exprs <- counts(cds,normalized=TRUE)
transpose_expr <- t(exprs)
scale_transpose <- scale(transpose_expr)
exprs <- t(scale_transpose)

5) I don't think anyone is really going to question it either way, but you can do additional validations. e.g., https://cran.r-project.org/web/packages/RankAggreg/vignettes/RankAggreg.pdf

ADD COMMENT • link 7.3 years ago by mforde84 ★ 1.4k

0

Entering edit mode

Thank you very much. I'm trying to compare two models. So I haven't changed the default settings for each model and that is why they give me a high number of DE genes. I want to figure out how close these two models work by default by comparing their results on finding the index genes and their number of overlapped genes. But the most significant genes are less than 8k. Still too much, huh?

I'm going to plot the heatmap for the top 100 DEGs by their FDR-adj.p-values<0.05 seperately for each model.

To practice, I plotted two heatmaps using two different commands to see what happens though I still have problems interpreting these maps.

ADD REPLY • link 7.3 years ago by hougiotaejut ▴ 30

2

Entering edit mode

Yea, 8k seems like a lot. I don't think it's out of the realm of possibility, but it's unlikely. I'm assuming that most are marginally DE. I'll typically do a default analysis (pooled-CR, i believe in deseq2), then filter by adjusted pval and log fold change either >2 or less than 0.5.

Make sure you're scaling the gene features like in the code example above. Otherwise, it may be difficult to see DE in the heatmap, because some gene targets may be very highly or lowly expressed and throw off the coloring scale. For example, say I have 2 DE genes. One has a baseline of 1000, and is increased to 2000. The other has a baseline of 10, and is increase to 20. If you don't scale, then the first DE gene will be all red, and the second will be all blue, because the scale range for the heatmap is effectively 10-2000.

ADD REPLY • link 7.3 years ago by mforde84 ★ 1.4k

0

Entering edit mode

One of the models provides me with the log2fold changes and p-values, but the other one doesn't. So I don't know how to further filter the DEGs by log2fold change for the second model. it only gives me the DEGs, and their PPs (posterior probabilty).

and thanks for mentioning the code above. I took a look at the manual of DESeq. I will scale my normalized read counts matrix. The info were really useful. Please if there is anything else I need to know, let me know. Thanks

ADD REPLY • link 7.3 years ago by hougiotaejut ▴ 30

1

Entering edit mode

hm, that's a thinker. ill have to search around and see how other people are filtering assuming a bayesian framework. do you have priors for your genes as well? honestly ill have to admit, this isn't exactly an area of my expertise.

ADD REPLY • link 7.3 years ago by mforde84 ★ 1.4k

0

Entering edit mode

I once was blamed to ask simple questions (I know I'm not knowledgeable enough), so I don't dare to use my former account because I don't want to be blamed again. I made a new account but it seems I can't comment more than 5 times a day. In order to answer your question, I had to log in with my former account.

This is a Bayesian approach applied in EBSeq-HMM for time course studies. According to what they say, "Under a target FDR = 0.05, we call genes with PP(remain constant)< 0.05 as DE genes".

PP: posterior probability of being each expression path

FDR: I'm not sure but I guess that's FDR-adjusted P-values.

anyway, the output is a matrix with columns: DE genes' names, their most likely expression path (like UP-UP-UP), PP.

I'm not sure what you mean by priors. I just know I have prior probabilities of being each path as well.

I think DEGs with highest PPs, are the most significant ones. Because when you ask the model to give you the head genes, it gives you the genes with the highest PPs.

ADD REPLY • link 7.3 years ago by statfa ▴ 760

1

Entering edit mode

sorry to hear about your bad experience. ive always like biostars because i think the community is much more reasonable and nice. imagine asking a simple question on stack exchange, jesus they would murder us all if they could. :)

ill look around and see if i cant find any filtering criteria for EBSeq-HMM, as im always interested in learning new methods.

ADD REPLY • link 7.3 years ago by mforde84 ★ 1.4k

0

Entering edit mode

Yes, I like biostars. People here are knowledgeable and kind. And you're right about stack exchange :)

Thank you again for your help. I'll appreciate that.

ADD REPLY • link 7.3 years ago by statfa ▴ 760

0

Entering edit mode

I couldn't find the thread in which you claim to be blamed for asking simple questions (which are in my opinion not bad). But you shouldn't be blamed for those, there is no shame in asking simple questions provided that you show you have put an effort in the topic yourself.

ADD REPLY • link 7.3 years ago by WouterDeCoster 47k

0

Entering edit mode

Here on this thread: C: Sending files to Galaxy

Maybe that wasn't a blame and I was just sensitive. I will use my real account again.

ADD REPLY • link 7.3 years ago by statfa ▴ 760

1

Entering edit mode

I think you misinterpreted that comment, I see it as a helpful suggestion.

ADD REPLY • link 7.3 years ago by WouterDeCoster 47k

0

Entering edit mode

After scaling the normalized counts, my heatmap has changed significantly.

ADD REPLY • link 7.3 years ago by hougiotaejut ▴ 30

1

Entering edit mode

yea, it's just trial and error. i ran into the exact same issue when i started with this type of analysis. happy to help.

ADD REPLY • link 7.3 years ago by mforde84 ★ 1.4k

0

Entering edit mode

Sorry, I just noticed that if I transpose my matrix of read counts and then scale them, the result differs from when I don't transpose my matrix. Why have you transposed the matrix? Updated I found the answer. You can either scale the matrix by row or column. "pheatmap" package in R can scale the matrix and draw the heatmap.

ADD REPLY • link 7.3 years ago by statfa ▴ 760

score 1 · Answer 2 · 2017-01-05

1

Entering edit mode

7.3 years ago

WouterDeCoster 47k

To answer your question 3: it's rather pointless to visualise thousands of genes because that won't tell you anything. You have to make a selection of top hits.

For question 4: any normalisation will probably do.

ADD COMMENT • link 7.3 years ago by WouterDeCoster 47k

0

Entering edit mode

Thank you very much for the info

ADD REPLY • link 7.3 years ago by hougiotaejut ▴ 30