Question: Draw Heatmap Or Do PCA Analysis With Raw Read Counts?
7.7 years ago by
camelbbs670 wrote:


I have a question about exploring RNA-seq data starting from raw read counts. After I get the raw read counts from htseq-count or a similar tool, how should I normalize them? I can use "counts(cds,normalized=T)" in DESeq to get normalized data. But the counts still need to be normalized by gene length, right?

Do I need to use RPKM values generated by Cufflinks to draw a heatmap or perform PCA? Or can raw read counts be used for that?
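For readers unsure what the DESeq call in the question does: DESeq's normalized counts are raw counts divided by a per-sample size factor, estimated with the median-of-ratios method, and they are not adjusted for gene length. The sketch below illustrates that idea in plain Python rather than R. The function names and the toy matrix are made up for illustration, and this is not DESeq's actual code.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of genes, each a list of raw counts (one per sample).
    Genes with a zero in any sample are skipped, because their
    geometric mean across samples is zero.
    Raises statistics.StatisticsError if no gene is usable.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if all(c > 0 for c in row):
            # log of the geometric mean of this gene across samples
            log_geo_mean = sum(math.log(c) for c in row) / n_samples
            for j, c in enumerate(row):
                log_ratios[j].append(math.log(c) - log_geo_mean)
    # One factor per sample: the median ratio to the geometric means
    return [math.exp(median(r)) for r in log_ratios]


def depth_normalize(counts):
    """Divide every count by its sample's size factor."""
    factors = size_factors(counts)
    return [[c / f for c, f in zip(row, factors)] for row in counts]


# Hypothetical toy data: 4 genes x 2 samples, sample 2 sequenced twice as deep
toy_counts = [
    [10, 20],
    [100, 200],
    [5, 10],
    [0, 3],  # skipped when estimating factors because of the zero
]
factors = size_factors(toy_counts)
normalized = depth_normalize(toy_counts)
```

In this toy case the second sample's size factor comes out at twice the first's, so the first three genes end up with equal normalized values in both samples.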



rna-seq • 7.8k views
modified 7.7 years ago by Irsan7.2k • written 7.7 years ago by camelbbs670
7.7 years ago by
Obi Griffith18k
Washington University, St Louis, USA
Obi Griffith18k wrote:

This is a good question and I look forward to reading anyone else's answer. My thought is that you certainly can run PCA and create heatmaps (presumably you mean with the typical hierarchical clustering) on raw read counts. But you must interpret them in that context. If your libraries are of similar depth, then normalizing for read depth may not matter much. And if your PCA or heatmap/clustering analysis is mostly focused on the relationships between samples, then normalizing for gene size won't matter as much. However, if the libraries have dramatically different depths, this could certainly affect your clustering results (although that will depend heavily on the distance metric you use). Similarly, if you are interested in how genes relate to each other, you will probably want to normalize for gene size. Calculating an RPKM matrix from your raw read counts is very easy if you have the gene lengths. Why not run all of them (raw, RPKM, and maybe some other normalization schemes) through your heatmap and PCA analysis and compare the results with the above caveats in mind? It will probably be educational and teach you something about your data.
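As a rough illustration of the RPKM calculation mentioned above: RPKM is the read count divided by the gene length in kilobases and by the library size in millions of reads. The sketch below is plain Python with made-up names and toy numbers. It assumes the library size can be approximated by the column sum of the count matrix; in practice you may prefer the total number of mapped reads reported by your aligner.

```python
def rpkm(counts, gene_lengths_bp):
    """Reads per kilobase of gene per million reads.

    counts: list of genes, each a list of raw counts (one per sample).
    gene_lengths_bp: one length in base pairs per gene, same order.
    Assumption: library size = sum of counted reads per sample.
    """
    n_samples = len(counts[0])
    totals = [sum(row[j] for row in counts) for j in range(n_samples)]
    return [
        # count / (length / 1e3) / (total / 1e6) == count * 1e9 / (length * total)
        [c * 1e9 / (length * totals[j]) for j, c in enumerate(row)]
        for row, length in zip(counts, gene_lengths_bp)
    ]


# Hypothetical toy data: 2 genes x 2 samples
toy_counts = [
    [500, 1000],   # 1 kb gene
    [1500, 3000],  # 3 kb gene
]
toy_lengths = [1000, 3000]
values = rpkm(toy_counts, toy_lengths)
```

Here the longer gene collects three times as many reads and the second sample is twice as deep, yet all four RPKM values are identical, which is the correction RPKM is meant to provide.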

written 7.7 years ago by Obi Griffith18k

Thanks. As you said, I think that in our case normalizing for read depth is necessary and normalizing for gene size is not. I will try both.
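To make the point about read depth and distance metrics concrete, here is a small sketch in plain Python with invented toy samples. It compares Euclidean distance with a correlation-based distance (1 minus Pearson's r) on raw counts. The samples and function names are hypothetical, and real data will behave less cleanly.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two count vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x, y):
    """1 - Pearson correlation; insensitive to a global scale factor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1 - cov / (sd_x * sd_y)


# Hypothetical raw counts for five genes
sample_a = [10, 200, 35, 0, 80]
sample_b = [2 * c for c in sample_a]  # same profile as A, twice the depth
sample_c = [12, 150, 60, 5, 70]       # different profile, similar depth to A

d_euclid_ab = euclidean(sample_a, sample_b)
d_euclid_ac = euclidean(sample_a, sample_c)
d_corr_ab = pearson_distance(sample_a, sample_b)
d_corr_ac = pearson_distance(sample_a, sample_c)
```

On raw counts, Euclidean distance places A closer to C than to B, even though B is just A at twice the depth, while the correlation distance between A and B is essentially zero. This is one reason depth normalization matters more for some clustering setups than for others.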

written 7.7 years ago by camelbbs670
7.7 years ago by
Irsan7.2k wrote:

It has been suggested that normalizing by calculating RPKM is not enough, because GC-content effects can be sample-specific, and because longer genes have lower variance between samples and therefore generate lower p-values in significance testing. Have a look at this paper about RNA-seq normalization. But it's definitely worth just trying all the possibilities and making some diagnostic plots.

modified 7.7 years ago • written 7.7 years ago by Irsan7.2k

Thanks. It seems normalization of RNA-seq data is a complex question. If RPKM is not good enough, how should I normalize raw read counts so that they make sense in a heatmap? Maybe the R package from this paper works well.

modified 7.7 years ago • written 7.7 years ago by camelbbs670

Just try all of them: no normalization, normalizing for transcript length, normalizing for transcript length and total mapped reads in the sample, normalizing with packages N1...NX, and see if you can discover any biases towards GC content, sequencing lane, transcript length, and so on. But the obvious expression differences will be clear without extensive normalization, so don't spend months just to take your analysis from an A++ to an A+++.
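One simple way to look for a transcript-length bias, as suggested above, is to correlate log transcript length with log expression under each normalization. The sketch below uses plain Python and invented numbers. The function names are hypothetical, and a scatter plot of the same two quantities would usually tell you more than a single coefficient.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def length_bias(values, lengths_bp):
    """Correlation between log2 length and log2(value + 1) for one sample."""
    log_len = [math.log2(length) for length in lengths_bp]
    log_val = [math.log2(v + 1) for v in values]
    return pearson(log_len, log_val)


# Hypothetical toy sample: raw counts that grow with transcript length
toy_lengths = [500, 1000, 2000, 4000, 8000]
raw_counts = [55, 90, 210, 380, 820]
# Same sample after dividing by length in kilobases
per_kb = [c / (length / 1000) for c, length in zip(raw_counts, toy_lengths)]

bias_raw = length_bias(raw_counts, toy_lengths)
bias_per_kb = length_bias(per_kb, toy_lengths)
```

In this toy example the raw counts correlate strongly with length and the length-scaled values do not. The same check can be repeated with GC content or sequencing lane in place of length.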

written 7.7 years ago by Irsan7.2k

Sorry, I know this is an old thread. Trying everything can make it too easy to fall into confirmation bias. There really ought to be a good reason for each approach that you run.

written 6.6 years ago by Adamc640