Question: Best Way to Normalize RNA-Seq Data for Expression Profiling
6.8 years ago by
University of Iowa
ajc8 (120) wrote:


I have transcript counts from RNA-seq data. There are three samples, and they are biological replicates of a cell line. My goal is to provide a ranked list of expressed genes, with some sort of expression quantifier for each gene/transcript.

I am wondering the best way to normalize the data - just calculate RPKM values and then remove outliers per transcript? Or should I perform some sort of upper quantile normalization? If so, what is the best way to do this?
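For reference, here is a rough Python sketch of what upper-quartile scaling could look like for three replicates. It is only an illustration under assumptions: the sample names and counts are made up, zero counts are excluded before taking the 75th percentile, and the function names are hypothetical rather than taken from any package.

```python
from statistics import quantiles

def upper_quartile(counts):
    """75th percentile of the non-zero counts in one sample."""
    nonzero = [c for c in counts if c > 0]
    return quantiles(nonzero, n=4, method="inclusive")[2]

def upper_quartile_normalize(samples):
    """Rescale each sample so that all samples share the same upper quartile.

    samples: dict mapping sample name -> list of counts (same transcript order).
    """
    uqs = {name: upper_quartile(counts) for name, counts in samples.items()}
    mean_uq = sum(uqs.values()) / len(uqs)  # common target value
    return {
        name: [c * mean_uq / uqs[name] for c in counts]
        for name, counts in samples.items()
    }

# Hypothetical toy counts: the replicates differ only by sequencing depth.
samples = {
    "rep1": [0, 10, 20, 30, 40],
    "rep2": [0, 20, 40, 60, 80],
    "rep3": [0, 5, 10, 15, 20],
}
normalized = upper_quartile_normalize(samples)
for name, values in normalized.items():
    print(name, [round(v, 2) for v in values])
```

Because the toy replicates differ only by a depth factor, the scaled values come out (approximately) identical across the three samples. Real replicates will not line up this cleanly.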

Thank you for your help!

normalization rna-seq • 36k views
modified 5.3 years ago by Amitm (1.9k) • written 6.8 years ago by ajc8 (120)

I am probably stating the obvious, but RNA-seq is not a measure of absolute expression. It is closer to absolute expression than microarrays (probably), but comparisons between genes (ranking by expression) should be taken with multiple, large grains of salt.

written 6.8 years ago by Sean Davis (26k)

RPKM between replicates should be fine, but if you want to optimize, check "Optimal Scaling of Digital Transcriptomes".

written 6.8 years ago by JC (11k)
6.8 years ago by
New Zealand
rtliu (2.1k) wrote:

Please check out Professor Lior Pachter's keynote at the Genome Informatics 2013 meeting at CSHL.

Quoted from slide 45:

"The problem with FPKM:

Although abundances in FPKM are proportional to the relative abundances, the proportionality constant is experiment specific.

Li and Dewey go back to the basics in the RSEM paper (BMC Bioinformatics, 2011). Instead of RPKM/FPKM, why not use a universal proportionality constant? Instead of *, they propose TPM

Please use TPM in your papers!"
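To make the quoted point concrete, here is a small Python sketch (my own illustration, not taken from the slides or from RSEM). It uses the usual definitions: FPKM divides by the total fragment count and transcript length, while TPM divides each length-normalized rate by the sum of all rates. The counts, lengths, and function names are hypothetical.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total = sum(counts)
    return [c * 1e9 / (total * length) for c, length in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    """Transcripts per million: per-base rates rescaled to sum to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    scale = 1e6 / sum(rates)
    return [r * scale for r in rates]

# Hypothetical data: the same three transcripts measured in two experiments.
lengths_bp = [1000, 2000, 4000]
sample_a = [100, 200, 400]
sample_b = [100, 200, 4000]

for name, counts in (("A", sample_a), ("B", sample_b)):
    print(
        name,
        "sum of TPM:", round(sum(tpm(counts, lengths_bp))),
        "sum of FPKM:", round(sum(fpkm(counts, lengths_bp))),
    )
```

The TPM values sum to one million in both samples, while the FPKM totals differ from sample to sample, which is the "experiment specific" constant the slide refers to.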

written 6.8 years ago by rtliu (2.1k)

Lior Pachter's slides and presentation are helpful. Here is another helpful video from one of his students, with even more explicit discussion of FPKM and TPM:

written 4.6 years ago by Kamil (2.0k)

Although I agree with this in theory, I think assigning reads to the right transcripts is not trivial (whereas gene-level quantifications are very consistent among different tools).

As read length increases and coverage becomes more even, I might change my mind, but I think RPKM is currently the most practical strategy, and tools using RPKM measurements (such as limma) are known to provide accurate results. For example, you can check out these differential expression comparisons:

"In general, limma performed well under many circumstances in the present comparisons, being also computationally fastest to run."

"We find significant differences among the methods, but note that array-based methods adapted to RNA-seq data perform comparably to methods designed for RNA-seq."

*Recommend Partek or DESeq over edgeR or cuffdiff

modified 6.8 years ago • written 6.8 years ago by Charles Warden (7.9k)
5.3 years ago by
Amitm (1.9k) wrote:

Sorry to bump an old thread, but I have been struggling with non-replicate RNA-seq data for a long time. So, in relation to the previous post, I don't think conversion to Z-scores is a valid idea.

A Z-score assumes an underlying normal distribution, which is not at all the case with RNA-seq data.

written 5.3 years ago by Amitm (1.9k)
6.8 years ago by
Charles Warden7.9k
Duarte, CA
Charles Warden (7.9k) wrote:

RPKM should be fine.

Sometimes I've seen samples with different median RPKM values, but trying to median center (or quantile normalize) the data ended up making the resulting gene lists worse (with regard to functional enrichment and positive controls).
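For anyone unfamiliar with the two adjustments I mean, here is a rough Python sketch. It is a generic illustration, not the exact procedure I ran: the input is assumed to be log2-scale RPKM values, the numbers are made up, the function names are hypothetical, and ties in quantile normalization are simply broken by position.

```python
from statistics import median

def median_center(samples):
    """Shift each sample so that all samples share the same median."""
    medians = [median(s) for s in samples]
    target = sum(medians) / len(medians)
    return [[v - m + target for v in s] for s, m in zip(samples, medians)]

def quantile_normalize(samples):
    """Force every sample to have the same distribution of values.

    The k-th smallest value in each sample is replaced by the mean of the
    k-th smallest values across all samples.
    """
    n = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    reference = [
        sum(s[k] for s in sorted_samples) / len(samples) for k in range(n)
    ]
    result = []
    for s in samples:
        order = sorted(range(n), key=lambda i: s[i])  # indices by rank
        normalized = [0.0] * n
        for rank, i in enumerate(order):
            normalized[i] = reference[rank]
        result.append(normalized)
    return result

# Hypothetical log2(RPKM) values: two samples, three genes.
samples = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
mc = median_center(samples)
qn = quantile_normalize(samples)
print("median centered:", mc)
print("quantile normalized:", qn)
```

Quantile normalization keeps only the rank of each gene within its sample, so any real difference in the shape of the distributions between samples is erased. That may be part of why it hurt in the cases I mentioned.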

written 6.8 years ago by Charles Warden (7.9k)
6.8 years ago by
vj (450) wrote:

If your goal is just to rank the genes based on their expression, then you could try converting them to z-scores. You can use the "scale" function in R to do it.
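If you are not working in R, the same calculation is easy to sketch in Python. This mirrors what scale() does by default for a single column (subtract the mean, divide by the sample standard deviation with n - 1 in the denominator). The gene names and values are made up, and z_scores is a hypothetical helper name.

```python
from statistics import mean, stdev

def z_scores(values):
    """Center on the mean and divide by the sample standard deviation."""
    mu = mean(values)
    sd = stdev(values)  # n - 1 denominator, as in R's scale()
    return [(v - mu) / sd for v in values]

# Hypothetical expression values for three genes in one sample.
expression = {"geneA": 2.0, "geneB": 4.0, "geneC": 6.0}

z = dict(zip(expression, z_scores(list(expression.values()))))
ranked = sorted(z, key=z.get, reverse=True)

print(z)       # z-scores of -1, 0 and 1 for this toy input
print(ranked)  # genes from highest to lowest z-score
```

Note that z-scoring across genes within one sample is a monotone transformation, so the ranking is the same as ranking the original values. The z-score only changes the scale of the quantifier.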

written 6.8 years ago by vj (450)

I'm not a statistician, but I remember that z-scores assume the data are normally distributed. Right? I think RNA-seq data don't follow a normal distribution, and hence you will have a significant error in your estimation.

written 4.4 years ago by microbe7730